Friday, September 16, 2011

Notes - Web scraping AJAX-obfuscated websites

I'm using a Perl script that calls the Selenium server, then I use Beautiful Soup to parse the rendered output coming from Selenium. This could probably be done more cleanly with WebKit, but this way is quick and easy to set up.
Just wanted to write this note so I don't forget the pipeline I'm using. :)

Actually, there's no need to fetch the data with Perl first: Selenium RC (Remote Control) can export to Python, and from there it is enough to get the source and parse it.

To get the rendered source in Python, use:
data = sel.get_html_source()

If you're using Perl:
my $data=$sel->get_html_source();
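
Once the rendered source is in `data`, the parsing step could look roughly like the sketch below. To keep it runnable without extra packages it uses Python's standard `html.parser` as a stand-in for Beautiful Soup, and the sample string is made up to take the place of what `get_html_source()` would return. The names `LinkCollector` and `extract_links` are only illustrative.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a href=...> in a document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, link text) tuples found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up stand-in for the string returned by sel.get_html_source()
data = (
    '<html><body><div id="results">'
    '<a href="/item/1">First</a> <a href="/item/2"> Second </a>'
    "</div></body></html>"
)
links = extract_links(data)
```

The point is that the parser only ever sees the HTML after the browser has run the page's JavaScript, so content injected by AJAX is already in the string by the time it is parsed.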

1 comment:

Website Spider Software said...

Hello Dude,

Most crawling frameworks used for scraping cannot be used for Javascript or Ajax. Their scope is limited to those sites that show their main content without using scripting. One would also be tempted to connect a specific crawler to a Javascript engine but it’s not easy to do. Thanks.....


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.