August 2017 – Rik van Achterberg

If your goal is web scraping using Python, there are several great libraries you can use, like RoboBrowser, BeautifulSoup, and most notably, Scrapy. The downside of all those libraries is their lack of support for JavaScript. If any of the pages you want to scrape depend on JavaScript code execution, you are out of luck. This is where Selenium (or: WebDriver) comes in handy.

Selenium is basically a browser automation framework. With Selenium, you can programmatically control a browser. A big use case for this is UI testing or integration testing, but it works just as well for scraping.
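
To make "programmatically control a browser" concrete, here is a minimal sketch. The helper name `rendered_html` and the example URL are placeholders of mine; `get()`, `page_source` and `quit()` are standard calls in Selenium's Python bindings. It assumes Firefox and a matching driver are installed.

```python
def rendered_html(driver, url):
    """Load `url` in the browser and return the current DOM as HTML."""
    driver.get(url)            # by default, waits for the browser to report the page as loaded
    return driver.page_source  # typically reflects the DOM after scripts have run


def main():
    # Imported here so the helper above can be reused without a browser installed.
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumption: Firefox and its driver are available
    try:
        html = rendered_html(driver, "https://example.com/")  # placeholder URL
        print(len(html), "characters of rendered HTML")
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling `main()` opens a browser window, loads the page and prints the size of the rendered document. From there you can hand the HTML to whatever parser you already use.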

My first (personal) project using Selenium was an automated bidding robot for certain online bidding sites. Since then I've used Selenium on several custom scraping projects. I'll share some insights I've gained from these projects.

Which browser to use

Assuming you are running Linux, the two main choices are Mozilla Firefox and Google Chrome (or Chromium). A third option is Opera (but let's be honest, who still uses Opera?).
Both Firefox and Chrome support WebDriver, but there are some caveats.

Google Chrome (and Chromium) contains a bug that renders the browser instance unusable after a timeout exception (TimeoutException in the Python bindings) is thrown. This might sound sensible, but be aware that a timeout can occur any time a web page does not reach 100% readiness. Any JS still executing? Timeout. Waiting for (external) assets? Timeout, and you get to restart the browser. Firefox, on the other hand, will raise the same exception, but it can safely be ignored.
From Firefox 48 on, Mozilla has dropped WebDriver support… Mozilla is working on its own built-in alternative, called Marionette. Mozilla offers an adapter-like tool called Geckodriver, which lets you use Marionette through the WebDriver interface, but (currently) the support is far from complete. One option is to use Firefox 45 ESR (long term support) instead. Another option is to use Marionette natively, but its client does not support Python 3 (a huge sin, in my opinion).
Do you want to run a headless browser? You can use PhantomJS with Ghostdriver. Unfortunately, PhantomJS is not being actively maintained, so you risk using an outdated JS engine.
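
The Chrome timeout caveat above can be worked around by throwing away the browser after a timeout and starting a new one. The sketch below shows one way to do that. `get_with_restart` and the `make_driver` factory are hypothetical names of mine; `set_page_load_timeout()`, `get()`, `quit()` and `TimeoutException` come from Selenium's Python bindings.

```python
try:
    from selenium.common.exceptions import TimeoutException
except ImportError:
    # Fallback so the sketch can be read and exercised without Selenium installed.
    class TimeoutException(Exception):
        pass


def get_with_restart(driver, url, make_driver, timeout=30):
    """Load `url`; on a page-load timeout, replace the browser with a fresh one.

    Returns (driver, loaded): the driver to keep using, and whether the load finished.
    `make_driver` is a hypothetical zero-argument factory, e.g. `webdriver.Chrome`.
    """
    driver.set_page_load_timeout(timeout)  # seconds
    try:
        driver.get(url)
        return driver, True
    except TimeoutException:
        # The old instance may be unusable after a timeout, so discard it.
        try:
            driver.quit()
        except Exception:
            pass  # it may already be unresponsive
        return make_driver(), False
```

With Firefox you could skip the restart and simply catch the exception and carry on with the same instance, as described above.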

If you are using Firefox, there are a few tricks to make it run faster:

from selenium import webdriver

profile = webdriver.FirefoxProfile()

# Prevent loading of default page
profile.set_preference("browser.startup.homepage_override.mstone", "ignore")
profile.set_preference("startup.homepage_welcome_url.additional", "about:blank")

# Don't load images
profile.set_preference("permissions.default.image", 2)

# Prevent waiting for full page loading
profile.set_preference("webdriver.load.strategy", "fast")

# Increase persistent connections
profile.set_preference("network.http.max-persistent-connections-per-proxy", 255)
profile.set_preference("network.http.max-persistent-connections-per-server", 255)
profile.set_preference("network.http.max-connections", 255)

# Start the browser with this profile (keyword argument as used by Selenium 3)
browser = webdriver.Firefox(firefox_profile=profile)

# Kill long-running scripts (timeout in seconds)
browser.set_script_timeout(10)
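
One consequence of not waiting for full page loading is that an element may not exist yet when your script looks for it. Selenium ships `WebDriverWait` for this. The sketch below is a dependency-free stand-in that shows the same polling idea; `wait_until` and its parameters are my own names, not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)  # back off before polling again
```

For example, `wait_until(lambda: browser.find_elements_by_css_selector("#results"))` would poll until at least one matching element shows up. The selector is a placeholder, and newer Selenium releases spell the lookup `find_elements(By.CSS_SELECTOR, ...)`.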

Chrome has built-in Flash support. If you need Flash for some reason (I hope you won't), you'll have to install it explicitly when using Firefox.

Running headless

Or: Running Selenium on a server. As mentioned before, PhantomJS is a headless browser that can be used with Selenium. However, it’s pretty trivial to run Firefox or Chrome headlessly on Linux using Xvfb (X virtual framebuffer).

You can set this up manually, or use the handy Python wrapper PyVirtualDisplay. Check out this example code:

from pyvirtualdisplay import Display

display = Display(visible=0, size=(1920, 1080))
display.start()
# A virtual display is now running on a free display number like :1001
# (consult the display.display property to check the actual number).
# A new browser instance will automatically use the first available display, so just start a browser.
# Don't forget to stop the display after usage.
display.stop()
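
Forgetting `display.stop()` leaves an Xvfb process behind, which adds up quickly on a server. A small sketch of a wrapper that guarantees the stop, even when the scraping code raises. The name `started` is mine; it only relies on the `start()` and `stop()` methods shown above.

```python
from contextlib import contextmanager


@contextmanager
def started(display):
    """Start `display`, yield it, and stop it again even if the body raises."""
    display.start()
    try:
        yield display
    finally:
        display.stop()


def main():
    # Imported lazily; assumes pyvirtualdisplay, Xvfb, Selenium and Firefox are installed.
    from pyvirtualdisplay import Display
    from selenium import webdriver

    with started(Display(visible=0, size=(1920, 1080))):
        browser = webdriver.Firefox()  # runs inside the virtual display
        try:
            browser.get("https://example.com/")  # placeholder URL
            print(browser.title)
        finally:
            browser.quit()
```

The nested `try`/`finally` closes the browser first and the display second, which is the reverse of the order in which they were started.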

If you want to check out what is actually going on inside your display, there are several options.

Option 1: Take a screenshot of the virtual display using ImageMagick’s “import” tool:

import subprocess
subprocess.check_call(["/usr/bin/import", "-window", "root", "-display", ":" + str(display.display), "-screen", "/path/to/output.png"])

Option 2: Run a VNC server inside the virtual display, like x11vnc:

apt-get -y install x11vnc
x11vnc -forever -display :1001

Now use a VNC client to connect to localhost:5900.

To Be Continued.

Scraping with Selenium and Python