This is an update to my previous post about running RSelenium for Mac. Selenium is a piece of software to enable web designers to test out their code against different browsers. I want to use it for web-scraping, that is, automating the process of getting data from websites. In my case, I want to scrape repositories of regulatory information. Since I want to complete analysis of the data I scrape within the R environment, I would like to run Selenium from within R. There’s a package for that: RSelenium.
I had this all up and running perfectly a year ago, but I came to use it yesterday and everything has changed! It took me several hours to figure out the solution, so I thought I would document it here and save someone else the trouble.
If you update to the latest version of RSelenium you will discover that the directory that used to contain the standalone Selenium server is empty except for a text file. The content of this text file is “Dummy directory for user to install selenium-server-standalone jar.” Don’t bother going and getting the standalone server software; there is a better way: Docker.
While we are doing updates, I found I had to manually force an update of the caTools and bitops packages. These are packages that RSelenium depends upon, but somehow they don’t get properly updated even if you select “Install Dependencies” when you update RSelenium. When you run library(RSelenium) you may get the following error message:
Error : .onLoad failed in loadNamespace() for ‘RSelenium’, details:
blah, blah, blah
“Reason: Image not found”
“Error: package or namespace load failed for ‘RSelenium'”
If so, uninstall caTools and bitops and install them again.
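Here is a minimal sketch of that reinstall step, using only base R's `installed.packages()`, `remove.packages()` and `install.packages()`. The helper name `reinstall_packages` is my own invention for illustration, not something RSelenium provides, and I am assuming a default CRAN mirror is already configured.

```r
# Hypothetical helper (not part of RSelenium): remove the named packages
# if they are installed, then install fresh copies from the configured repo.
reinstall_packages <- function(pkgs) {
  inst <- installed.packages()
  # Only try to remove packages that are actually present;
  # remove.packages() errors on a package it cannot find.
  for (pkg in intersect(pkgs, rownames(inst))) {
    remove.packages(pkg, lib = inst[pkg, "LibPath"])
  }
  install.packages(pkgs)
}
```

You would call it as `reinstall_packages(c("caTools", "bitops"))`, and it is probably safest to restart R before trying `library(RSelenium)` again.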
Docker is a really convenient way of accessing all kinds of server software. I gather the idea is to create containers to run software safely in. But more than that, you can pick a particular version and settings that get loaded when you run (instantiate?) the software. When you run a Selenium server, you tell it what type of browser to use, what port, etc. By using Docker and a prebuilt Docker image, some or all of this is taken care of for you, and you get the latest version of the software to boot.
The most important reason to use the Docker images is that they include geckodriver and chromedriver, two key pieces of software you need to make the connection between Selenium and the browser. This was never an issue before, so the architecture of Selenium must have changed. I fiddled around for hours trying to run these drivers directly with no luck. This is the way to go.
So, go to Docker, download and install Docker for Mac. It will show up in your applications folder.
Running Selenium Server
If you explore the Selenium Docker Hub you will see a variety of ways of running Selenium using Docker.
The ones I am interested in are either selenium/standalone-firefox or selenium/standalone-chrome, and here’s how you get at them:
- Run Docker – a whale icon will appear in the menu bar at the top of the screen. Once it stops dancing around, it’s done. Click the icon to see the green dot and the message “Docker is running”.
- Open a Terminal window on your Mac. You will find Terminal in the Utilities folder under Applications.
- Execute the following command: docker run -d -p 4445:4444 selenium/standalone-firefox
- To check the server is running execute the following command: docker ps
- To shut down the server: docker stop $(docker ps -q)
The entire command sequence above should look something like the image below:
If that’s what you see, you are good to go: both Selenium and the required drivers are up and running. The example I showed is for Firefox; Chrome will be similar.
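If you would rather not leave R, the same `docker run` command can in principle be issued through base R's `system2()`. The sketch below only builds the argument vector; `selenium_run_args` is a hypothetical helper of mine, and the image names and the 4445:4444 port mapping are the ones used in this post.

```r
# Hypothetical helper: build the arguments for `docker run` so that
# host_port on the Mac is forwarded to port 4444 inside the container.
selenium_run_args <- function(browser = "firefox", host_port = 4445L) {
  browser <- match.arg(browser, c("firefox", "chrome"))
  c("run", "-d",
    "-p", sprintf("%d:4444", host_port),
    paste0("selenium/standalone-", browser))
}

# Show the command line that would be executed for each browser.
cat("docker", selenium_run_args("firefox"), "\n")
cat("docker", selenium_run_args("chrome"), "\n")
```

With Docker running, `system2("docker", selenium_run_args("chrome"))` should start the Chrome image instead of the Firefox one. I would expect the R side to need `remoteDriver(port = 4445L, browserName = "chrome")` in that case, since `remoteDriver()` defaults to Firefox, but I have only tried the Firefox route myself.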
RSelenium for Mac
To check all is well, run an instance of Selenium as above (except for the “stop” command), then run an instance of R and issue the following commands:
    library(RSelenium)
    remDr <- remoteDriver(port=4445L)
    remDr$open()
    remDr$getStatus()
    remDr$navigate("https://www.google.com/")
    remDr$getCurrentUrl()
    remDr$close()
After you shut down the Selenium server (see command above) the R command remDr$getStatus() should just return an error message.
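If you want a script to check for the server rather than stop on that error, one option is to wrap the status call in `tryCatch()`. This is only a sketch: `server_is_up` is a name I made up, and it assumes the driver object exposes a `getStatus()` method that raises an R error when the server cannot be reached, as described above.

```r
# Hypothetical helper: TRUE if the driver's getStatus() call succeeds,
# FALSE if it raises an error (for example, connection refused).
server_is_up <- function(driver) {
  tryCatch({
    driver$getStatus()
    TRUE
  }, error = function(e) FALSE)
}

# Only attempt the real check when RSelenium is installed.
if (requireNamespace("RSelenium", quietly = TRUE)) {
  remDr <- RSelenium::remoteDriver(port = 4445L)
  message("Selenium server reachable: ", server_is_up(remDr))
}
```

Run before `docker stop` this should report that the server is reachable, and afterwards that it is not, without halting the rest of your script.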
So that brings us up to date on RSelenium. I hope to start publishing a regular post using data from my web-scraping activities – a league table of $AUM for the largest hedge funds filing as RIAs with the SEC. Look out for it!