I have been doing research on the largest hedge funds by ploughing through their regulatory filings. The research itself will be the basis of some future posts. The manual process was so time-consuming that I decided to look into automating it using tools available in R. Basically, I wanted to set up a file with a list of firms and have a script run through each firm and download the webpage containing the filing for that firm and parse it for the information I want.
Important Note: 2016-01-15
The setup and installation information in this article is out of date as a result of the development of Selenium version 3. The description of useful functions and the general approach is still applicable. Please visit the updated article.
This led me into the fascinating world of web scraping.
All browsers allow you to view a web page in a raw or close to raw form. In Chrome you right-click and select “View page source” to see the raw html code, or “Inspect” to get a much more useful and interactive insight into the elements of the page. If you are attempting to scrape websites, you will spend a lot of time in “Inspect” to find names of various elements and the underlying structure of the page.
Using the Tools Available in the Basic Installation of R
To start out I was using the IAPD website (IAPD Search), where the public can view every RIA’s current ADV filing. R includes download.file() in the “utils” package, which is capable of downloading the html code from any given web address. The result looks like what you see when you right-click a webpage in Chrome and choose “View Page Source”. You can then use R’s limited string-handling functions to find tags (such as tables, table rows and table data) and tease out the data you want.
It’s clumsy, but doable. I wrote a series of functions that enabled me to quickly find tables and pull out a particular row or column, or the contents of a particular cell. One of the issues you have to come to terms with is that the SEC presents ADVs as a set of nested tables. So you get as close as you can using some key piece of text, such as a heading for a section of the form, and then burrow down to the table you really want. This makes for fragile scripts that break down each time they meet a new situation.
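To make this concrete, here is a minimal sketch of the base-R approach. The HTML snippet and the regular expression are illustrative stand-ins; a real filing page is far messier, and the example.com URL in the comment is not a real filing address:

```r
# Illustrative only: a stand-in for a downloaded filing page.
html <- paste0(
  "<table><tr><td>Firm Name</td><td>Acme Advisors</td></tr>",
  "<tr><td>AUM</td><td>$1,234,000,000</td></tr></table>"
)

# In practice you would fetch the page first, e.g.:
# download.file("https://example.com/filing.html", destfile = "filing.html")
# html <- paste(readLines("filing.html"), collapse = "")

# Pull the contents of every <td> cell using base string functions
cells <- regmatches(html, gregexpr("<td>[^<]*</td>", html))[[1]]
cells <- gsub("</?td>", "", cells)  # strip the surrounding tags
cells  # "Firm Name" "Acme Advisors" "AUM" "$1,234,000,000"
```

Fragility shows up immediately: change the page layout slightly (an attribute on the td tag, say) and the pattern no longer matches.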
Using the XML Package
The XML package is big and powerful (the documentation runs 170+ pages). Not only can you use it to inspect html/xml code, but you can build html/xml as well. I assume this is to support running R code on a website and building the results into the webpage on the fly.
I do not have sufficient knowledge of the inner workings of xml, html, DOM, DTD, SAX, catalogs, Solr, etc., to understand the full power of this package, but I was able to explore a few of the functions that were useful for the task at hand.
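For example, the package can parse a page into a queryable tree, run XPath queries against it, and extract tables in one call. This sketch runs on a stand-in snippet rather than a live page:

```r
library(XML)

# stand-in for a downloaded filing page
html <- "<html><body><table><tr><td>Pool</td><td>P012345</td></tr></table></body></html>"

doc <- htmlParse(html, asText = TRUE)             # parse into a queryable tree
tbl <- readHTMLTable(doc, which = 1,              # pull the first table out
                     stringsAsFactors = FALSE)    # as a data frame
vals <- sapply(getNodeSet(doc, "//td"), xmlValue) # or query cells via XPath
vals  # "Pool" "P012345"
```

The XPath route is far more robust than regular expressions over raw html, since it survives changes in whitespace and attributes.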
However, I can’t seem to locate any functions for sending information to a form or for remotely driving a web site. This means you have to construct URLs and go to the page rather than locating and clicking on links.
Unless I am missing something, this package is not really designed with web-scraping in mind.
Using the RSelenium Package
RSelenium is a package with a set of functions that allow you to take control of a browser. You can create a script that runs through a set of user actions just as though you were operating a web browser to view and explore a web site. I think the intent was for it to be used both for scraping and for testing web pages.
You install the package just as you would any other package. To get it to work effectively on a Mac, here is what I do: find the R installation in the Mac system library (hold Option while clicking the Go menu in Finder to reveal the hidden Library directory). You need the Library under Macintosh HD rather than the one under Users. Drill down via /Library/Frameworks/R.framework/Versions/3.2/Resources/library/RSelenium/bin to selenium-server-standalone.jar and double-click it. This runs the Selenium server in standalone mode. There is a start-server function, but on the Mac it doesn’t seem to work.
Then in your script start out with:
library("RSelenium")
remDr <- remoteDriver()
message("Opening firefox browser session ...")
This creates an object remDr of class “remoteDriver”. The default is a firefox remoteDriver, and this is what I had most success with on a Mac. There are various other settings you can include in the remoteDriver() function, but unless there is a compelling reason, use the defaults.
Now we are ready to drive a firefox browser remotely. There is a set of functions that can be applied to the remDr object using the form: remDr$function()
So, first off we need to open a browser session:
remDr$open(silent=T)
Sys.sleep(2) # give it a moment
You should see an instance of Firefox start up and open a blank page. If so, you are good to go. The switch “silent=T” suppresses messages to the R console. See the documentation for the complete list of available functions, but a few I made extensive use of are as follows:
- remDr$navigate(url) Go to the URL specified – you should see it open in the browser.
- remDr$getCurrentUrl() Returns the current URL – useful for debugging when you are getting started. Save the result of this call to re-visit a page later.
- remDr$refresh() / remDr$goBack() / remDr$goForward() Refresh the current page, go back to the previous URL or forward to the URL you just came back from. I avoid using these as there may be detritus on the page from previous activities (e.g. data in form fields). Better to use getCurrentUrl() and navigate().
- remDr$findElement() Allows you to search the page for various elements such as tags (e.g. using=”tag”, value=”table”), or ids (e.g. using=”id”, value=”tableID”), etc. The first element matching the criteria is returned as an object of class “webElement”.
- remDr$findElements() As above but returns a list of all the elements matching the criteria.
Once you have retrieved a web element of interest, there are a set of functions you can apply to it. A simple example should illustrate some of the functionality. Let’s say we want to look for pool P012345 on the NFA’s website using the BASIC search page. This involves:
- Going to the NFA’s BASIC search page
- Finding the appropriate search field
- Entering the pool id into it
- Finding the “Go” button
- Clicking it
Using Chrome’s “Inspect” feature, I determined that the box where you enter the search term has id = “ctl00_cphMain_txtNFAID”, and the “Go” button has id = “ctl00_cphMain_btnSearchNFAID”. I use $findElement() to find these items on the search page, saving them as webElement variables txtNFAIDElem and btnSearchNFAIDElem respectively. Next I get to use $sendKeysToElement() that acts on a webElement to send the NFAID to the <input> field. Finally, I use $clickElement() to click the “Go” button which is an <a> tag. The script looks like this:
searchPageUrl <- "http://www.nfa.futures.org/basicnet/"
NFAID <- "P012345"
remDr$navigate(searchPageUrl)
txtNFAIDElem <- remDr$findElement(using="id", value="ctl00_cphMain_txtNFAID")
btnSearchNFAIDElem <- remDr$findElement(using="id", value="ctl00_cphMain_btnSearchNFAID")
txtNFAIDElem$sendKeysToElement(list(NFAID))
btnSearchNFAIDElem$clickElement()
The search results should appear after the $clickElement() function is invoked.
Other useful functions that can be applied to webElement variables include:
- webElement$clearElement() Clears any text that has been entered into the element (e.g. via $sendKeysToElement()).
- webElement$sendKeysToElement() Allows entry of text (including the “enter” character) into an element. I generally found it more reliable to “click” elements than to send “return”.
- webElement$clickElement() Send a mouse-click to the element.
- webElement$findChildElement() / webElement$findChildElements() Essentially gives you the tools to find elements within elements.
- webElement$getElementText() Returns as a single string all the text from the beginning to the end of the element.
One of the tricky aspects of using RSelenium is dealing with nested elements. For example: I have a table nested within a table; the inner table appears on row 1 of the outer table and has a variable number of rows, and I am interested in the content of row 2 of the outer table. Since findChildElements(using=”tag”, value=”tr”) gives me all the descendant rows of the outer table, regardless of whether they belong to the outer table or the inner one, I have to determine the number of rows in the inner table to figure out which of the child elements is row 2 of the outer table. It would be great if there were a switch specifying whether to ignore the content of child nodes that are the same kind of node as the parent.
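One workaround (not quite a switch, but close) is a relative XPath: the child axis “./tr” matches only the outer table’s own rows, while the descendant axis “.//tr” picks up the inner table’s rows too. The sketch below demonstrates the difference on static HTML using the XML package; the same expression can be passed to webElement$findChildElements(using=”xpath”, value=”./tr”), though be aware that live browsers insert an implicit tbody element, so from a table element you may need “./tbody/tr” instead:

```r
library(XML)

# A nested-table example: the inner table sits in row 1 of the outer table
html <- paste0(
  "<table id='outer'>",
  "<tr><td><table id='inner'>",
  "<tr><td>a</td></tr><tr><td>b</td></tr>",
  "</table></td></tr>",
  "<tr><td>row 2 of outer</td></tr>",
  "</table>"
)
doc   <- htmlParse(html, asText = TRUE)
outer <- getNodeSet(doc, "//table[@id='outer']")[[1]]

length(getNodeSet(outer, ".//tr")) # 4 -- descendant axis includes the inner rows
length(getNodeSet(outer, "./tr"))  # 2 -- child axis sees only the outer table's rows
```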
If you are having difficulty with lengthy operations or slow websites, these will help:
- remDr$setImplicitWaitTimeout(milliseconds = 10000)
- remDr$setTimeout(type = “page load”, milliseconds = 10000) The type can also be set to “script” or “implicit”; the latter does the same as the function above.
Sometimes a website throws up an alert which you have to acknowledge:
- remDr$acceptAlert() For the kinds of alert where there is only an “ok” button.
- remDr$dismissAlert() For alerts generated by confirm() or prompt() this is like clicking “cancel”, for alerts generated by alert() this is like clicking “ok”.
- remDr$getAlertText() allows you to retrieve the text and make your response conditional.
- remDr$sendKeysToAlert(sendKeys) allows you to control your response to an alert.
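Putting getAlertText() to work, the decision logic can live in a small helper so it is easy to test away from the browser. The alert wording below is made up, not a real SEC or NFA message:

```r
# Decide how to answer an alert before touching the browser.
# The "continue" wording is a made-up example.
respondToAlert <- function(alertText) {
  if (grepl("continue", alertText, ignore.case = TRUE)) "accept" else "dismiss"
}

# Inside a live session (remDr open, alert showing), you would then do:
# action <- respondToAlert(remDr$getAlertText()[[1]])
# if (action == "accept") remDr$acceptAlert() else remDr$dismissAlert()
```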
Once you are finished, you need to close the browser session:
remDr$close()
Headless Browsing Using PhantomJS
Headless browsing is an activity where you read the website but don’t spend any processing power rendering it for display on the screen. In theory this makes headless browsing much faster than using a regular browser. RSelenium allows you to use PhantomJS as a headless browser.
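For reference, here is the shape of a PhantomJS session, assuming the phantomjs binary is installed and on your PATH; phantom() is RSelenium’s helper for starting it in webdriver mode, and the StockCharts URL is just an example destination:

```r
library(RSelenium)

remDr <- remoteDriver(browserName = "phantomjs")  # no browser window will open

# With phantomjs available, the session then runs like the Firefox one:
# pJS <- phantom()                 # start the phantomjs process in webdriver mode
# remDr$open(silent = TRUE)
# remDr$navigate("http://stockcharts.com/")
# remDr$getTitle()
# remDr$close()
# pJS$stop()                       # shut the phantomjs process down
```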
I was able to get this to work for simple things, for example going to the StockCharts website and grabbing some data. I also got it to work intermittently with the NFA website. What I found was that I could go through a few cycles and then R would simply hang. I could not find a pattern – if I re-ran the same script it would hang at a different point each time. I tried putting delays in all over the place (defeating the purpose of going headless), but it didn’t help – and I would assume that the code waits for acknowledgements from the website before executing the next line anyway. I tried downloading various versions of PhantomJS as denizens of StackExchange indicated this might help. All to no avail.
If anyone knows how to get PhantomJS to work in the Mac OSX environment, I would love to hear from you.
Chrome Browsing
I could not get RSelenium to work reliably on my Mac using Chrome, my preferred browser.
Internet Explorer Browsing
I did not try out IE.
In terms of web scraping, I don’t think there is anything RSelenium is lacking, other than being able to work with PhantomJS on a Mac. Given that nearly all of the time spent web-scraping goes to waiting for pages to load, I am not sure headless browsing really makes much difference. Unfortunately I can’t actually test that!
One final note. I was wondering what monitoring the SEC and NFA do of the browsing activity on their websites. When I ploughed through the entire NFA pool database, with well over 100,000 page searches in succession, at the back of my mind I was wondering whether this might look like some kind of nefarious activity from their perspective. I haven’t received any calls from them … or seen any black helicopters overhead … yet!