Web Scraping with JDK 8 ScriptEngine (Nashorn) and Scala

Web scraping is the process of extracting entities from web pages. These entities can be news articles, blog posts, products, or any other information displayed on the web. Web pages consist of HTML (an XML-like structure) and therefore present information in a structured form, a DOM (Document Object Model), from which it can be extracted. This structure is usually not very strict, is rarely consistent across web pages, and is subject to change as a page evolves. Using scripting rather than compiled programming languages therefore has a lot of advantages. On the other hand, compiled languages often have a speed advantage as well as a broad foundation of fundamental libraries; the W3C's DOM reference implementation is written in Java, for example.

Java’s approach of opening up the JVM (Java Virtual Machine) to other (dynamically typed) languages is a great way to combine the benefits of both a compiled and a scripted language. With JDK 8, compiling dynamic languages to the JVM has become simpler, with potentially improved implementations of compilers and runtime systems through the invokedynamic instruction. I found that this presentation currently gives the best explanation of what invokedynamic is and how it works in general.

The language of the web is without doubt JavaScript. Fortunately, JDK 8 comes with a JavaScript implementation called Nashorn, which makes use of JSR 292 and invokedynamic. If you download the JDK you can simply get started with the jjs (JavaJavaScript) binary, which is a REPL for Nashorn. To learn more about Nashorn you can read here, here, and here. In this post I would like to demonstrate how to use it for web scraping.

For this demonstration we’ll extract products from the BestBuy web page. The heavy lifting will actually be performed by Jsoup, a Java HTML parser. Jsoup has no third-party dependencies, and we’ll use it to fetch and parse the HTML. Since Nashorn is part of the JDK and Jsoup has no third-party dependencies, we can keep our whole approach very lightweight.

Nashorn and JavaJavaScript

To extract laptops from BestBuy we will use a JavaScript extractor script. Later, the model class and the interface for the extractor will be defined in Scala and Java. By doing this we can precisely describe and test how an extractor should work and what it is supposed to return, while still using a dynamic language for extraction.

To get started with Scala, Jsoup, and Nashorn we can use Scala’s REPL. We just include Jsoup in the class path and write a first JavaScript to get ourselves acquainted with the way it all works.
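A first session could look something like this (a minimal sketch; the Option guard is only needed because Nashorn was removed in JDK 15, so the lookup can return null on newer JDKs):

```scala
import javax.script.ScriptEngineManager

// "nashorn" ships with JDK 8; on JDK 15+ the lookup returns null.
val engine = Option(new ScriptEngineManager().getEngineByName("nashorn"))

engine.foreach { e =>
  // A first plain JavaScript snippet, evaluated on the JVM.
  e.eval("""
    var greet = function (name) { return 'Hello, ' + name; };
    print(greet('Nashorn'));
  """)
}
```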

Here we simply launch the Scala interpreter and write our first JavaScript. As mentioned earlier, we are going to use Jsoup to fetch and parse the HTML from BestBuy’s web page. This gives us an HTML Document object from which we can extract elements, using CSS selectors for example. The Document object can be injected into our JavaScript and used by converting our engine into a javax.script.Invocable:
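A sketch of this injection, assuming Jsoup is on the class path (e.g. started as scala -cp jsoup.jar). A small inline page stands in for the fetched BestBuy HTML, and the selectors are assumptions; a real run would obtain the Document with Jsoup.connect(url).get():

```scala
import javax.script.{Invocable, ScriptEngineManager}
import org.jsoup.Jsoup

// Inline sample HTML instead of a live fetch, to keep the sketch deterministic.
val html =
  """<div class="product"><h4 class="title">Laptop A</h4></div>
    |<div class="product"><h4 class="title">Laptop B</h4></div>""".stripMargin

val doc = Jsoup.parse(html) // an org.jsoup.nodes.Document

val engine = new ScriptEngineManager().getEngineByName("nashorn")

// The JS function receives the Java Document directly; Nashorn's interop
// lets the script call its select(...) method with a CSS selector.
engine.eval("""
  function titles(doc) {
    return doc.select('h4.title').text();
  }
""")

val titles = engine.asInstanceOf[Invocable].invokeFunction("titles", doc)
println(titles)
```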

We have now created our first extraction script to be used with Jsoup and Scala. This was a very simple and rapid approach with few dependencies. To make it all a little more robust and useful, we are going to introduce an interface for the extractor and a model class to make the extraction more accountable and well defined.

Currently a function like the one above could return anything: a plain JavaScript object, a string, etc. Developers could structure the script in almost any way possible. With multiple web pages and multiple developers, this becomes very tedious to maintain and test.

For this purpose we will create an extractor interface describing the methods we expect from a script – written in any possible language, like Python, Ruby, or Scala (as of 2.11, Scala itself is JSR-223 compliant) – and use a POJO to specify what an extractor should return. Evaluating the script prior to execution gives us even more accountability, in addition to writing proper tests.

We are not going to over-complicate things and will just define two methods: one method to return pagination URLs and one method to return a list of laptops. The extractor interface could look like this:
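One possible shape for that interface, sketched as a Scala trait (the method names are assumptions based on the description above; the Laptop placeholder stands in for the model class defined further below):

```scala
import org.jsoup.nodes.Document

class Laptop // placeholder; the model class is fleshed out in a later section

trait LaptopExtractor {
  // URLs of further result pages to visit
  def extractPaginationUrls(doc: Document): java.util.List[String]
  // the laptops found on the given page
  def extractLaptops(doc: Document): java.util.List[Laptop]
}
```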

A possible script we could use is the following written in JavaScript:
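Sketched here as a Scala string for brevity (in practice it would live in its own .js file; the CSS selectors and the com.example package are assumptions), such a script could look like this:

```scala
// The extractor itself is plain JavaScript; Java.type gives the script
// access to Java classes on the classpath, such as the compiled Laptop POJO.
val extractorJs = """
  var LinkedList = Java.type('java.util.LinkedList');
  var Laptop = Java.type('com.example.Laptop');

  function extractPaginationUrls(doc) {
    var urls = new LinkedList();
    var links = doc.select('ul.pagination a');
    for (var i = 0; i < links.size(); i++) {
      urls.add(links.get(i).absUrl('href'));
    }
    return urls;
  }

  function extractLaptops(doc) {
    var laptops = new LinkedList();
    var products = doc.select('div.product');
    for (var i = 0; i < products.size(); i++) {
      var laptop = new Laptop();
      laptop.setName(products.get(i).select('h4.title').text());
      laptop.setPrice(products.get(i).select('span.price').text());
      laptops.add(laptop);
    }
    return laptops;
  }
"""
```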

What is still missing, and what we are going to define next, is our laptop POJO. From the example script above you can already see that this POJO is also going to be very simple. Let’s have a look:
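A sketch of such a model class in Scala (the field names are assumptions; @BeanProperty generates the getName/setName-style accessors that the JavaScript side calls):

```scala
import scala.beans.BeanProperty

// A minimal, mutable JavaBean-style model the script can populate.
class Laptop {
  @BeanProperty var name: String = _
  @BeanProperty var price: String = _

  override def toString = s"Laptop($name, $price)"
}
```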

Putting it All Together

Let’s now put this all together and use javax.script.Invocable to create, from our extractor script, a well-defined extractor object we can use to scrape laptops from BestBuy. We will use Scala’s REPL once again:
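The whole pipeline could be wired up roughly like this. It is a self-contained sketch (selectors, field names, and the sample HTML are assumptions): an inline page replaces the live fetch, and the script is handed a small factory because classes defined in the REPL are not visible to Java.type:

```scala
import javax.script.{Invocable, ScriptEngineManager}
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import scala.beans.BeanProperty
import scala.collection.JavaConverters._

class Laptop {
  @BeanProperty var name: String = _
  @BeanProperty var price: String = _
}

trait LaptopExtractor {
  def extractPaginationUrls(doc: Document): java.util.List[String]
  def extractLaptops(doc: Document): java.util.List[Laptop]
}

val engine = new ScriptEngineManager().getEngineByName("nashorn")

// REPL-defined classes can't be looked up via Java.type, so the script
// gets a factory object to construct Laptop instances.
engine.put("newLaptop", new java.util.function.Supplier[Laptop] {
  def get() = new Laptop
})

engine.eval("""
  var LinkedList = Java.type('java.util.LinkedList');

  function extractPaginationUrls(doc) {
    var urls = new LinkedList();
    var links = doc.select('ul.pagination a');
    for (var i = 0; i < links.size(); i++) urls.add(links.get(i).attr('href'));
    return urls;
  }

  function extractLaptops(doc) {
    var laptops = new LinkedList();
    var products = doc.select('div.product');
    for (var i = 0; i < products.size(); i++) {
      var laptop = newLaptop.get();
      laptop.setName(products.get(i).select('h4.title').text());
      laptop.setPrice(products.get(i).select('span.price').text());
      laptops.add(laptop);
    }
    return laptops;
  }
""")

// getInterface wraps the script's global functions in our trait.
val extractor = engine.asInstanceOf[Invocable].getInterface(classOf[LaptopExtractor])

// Inline sample page; the real run would use Jsoup.connect(listingUrl).get().
val doc = Jsoup.parse(
  """<div class="product"><h4 class="title">Laptop A</h4><span class="price">$499</span></div>
    |<div class="product"><h4 class="title">Laptop B</h4><span class="price">$649</span></div>
    |<ul class="pagination"><li><a href="?page=2">2</a></li></ul>""".stripMargin)

val laptops = extractor.extractLaptops(doc)
laptops.asScala.foreach(l => println(l.getName + " " + l.getPrice))
```

Passing a Supplier instead of relying on Java.type is only needed inside the REPL; with the POJO compiled on the classpath the script from the previous section works unchanged.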

As you can see, we created a valid instance of the LaptopExtractor trait and returned a list of POJOs from the BestBuy web page using JavaScript. The approach also builds upon a robust library, Jsoup, and we used Scala’s REPL to rapidly prototype our extraction. To sum it up: the extraction logic lives in a dynamic language that is easy to adapt as the page changes, a typed interface and model class make every extractor accountable and testable, and the only dependencies are the JDK itself and Jsoup.

Further Reading