Web Scraping with JDK 8 ScriptEngine (Nashorn) and Scala

Web scraping is the process of extracting entities from web pages. These entities can be news articles, blog posts, products, or any other information displayed on the web. Web pages consist of HTML (an XML-like structure) and therefore present information in a structured form, a DOM (Document Object Model), from which it can be extracted. The structure is usually not very strict, differs from one web site to the next, and is subject to change as the web page evolves. Using scripts rather than compiled programs therefore has a lot of advantages. On the other hand, compiled languages often have a speed advantage as well as a broad foundation of fundamental libraries. The DOM reference implementation of the W3C is written in Java, for example.

Java’s approach of opening up the JVM (Java Virtual Machine) to other (dynamically typed) languages is a great way to combine the benefits of both a compiled and a scripting language. With JDK 8, compiling dynamic languages to the JVM has become simpler, with potentially improved compilers and runtime systems, thanks to the invokedynamic instruction (introduced in Java 7 by JSR 292). I found that this presentation currently explains best what invokedynamic is and how it works in general.

The language of the web is without doubt JavaScript. Fortunately, JDK 8 comes with a JavaScript implementation called Nashorn, which makes use of JSR 292 and invokedynamic. If you download the JDK you can get started right away with jjs (JavaJavaScript), a binary that provides a REPL for Nashorn. To learn more about Nashorn you can read here, here, and here. In this post I would like to demonstrate how to use it for web scraping.

For this demonstration we’ll extract products from the BestBuy web page. The heavy lifting will actually be performed by Jsoup, a Java HTML parser. Jsoup has no third-party dependencies, and we’ll use it to fetch and parse the HTML. Since Nashorn is part of the JDK and Jsoup has no third-party dependencies, we can keep our whole approach very lightweight.

Nashorn and JavaJavaScript

To extract laptops from BestBuy we will use a JavaScript extractor script. Later, the model class and the interface for the extractor will be defined in Java and Scala. By doing this we can describe and test precisely how an extractor should work and what it’s supposed to return, while still using a dynamic language for the extraction.

To get started with Scala, Jsoup and Nashorn we can use Scala’s REPL. We just add Jsoup to the class path and write a first JavaScript to get ourselves acquainted with the way it all works.

%> scala -classpath jsoup-1.7.3.jar
scala> import javax.script.{Invocable, ScriptEngine, ScriptEngineManager}
import javax.script.{Invocable, ScriptEngine, ScriptEngineManager}

scala> val manager: ScriptEngineManager = new ScriptEngineManager
manager: javax.script.ScriptEngineManager = javax.script.ScriptEngineManager@1936f0f5

scala> val engine: ScriptEngine = manager.getEngineByName("nashorn")
engine: javax.script.ScriptEngine = jdk.nashorn.api.scripting.NashornScriptEngine@4ec4f3a0

scala> engine.eval("function hello(){print('Hello');} hello();")
Hello
res0: Object = null

Here we simply launch the Scala interpreter and write our first JavaScript. As mentioned earlier, we are going to use Jsoup to fetch and parse the HTML from BestBuy’s web page. This gives us an HTML Document object from which we can extract elements, for example by using CSS selectors. The Document object can be passed into our JavaScript by casting our engine to a javax.script.Invocable and invoking a function with it:

scala> import org.jsoup.nodes.Document
import org.jsoup.nodes.Document

scala> import org.jsoup.Jsoup
import org.jsoup.Jsoup

scala> val doc: Document = Jsoup.connect("http://www.bestbuy.com/site/laptop-computers/pc-laptops/pcmcat247400050000.c?id=pcmcat247400050000").get
doc: org.jsoup.nodes.Document =
<!DOCTYPE html>
....

scala> engine.eval("function extractTitle(doc){ print(doc.select('head title').first().html()); }")
res1: Object = function extractTitle(doc){ print(doc.select('head title').first().html()); }

scala> val in: Invocable = engine.asInstanceOf[Invocable];
in: javax.script.Invocable = jdk.nashorn.api.scripting.NashornScriptEngine@3ad83a66

scala> in.invokeFunction("extractTitle", doc);
PC Laptops - Best Buy
res3: Object = null
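One caveat about the engine lookup used above: ScriptEngineManager.getEngineByName returns null rather than throwing when no engine is registered under the given name. That matters here because Nashorn was deprecated in JDK 11 and removed in JDK 15, so on a newer JDK the cast to Invocable would fail with a NullPointerException further down. The following is a minimal Java sketch of a defensive lookup; the class name EngineLookup and the message text are illustrative assumptions, not part of the original post.

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

// Hypothetical helper: reports which engine, if any, is registered under a name.
final class EngineLookup {

    // Describes the engine registered under `name`, or says that none exists.
    static String describe(String name) {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName(name);
        if (engine == null) {
            // getEngineByName signals "not found" with null, not an exception.
            return "no script engine registered as '" + name + "'";
        }
        return name + " -> " + engine.getFactory().getEngineName();
    }

    public static void main(String[] args) {
        // What the first line prints depends on the JDK in use.
        System.out.println(describe("nashorn"));
        System.out.println(describe("no-such-engine"));
    }
}
```

Checking for null right after the lookup keeps the failure close to its cause instead of surfacing later as an unrelated-looking error.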

We have now created our first extraction script to be used with Jsoup and Scala. This was a very simple and rapid approach with few dependencies. To make it all a bit more robust and useful, we are going to introduce an interface for the extractor and a model class, so that the extraction becomes more accountable and well defined.

Currently, a function like the one above could return anything: a plain JavaScript object, a string, and so on. Developers could structure the script in almost any way possible. With multiple web pages and multiple developers, this can become very tedious to maintain and test.

For this purpose we will create an extractor interface describing the methods we expect from a script, which could be written in any JSR-223 language such as Python, Ruby, or Scala (as of 2.11, Scala itself is JSR-223 compliant), and use a POJO to specify what an extractor should return. Evaluating the script and binding it to the interface before it is executed gives us even more accountability, in addition to writing proper tests.

We are not going to over-complicate things and will define just two methods: one to return pagination URLs and one to return a list of laptops. The extractor interface could look like this:

trait LaptopExtractor {
  def extractLaptops(doc: Document): Array[Laptop] // list of laptops
  def getPagination(doc: Document): Array[String] // list of URLs to fetch next
}
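Binding a script to such an interface goes through Invocable.getInterface, which is documented to return null when the script does not provide functions matching the interface's methods. A misspelled function name is then reported at binding time instead of on the first call. Below is a hedged Java sketch of a wrapper that turns that null into an explicit error; the names ExtractorBinding and bind are assumptions made for illustration, and Runnable stands in for the extractor interface so that the example needs nothing beyond the JDK.

```java
import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

// Hypothetical helper: binds script functions to an interface or fails loudly.
final class ExtractorBinding {

    // Returns a proxy for `iface` backed by the script's top-level functions.
    static <T> T bind(Invocable invocable, Class<T> iface) {
        T proxy = invocable.getInterface(iface);
        if (proxy == null) {
            // null means the script lacks functions matching the interface.
            throw new IllegalStateException(
                "Script does not implement all methods of " + iface.getName());
        }
        return proxy;
    }

    public static void main(String[] args) throws Exception {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("nashorn");
        if (engine == null) {
            System.out.println("Nashorn is not available on this JDK");
            return;
        }
        // The script defines run(), the single method of java.lang.Runnable.
        engine.eval("function run() { print('running from script'); }");
        Runnable r = bind((Invocable) engine, Runnable.class);
        r.run();
    }
}
```

With the trait from this post, the equivalent call from Scala would pass classOf[LaptopExtractor] as the second argument.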

A possible script we could use is the following written in JavaScript:

function extractLaptops(doc){
    var LaptopClass = Java.type("bestbuy.Laptop");
    var laptops = new Array();
    doc.select('div#listView > div.hproduct').forEach(function(l){
        var laptop = new LaptopClass();
        laptop.name = l.select('h3[itemprop=name] > a').first().html();
        laptop.price = parseFloat(l.select('span[itemprop=price]').first().html().replace("$",""));
        laptops.push(laptop);
    });
    return laptops;
}

function getPagination(doc){
    var urls = new Array();
    doc.select('ul.pagination > li > a').forEach(function(l){
        urls.push(l.attr('href'));
    });
    return urls;
}

What is still missing, and what we are going to define next, is our laptop POJO. From the example script above you can already see that this POJO is going to be very simple. Let’s have a look:

package bestbuy;

public class Laptop {
    public String name;
    public Double price;
    public Laptop(){}
    public Laptop(String name, Double price){this.name = name; this.price=price;}
}
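One detail of the extractor script is worth a second look. JavaScript's parseFloat stops at the first character that cannot be part of a number, so if a page ever formatted a price with a thousands separator, such as $1,299.99, the script's replace("$","") followed by parseFloat would yield 1. The following Java sketch shows one way to parse such strings more defensively. The class PriceParser is hypothetical and not part of the original post, and it assumes US-style formatting with a dot as the decimal separator.

```java
// Hypothetical helper: extracts a numeric price from a display string.
final class PriceParser {

    // Keeps only digits and the decimal point, then parses the remainder.
    // Assumes "." is the decimal separator and "," only groups thousands.
    static double parse(String raw) {
        String digits = raw.replaceAll("[^0-9.]", "");
        if (digits.isEmpty()) {
            throw new IllegalArgumentException("no price found in: " + raw);
        }
        return Double.parseDouble(digits);
    }

    public static void main(String[] args) {
        System.out.println(parse("$379.99"));    // prints 379.99
        System.out.println(parse("$1,299.99"));  // prints 1299.99
    }
}
```

The result is a double so that it fits the Double field of the POJO above. For monetary arithmetic, java.math.BigDecimal would be the more careful choice.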

Putting it All Together

Let’s now put this all together and use javax.script.Invocable to turn our extractor script into a well-defined extractor object that we can use to scrape laptops from BestBuy. We will use Scala’s REPL once again:

%> scala -classpath jsoup-1.7.3.jar:bestbuy.jar
scala> import bestbuy.Laptop
import bestbuy.Laptop

scala> import org.jsoup.Jsoup
import org.jsoup.Jsoup

scala> import org.jsoup.nodes.Document
import org.jsoup.nodes.Document

scala> import javax.script.{Invocable, ScriptEngine, ScriptEngineManager}
import javax.script.{Invocable, ScriptEngine, ScriptEngineManager}

scala> val manager: ScriptEngineManager = new ScriptEngineManager
manager: javax.script.ScriptEngineManager = javax.script.ScriptEngineManager@725bef66

scala> val engine: ScriptEngine = manager.getEngineByName("nashorn")
engine: javax.script.ScriptEngine = jdk.nashorn.api.scripting.NashornScriptEngine@31368b99

scala> val script: String = scala.io.Source.fromFile(new java.io.File(System.getProperty("user.home"), "extractLaptop.js")).mkString
script: String =
"function extractLaptops(doc){...

scala> engine.eval(script)
res4: Object =
function getPagination(doc){...

scala> val in: Invocable = engine.asInstanceOf[Invocable]
in: javax.script.Invocable = jdk.nashorn.api.scripting.NashornScriptEngine@31368b99

scala> trait LaptopExtractor {
     |   def extractLaptops(doc: Document): Array[Laptop] // list of laptops
     |   def getPagination(doc: Document): Array[String] // list of URLs to fetch next
     | }
defined trait LaptopExtractor

scala> val lExtractor: LaptopExtractor = in.getInterface(classOf[LaptopExtractor])
lExtractor: LaptopExtractor = LaptopExtractor$$NashornJavaAdapter@5e76a2bb

scala> val doc: Document = Jsoup.connect("http://www.bestbuy.com/site/laptop-computers/pc-laptops/pcmcat247400050000.c?id=pcmcat247400050000").get
doc: org.jsoup.nodes.Document =
<!DOCTYPE html>
...

scala> val laptops:Array[Laptop] = lExtractor.extractLaptops(doc)
...

scala> for(lap <- laptops){ println(lap.name + "\t" + lap.price) }
HP - TouchSmart 15.6&quot; Touch-Screen Laptop - 4GB Memory - 750GB Hard Drive - Sparkling Black    379.99
Toshiba - Satellite 15.6&quot; Laptop - Intel Celeron - 4GB Memory - 500GB Hard Drive - Satin Black 249.99
HP - Pavilion 17.3&quot; Laptop - 4GB Memory - 750GB Hard Drive - Anodized Silver   399.99
Asus - 11.6&quot; Touch-Screen Laptop - Intel Celeron - 4GB Memory - 500GB Hard Drive - Black   279.99
...

scala> val pageUrls:Array[String] = lExtractor.getPagination(doc)
...

scala> for(u <- pageUrls){println(u)}
/site/laptop-computers/pc-laptops/pcmcat247400050000.c?id=pcmcat247400050000&gf=y&cp=2
/site/laptop-computers/pc-laptops/pcmcat247400050000.c?id=pcmcat247400050000&gf=y&cp=3
/site/laptop-computers/pc-laptops/pcmcat247400050000.c?id=pcmcat247400050000&gf=y&cp=4
/site/laptop-computers/pc-laptops/pcmcat247400050000.c?id=pcmcat247400050000&gf=y&cp=5
...
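The pagination links printed above are relative to the site root, so they have to be resolved against the URL of the page they came from before they can be handed to Jsoup.connect. Consecutive result pages also tend to list many of the same links, which makes de-duplication useful. A minimal Java sketch using only java.net.URI follows; the class name PaginationUrls and the method resolveAll are assumptions for illustration.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper: turns relative pagination links into absolute URLs.
final class PaginationUrls {

    // Resolves each href against the page URL; duplicates are dropped and
    // first-seen order is kept. URI.create throws IllegalArgumentException
    // for strings that are not valid URIs (for example, unescaped spaces).
    static Set<String> resolveAll(String pageUrl, String[] hrefs) {
        URI base = URI.create(pageUrl);
        Set<String> absolute = new LinkedHashSet<>();
        for (String href : hrefs) {
            absolute.add(base.resolve(href).toString());
        }
        return absolute;
    }

    public static void main(String[] args) {
        String page = "http://www.bestbuy.com/site/laptop-computers/pc-laptops/"
            + "pcmcat247400050000.c?id=pcmcat247400050000";
        String[] hrefs = {
            "/site/laptop-computers/pc-laptops/pcmcat247400050000.c?id=pcmcat247400050000&gf=y&cp=2",
            "/site/laptop-computers/pc-laptops/pcmcat247400050000.c?id=pcmcat247400050000&gf=y&cp=3"
        };
        for (String url : resolveAll(page, hrefs)) {
            System.out.println(url);
        }
    }
}
```

From the Scala REPL the array returned by lExtractor.getPagination(doc) could be passed in directly as the second argument.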

As you can see, we’ve created a valid instance of the LaptopExtractor trait and used it to return a list of POJOs from the BestBuy web page using JavaScript. This approach also builds upon a robust library, Jsoup, and we used Scala’s REPL to rapidly prototype our extraction.
