Web scraping is the process of extracting entities from web pages. These entities can be news articles, blog posts, products, or any other information displayed on the web. Web pages consist of HTML (an XML-like structure) and therefore present information in a structured form, the DOM (Document Object Model), which can be extracted. The structure is usually not very strict, not consistent across web pages, and subject to change as a web page evolves. Using scripts instead of compiled programming languages therefore has a lot of advantages. On the other hand, compiled languages often have a speed advantage as well as a broad foundation of fundamental libraries. The DOM reference implementation of the W3C, for example, is written in Java.
Java's approach of opening up the JVM (Java Virtual Machine) to other (dynamically typed) languages is a great way to combine the benefits of both a compiled and a scripted language. With JDK 8, compiling dynamic languages to the JVM has become simpler, with potentially improved implementations of compilers and runtime systems through the invokedynamic instruction. I found this presentation to be the best current explanation of what invokedynamic is and how it works in general.
The language of the web is without doubt JavaScript. Fortunately, JDK 8 ships with a JavaScript implementation called Nashorn, which makes use of JSR 292 and invokedynamic. If you download the JDK you can simply get started with the jjs (Java JavaScript) binary, a REPL for Nashorn. To learn more about Nashorn you can read here, here, and here. In this post I would like to demonstrate how to use it for web scraping.
For this demonstration we'll extract products from the BestBuy web page. The heavy lifting will actually be performed by Jsoup, a Java HTML parser. Jsoup has no third-party dependencies, and we'll use it to fetch and parse the HTML. Since Nashorn is part of the JDK and Jsoup has no third-party dependencies, we can keep our whole approach very lightweight.
Nashorn and JavaJavaScript
To extract laptops from BestBuy we will use a JavaScript extractor script. Later, the model class and the interface for the extractor will be defined in Scala and Java. By doing this we can precisely describe and test how an extractor should work and what it's supposed to return, while still using a dynamic language for extraction.
To get started with Scala, Jsoup, and Nashorn we can use Scala's REPL. We just include Jsoup in the classpath and write a first JavaScript snippet to get ourselves acquainted with the way it all works.
%> scala -classpath jsoup-1.7.3.jar

scala> import javax.script.{Invocable, ScriptEngine, ScriptEngineManager}
import javax.script.{Invocable, ScriptEngine, ScriptEngineManager}

scala> val manager: ScriptEngineManager = new ScriptEngineManager
manager: javax.script.ScriptEngineManager = javax.script.ScriptEngineManager@1936f0f5

scala> val engine: ScriptEngine = manager.getEngineByName("nashorn")
engine: javax.script.ScriptEngine = jdk.nashorn.api.scripting.NashornScriptEngine@4ec4f3a0

scala> engine.eval("function hello(){print('Hello');} hello();")
Hello
res0: Object = null
Here we simply launch the Scala interpreter and write our first JavaScript. As mentioned earlier, we are going to use Jsoup to fetch and parse the HTML from BestBuy's web page. This will give us an HTML Document object which we can use to extract elements, for example with CSS selectors. The Document object can be injected into our JavaScript and used by converting our engine into a javax.script.Invocable:
scala> import org.jsoup.nodes.Document
import org.jsoup.nodes.Document

scala> import org.jsoup.Jsoup
import org.jsoup.Jsoup

scala> val doc: Document = Jsoup.connect("http://www.bestbuy.com/site/laptop-computers/pc-laptops/pcmcat247400050000.c?id=pcmcat247400050000").get
doc: org.jsoup.nodes.Document = <!DOCTYPE html> ...

scala> engine.eval("function extractTitle(doc){ print(doc.select('head title').first().html()); }")
res1: Object = function extractTitle(doc){ print(doc.select('head title').first().html()); }

scala> val in: Invocable = engine.asInstanceOf[Invocable]
in: javax.script.Invocable = jdk.nashorn.api.scripting.NashornScriptEngine@3ad83a66

scala> in.invokeFunction("extractTitle", doc)
PC Laptops - Best Buy
res3: Object = null
We have now created our first extraction script to be used with Jsoup and Scala. This was a very simple and rapid approach with few dependencies needed. To make it all a little more robust and useful, we are going to introduce an interface for the extractor and a model class to make the extraction more accountable and well defined.
Currently, a function like the one above could return anything: a plain JavaScript object, a string, etc. Developers could structure the script in almost any way possible. With multiple web pages and multiple developers, this quickly becomes tedious to maintain and test.
For this purpose we will create an extractor interface describing the methods we would expect from a script – written in any possible language, like Python, Ruby, or Scala (as of 2.11, Scala itself is JSR-223 compliant) – and use a POJO to specify what an extractor should return. Evaluating the script prior to execution gives us even more accountability, in addition to writing proper tests.
We are not going to over-complicate things and will just define two methods: one method to return pagination URLs, and one method to return a list of laptops. The extractor interface could look like this:
trait LaptopExtractor {
  def extractLaptops(doc: Document): Array[Laptop] // list of laptops
  def getPagination(doc: Document): Array[String]  // list of URLs to fetch next
}
A possible script we could use is the following written in JavaScript:
function extractLaptops(doc){
  var LaptopClass = Java.type("bestbuy.Laptop");
  var laptops = new Array();
  doc.select('div#listView > div.hproduct').forEach(function(l){
    var laptop = new LaptopClass();
    laptop.name = l.select('h3[itemprop=name] > a').first().html();
    laptop.price = parseFloat(l.select('span[itemprop=price]').first().html().replace("$",""));
    laptops.push(laptop);
  });
  return laptops;
}

function getPagination(doc){
  var urls = new Array();
  doc.select('ul.pagination > li > a').forEach(function(l){
    urls.push(l.attr('href'));
  });
  return urls;
}
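One caveat in the script above: the price extraction strips only the dollar sign before parsing, so a price with a thousands separator (e.g. "$1,299.99") would make parseFloat stop at the comma and return 1. A slightly more defensive cleanup, sketched here in Scala (the same regex also works in the JavaScript replace), strips both characters:

```scala
// Sketch: defensive price-string cleanup before numeric parsing.
// Strips both "$" and thousands separators, which the plain
// replace("$","") + parseFloat combination would mishandle.
def parsePrice(raw: String): Double =
  raw.replaceAll("[$,]", "").trim.toDouble

println(parsePrice("$379.99"))   // 379.99
println(parsePrice("$1,299.99")) // 1299.99
```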
What is still missing, and what we are going to define next, is our laptop POJO. From the example script above you can already see that this POJO is also going to be very simple. Let's have a look:
package bestbuy;

public class Laptop {
    public String name;
    public Double price;

    public Laptop() {}

    public Laptop(String name, Double price) {
        this.name = name;
        this.price = price;
    }
}
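Because Nashorn assigns plain public fields, the POJO needs no getters or setters, and Scala can read those fields directly too. A quick sketch of consuming such objects from Scala, e.g. sorting the extracted laptops by price (using a local stand-in class so the snippet runs without bestbuy.jar on the classpath):

```scala
// Stand-in for bestbuy.Laptop, so this sketch runs without the jar on the classpath.
class Laptop(var name: String, var price: Double)

val laptops = Array(
  new Laptop("HP - TouchSmart 15.6\"", 379.99),
  new Laptop("Toshiba - Satellite 15.6\"", 249.99),
  new Laptop("HP - Pavilion 17.3\"", 399.99)
)

// Public fields are read directly, no accessor methods needed.
val cheapestFirst = laptops.sortBy(_.price)
cheapestFirst.foreach(l => println(l.name + "\t" + l.price))
```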
Putting it All Together
Let's now put this all together and use javax.script.Invocable to turn our extractor script into a well-defined extractor object we can use to scrape laptops from BestBuy. We will use Scala's REPL once again:
%> scala -classpath jsoup-1.7.3.jar:bestbuy.jar

scala> import bestbuy.Laptop
import bestbuy.Laptop

scala> import org.jsoup.Jsoup
import org.jsoup.Jsoup

scala> import org.jsoup.nodes.Document
import org.jsoup.nodes.Document

scala> import javax.script.{Invocable, ScriptEngine, ScriptEngineManager}
import javax.script.{Invocable, ScriptEngine, ScriptEngineManager}

scala> val manager: ScriptEngineManager = new ScriptEngineManager
manager: javax.script.ScriptEngineManager = javax.script.ScriptEngineManager@725bef66

scala> val engine: ScriptEngine = manager.getEngineByName("nashorn")
engine: javax.script.ScriptEngine = jdk.nashorn.api.scripting.NashornScriptEngine@31368b99

scala> val script: String = scala.io.Source.fromFile(new java.io.File("~/extractLaptop.js")).mkString
script: String = "function extractLaptops(doc){...

scala> engine.eval(script)
res4: Object = function getPagination(doc){...

scala> val in: Invocable = engine.asInstanceOf[Invocable]
in: javax.script.Invocable = jdk.nashorn.api.scripting.NashornScriptEngine@31368b99

scala> trait LaptopExtractor {
     |   def extractLaptops(doc: Document): Array[Laptop] // list of laptops
     |   def getPagination(doc: Document): Array[String]  // list of URLs to fetch next
     | }
defined trait LaptopExtractor

scala> val lExtractor: LaptopExtractor = in.getInterface(classOf[LaptopExtractor])
lExtractor: LaptopExtractor = LaptopExtractor$$NashornJavaAdapter@5e76a2bb

scala> val doc: Document = Jsoup.connect("http://www.bestbuy.com/site/laptop-computers/pc-laptops/pcmcat247400050000.c?id=pcmcat247400050000").get
doc: org.jsoup.nodes.Document = <!DOCTYPE html> ...

scala> val laptops: Array[Laptop] = lExtractor.extractLaptops(doc)
...
scala> for (lap <- laptops) { println(lap.name + "\t" + lap.price) }
HP - TouchSmart 15.6" Touch-Screen Laptop - 4GB Memory - 750GB Hard Drive - Sparkling Black	379.99
Toshiba - Satellite 15.6" Laptop - Intel Celeron - 4GB Memory - 500GB Hard Drive - Satin Black	249.99
HP - Pavilion 17.3" Laptop - 4GB Memory - 750GB Hard Drive - Anodized Silver	399.99
Asus - 11.6" Touch-Screen Laptop - Intel Celeron - 4GB Memory - 500GB Hard Drive - Black	279.99
...

scala> val pageUrls: Array[String] = lExtractor.getPagination(doc)
...

scala> for (u <- pageUrls) { println(u) }
/site/laptop-computers/pc-laptops/pcmcat247400050000.c?id=pcmcat247400050000&gf=y&cp=2
/site/laptop-computers/pc-laptops/pcmcat247400050000.c?id=pcmcat247400050000&gf=y&cp=3
/site/laptop-computers/pc-laptops/pcmcat247400050000.c?id=pcmcat247400050000&gf=y&cp=4
/site/laptop-computers/pc-laptops/pcmcat247400050000.c?id=pcmcat247400050000&gf=y&cp=5
...
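With both methods in place, driving a small crawl over the pagination URLs is a short loop. The sketch below is generic over the document type so it runs without Jsoup or a network connection; in the real setup D would be org.jsoup.nodes.Document, fetch would wrap Jsoup.connect(url).get, and the two function arguments would delegate to the LaptopExtractor instance. The visited set guards against pagination pages linking back to each other:

```scala
import scala.collection.mutable

// Generic crawl driver: D stands in for org.jsoup.nodes.Document,
// A for the extracted entity type (Laptop in our case).
def crawl[D, A](start: String,
                fetch: String => D,
                extract: D => Seq[A],
                paginate: D => Seq[String]): Seq[A] = {
  val visited = mutable.Set.empty[String]
  val queue   = mutable.Queue(start)
  val results = mutable.Buffer.empty[A]
  while (queue.nonEmpty) {
    val url = queue.dequeue()
    if (visited.add(url)) {          // add returns false if already seen
      val doc = fetch(url)
      results ++= extract(doc)
      queue ++= paginate(doc).filterNot(visited)
    }
  }
  results.toSeq
}

// Tiny in-memory "site" to exercise the loop: two pages linking to each other.
// Each page is a pair of (entities, pagination URLs).
val site = Map(
  "p1" -> (Seq("laptop-a", "laptop-b"), Seq("p2")),
  "p2" -> (Seq("laptop-c"), Seq("p1"))
)
val all = crawl[(Seq[String], Seq[String]), String]("p1", site, _._1, _._2)
println(all) // laptops from both pages, each page fetched exactly once
```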
As you can see, we've created a valid instance of the LaptopExtractor trait and returned a list of POJOs from the BestBuy web page using JavaScript. The approach builds on a robust library, Jsoup, and we used Scala's REPL to rapidly prototype the extraction. To sum it up:
“Hey there’s static type inference in your dynamic language compiler” or what I’ve been working on for past 4 months: http://t.co/spLMKWbaIn
— Attila Szegedi (@asz) May 13, 2014