Web Scraping with JDK 8 ScriptEngine (Nashorn) and Scala

Web scraping is the process of extracting entities from web pages. These entities can be news articles, blog posts, products, or any other information displayed on the web. Web pages consist of HTML (an XML-like structure) and therefore present information in a structured form, the DOM (Document Object Model), from which it can be extracted. The structure is usually not very strict, not consistent with that of other web pages, and subject to change as the web page evolves. Because extraction rules have to be adapted whenever the structure changes, using a scripting language rather than a compiled one has a lot of advantages. On the other hand, compiled languages often have a speed advantage as well as a broad foundation of fundamental libraries. The W3C's DOM reference implementation, for example, is written in Java.

Java’s approach of opening up the JVM (Java Virtual Machine) to other (dynamically typed) languages is a great way to combine the benefits of both compiled and scripting languages. With JDK 8, compiling dynamic languages to the JVM has become simpler, with potentially improved implementations of compilers and runtime systems through the invokedynamic instruction. I found that this presentation currently explains best what invokedynamic is and how it works in general.

The language of the web is without doubt JavaScript. Fortunately, JDK 8 comes with a JavaScript implementation called Nashorn, which makes use of JSR 292 and invokedynamic. If you download the JDK, you can get started right away with the jjs (JavaJavaScript) binary, which provides a REPL for Nashorn. To learn more about Nashorn you can read here, here, and here. In this post I would like to demonstrate how to use it for web scraping.

Continue reading “Web Scraping with JDK 8 ScriptEngine (Nashorn) and Scala” →

Introduction to Node.js at the GfK Nurago and SirValUse Academy 2013


Overview

This is an introduction to Node.js given at the GfK Nurago and SirValUse Academy 2013 in Hamburg. In this talk I try to reason about the use of server-side JavaScript and to work out what makes Node.js so special: the event loop.
At the end of the talk I give some examples of what I think are particularly good use cases for Node.js. I also make some notes about running Node in a production environment.

For notes Continue reading “Introduction to Node.js at the GfK Nurago and SirValUse Academy 2013” →

Closures with JavaScript and Python

Closures are functions, or references to functions, that hold non-local variables within their scope. These variables remain available to the function even after the scope in which they were defined has ceased to exist. The variables are therefore enclosed within the lexical scope of that function.
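To make the definition concrete, here is a minimal JavaScript sketch. The names makeCounter, increment, and count are hypothetical and chosen only for this illustration. Each call to makeCounter creates a fresh count variable, and the returned function keeps access to it after makeCounter has returned.

```javascript
// makeCounter is a hypothetical helper used only for this illustration.
function makeCounter() {
  let count = 0; // non-local variable, captured by the inner function

  // The returned function still sees `count` after makeCounter has returned.
  return function increment() {
    count += 1;
    return count;
  };
}

const counterA = makeCounter();
const counterB = makeCounter(); // gets its own, independent `count`

console.log(counterA()); // 1
console.log(counterA()); // 2
console.log(counterB()); // 1
```

The two counters do not interfere with each other because every call to makeCounter encloses its own variable.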
Closures are particularly useful in JavaScript, where every function call (even a recursive call to the same function) creates a new execution context, and automatic garbage collection discards all contexts that are no longer referenced. For a detailed explanation, Jim Ley’s description of closures in JavaScript has proven to be a great resource. Continue reading “Closures with JavaScript and Python” →