Python Virtualenv with Hadoop Streaming

If you use Python with Hadoop Streaming a lot, you probably know the trouble of keeping all nodes up to date with the required packages. A nice way to work around this is to use a Virtualenv for each streaming project. Besides removing the hurdle of keeping all nodes in sync with the necessary libraries, Virtualenv also makes it possible to try different versions and setups within the same project seamlessly.

In this example we are going to create a Python job that counts the n-grams of hotel names in relation to the country the hotel is located in. Besides using a Virtualenv in which we install NLTK, we are going to touch on the use of Avro as input for a Python streaming job, as well as secondary sorting with KeyFieldBasedPartitioner and KeyFieldBasedComparator.
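To give a rough idea of what the mapper side of such a job could look like, here is a minimal sketch. It is not the code from the full post: it assumes word-level n-grams, assumes the input already arrives as tab-separated lines of country and hotel name (rather than Avro), and uses plain Python instead of NLTK. The function names word_ngrams, map_line and main are made up for illustration.

```python
import sys


def word_ngrams(text, n):
    """Return the word n-grams of `text` as a list of tuples (assumed: word level, lowercased)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def map_line(line, n=2):
    """Turn one 'country<TAB>hotel name' record into 'country<TAB>ngram<TAB>1' records."""
    country, _, name = line.rstrip("\n").partition("\t")
    return ["%s\t%s\t1" % (country, " ".join(gram)) for gram in word_ngrams(name, n)]


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Streaming entry point: read records from stdin, write key/value lines to stdout."""
    for line in stdin:
        for record in map_line(line):
            stdout.write(record + "\n")
```

A streaming mapper script would end with a call to main(). Emitting the country and the n-gram as the first two tab-separated fields gives the kind of composite key that secondary sorting works on: the job can partition on the country and sort on country plus n-gram.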

Closures with JavaScript and Python

Closures are functions, or references to functions, that hold non-local variables within their scope. These variables live on after the scope in which they were defined has finished executing, because the function still refers to them. They are therefore enclosed within the lexical scope of that function.
This is particularly useful in JavaScript, where every function call (even when a function calls itself recursively) creates a new execution context, and automatic garbage collection throws out all contexts that are no longer referenced. For a detailed explanation, Jim Ley’s description of closures in JavaScript has proven itself a great resource.
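The definition above can be illustrated with a small Python sketch. The names make_counter and increment are hypothetical and chosen only for this example: the inner function keeps the variable count alive after make_counter has returned, and each call to make_counter encloses its own independent copy.

```python
def make_counter(start=0):
    """Return a function that remembers and updates its own private count."""
    count = start  # local to make_counter, but enclosed by increment below

    def increment(step=1):
        nonlocal count  # rebind the enclosed variable instead of creating a new local
        count += step
        return count

    return increment


counter_a = make_counter()
counter_b = make_counter(10)

print(counter_a())  # 1
print(counter_a())  # 2
print(counter_b())  # 11, counter_b has its own enclosed count
```

Python exposes the enclosed variables on the function object through its __closure__ attribute, which makes it easy to check that count really does survive the call that created it.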