As you may know, Hadoop is a distributed system for counting words. Of course it is not, but the “Word Count” program is the widely accepted example of MapReduce. In fact it is applied so widely that many people feel the “Word Count” example is overused. Then again, it is a straightforward illustration of how MapReduce works. In this post I give some other ways of counting words; one of the examples is implemented with the Hadoop Streaming API and Node.js.
- Bash
find shakespeare/ -type f -exec cat {} \; | tr -cs 'A-Za-z' '\n' | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn
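Each stage of the pipeline does one small job: `tr -cs 'A-Za-z' '\n'` turns every run of non-letters into a single newline (one word per line), `tr 'A-Z' 'a-z'` lowercases, `sort` groups identical words, `uniq -c` counts each group, and `sort -rn` orders by frequency. On a tiny made-up input (the `shakespeare/` corpus above is just where the text comes from), it behaves roughly like this:

```shell
# Count words in a short phrase; output is "count word", most frequent first.
printf 'To be, or not to be' \
  | tr -cs 'A-Za-z' '\n' \
  | tr 'A-Z' 'a-z' \
  | sort | uniq -c | sort -rn
```

The exact ordering of ties (here, "to" vs. "be") can differ between `sort` implementations, but the counts are the same everywhere.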
- Node.js
#!/usr/bin/env node
// mapper.js: emit "<word>\t1" for every word read from stdin
process.stdin.resume();
process.stdin.setEncoding('utf8');
process.stdin.on('data', function (chunk) {
  var word_pattern = /[^a-zA-Z]/g;
  // lowercase, turn every non-letter into a newline, then split into words
  var tempArray = chunk.toLowerCase().replace(word_pattern, '\n').split('\n');
  for (var i = 0; i < tempArray.length; i++) {
    if (tempArray[i] !== '')
      process.stdout.write(tempArray[i] + "\t" + 1 + "\n");
  }
});
process.stdin.on('end', function () {});
#!/usr/bin/env node
// reducer.js: sum up the counts per word from "<word>\t<count>" lines
process.stdin.resume();
process.stdin.setEncoding('utf8');
var words = {};
process.stdin.on('data', function (chunk) {
  var tempArray = chunk.split('\n');
  for (var i = 0; i < tempArray.length - 1; i++) {
    var w = tempArray[i].split('\t');
    if (w.length > 0) {
      if (typeof words[w[0]] !== 'undefined' && words[w[0]] !== null) {
        words[w[0]] += parseInt(w[1], 10);
      } else {
        words[w[0]] = parseInt(w[1], 10);
      }
    }
  }
});
process.stdin.on('end', function () {
  for (var key in words) {
    if (words.hasOwnProperty(key)) {
      process.stdout.write(key + "\t" + words[key] + "\n");
    }
  }
});
Sample execution:
% find shakespeare/ -type f -exec cat {} \; | ./mapper.js | sort | ./reducer.js
Don’t forget the `| sort |` between mapper and reducer: it simulates the sort-and-shuffle phase that Hadoop performs before the reducer starts.
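To see why that sort matters, here is a hypothetical mapper output for the phrase “to be or not to be”. After sorting, all lines for the same word are adjacent, which is the grouping a streaming reducer relies on (Hadoop guarantees this grouping per reducer; locally, `sort` provides it):

```shell
# Unsorted mapper output piped through sort: identical keys become adjacent,
# so a reducer can sum each word's counts in a single sequential pass.
printf 'to\t1\nbe\t1\nor\t1\nnot\t1\nto\t1\nbe\t1\n' | sort
```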
- Node.js + Hadoop Streaming
% hadoop jar /usr/local/lib/hadoop/hadoop-0.20.2/contrib/streaming/ha*-streaming.jar -file mapper.js -file reducer.js -mapper mapper.js -reducer reducer.js -input shakespeare -output count_js