As you may know, Hadoop is a distributed system for counting words. Of course it is not, but the “Word Count” program is the widely accepted example of MapReduce. In fact it is applied so widely that many people feel the “Word Count” example is overused. Then again, it is a straightforward illustration of how MapReduce works. In this post I give some other examples of counting words. One of the examples is implemented with the Hadoop Streaming API and Node.js.
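Before reaching for any distributed machinery, the idea itself fits in a few lines. This is a minimal in-memory sketch (the variable names and sample text are my own, just for illustration): the "map" step emits a `(word, 1)` pair per word, and the "reduce" step sums the pairs that share a key.

```javascript
// Minimal in-memory sketch of the MapReduce idea behind "Word Count".
var text = "to be or not to be";

// Map phase: one (key, value) pair per word.
var pairs = text.toLowerCase().split(/[^a-z]+/)
  .filter(function (w) { return w !== ''; })
  .map(function (w) { return [w, 1]; });

// Reduce phase: sum the values that share a key.
var counts = {};
pairs.forEach(function (p) {
  counts[p[0]] = (counts[p[0]] || 0) + p[1];
});

console.log(counts); // { to: 2, be: 2, or: 1, not: 1 }
```

The Hadoop versions below do exactly this, except the map and reduce steps run as separate processes connected by sorted streams of text.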
- Bash
find shakespeare/ -type f -exec cat {} \; | tr -cs 'a-zA-Z' '\n' | tr 'A-Z' 'a-z' \
| sort | uniq -c | sort -rn
- Node.js
#!/usr/bin/env node
// mapper.js: read text from stdin, emit one "word<TAB>1" line per word.
process.stdin.resume();
process.stdin.setEncoding('utf8');
process.stdin.on('data', function (chunk) {
  // Turn every run of non-letters into a newline, then split into words.
  var word_pattern = /[^a-zA-Z]/g;
  var tempArray = chunk.toLowerCase().replace(word_pattern, '\n').split('\n');
  for (var i = 0; i < tempArray.length; i++) {
    if (tempArray[i] !== '')
      process.stdout.write(tempArray[i] + "\t" + 1 + "\n");
  }
});
process.stdin.on('end', function () {});
#!/usr/bin/env node
// reducer.js: read "word<TAB>count" lines from stdin, sum the counts per word.
process.stdin.resume();
process.stdin.setEncoding('utf8');
var words = {};
var buf = '';
process.stdin.on('data', function (chunk) {
  buf += chunk;
  var tempArray = buf.split('\n');
  buf = tempArray.pop(); // keep a possibly incomplete last line for the next chunk
  for (var i = 0; i < tempArray.length; i++) {
    var w = tempArray[i].split('\t');
    if (w.length > 1) {
      if (typeof words[w[0]] !== 'undefined' && words[w[0]] !== null) {
        words[w[0]] += parseInt(w[1], 10);
      } else {
        words[w[0]] = parseInt(w[1], 10);
      }
    }
  }
});
process.stdin.on('end', function () {
  // Emit the final "word<TAB>total" line for every word seen.
  for (var key in words) {
    if (words.hasOwnProperty(key)) {
      process.stdout.write(key + "\t" + words[key] + "\n");
    }
  }
});
Sample execution:
% find shakespeare/ -type f -exec cat {} \; | ./mapper.js | sort | ./reducer.js
Don’t forget the `| sort |` step between the mapper and the reducer: it simulates Hadoop’s shuffle-and-sort phase, which runs before the reducer starts.
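The reducer above keeps a hash of all words in memory, so it does not actually depend on the sorting. To see why shuffle-and-sort matters, here is an alternative reducer sketch (my own illustration, not part of the post's scripts) that *does* rely on sorted input: because equal keys arrive adjacently, it only ever holds the current word's running total.

```javascript
// Streaming reducer sketch: assumes input lines are sorted by key,
// as they are after the shuffle-and-sort phase.
var lines = ["be\t1", "be\t1", "not\t1", "to\t1", "to\t1"]; // sorted sample input
var current = null, sum = 0, out = [];

lines.forEach(function (line) {
  var parts = line.split('\t');
  var word = parts[0], count = parseInt(parts[1], 10);
  if (word === current) {
    sum += count;                                           // same key: accumulate
  } else {
    if (current !== null) out.push(current + '\t' + sum);   // key changed: flush
    current = word;
    sum = count;
  }
});
if (current !== null) out.push(current + '\t' + sum);       // flush the last key

console.log(out.join('\n'));
```

This is the shape real Hadoop reducers take: constant memory per key, which is what makes the reduce side scale.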
- Node.js + Hadoop Streaming
% hadoop jar /usr/local/lib/hadoop/hadoop-0.20.2/contrib/streaming/ha*-streaming.jar \
    -file mapper.js -file reducer.js \
    -mapper mapper.js -reducer reducer.js \
    -input shakespeare -output count_js