As you may know, Hadoop is a distributed system for counting words. Of course it is not, but the “Word Count” program is the widely accepted example of MapReduce. In fact it is applied so widely that many people feel the “Word Count” example is overused. Then again, it is a straightforward illustration of how MapReduce works. In this post I give some other ways of counting words; one of the examples is implemented with the Hadoop Streaming API and Node.js.
- Bash
find shakespeare/ -type f -exec cat {} \; | tr -cs 'A-Za-z' '\n' | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn
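Each stage of the pipeline does one small job: `tr -cs 'A-Za-z' '\n'` turns every run of non-letters into a single newline (one word per line), `tr 'A-Z' 'a-z'` lowercases, `sort` groups identical words, `uniq -c` counts each group, and `sort -rn` orders by frequency. On a tiny made-up input (the `shakespeare/` corpus above is just where the text comes from), it behaves roughly like this:

```shell
# Count words in a short phrase; output is "count word", most frequent first.
printf 'To be, or not to be' \
  | tr -cs 'A-Za-z' '\n' \
  | tr 'A-Z' 'a-z' \
  | sort | uniq -c | sort -rn
```

The exact ordering of ties (here, "to" vs. "be") can differ between `sort` implementations, but the counts are the same everywhere.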
- Node.js
#!/usr/bin/env node
// mapper.js: emit "<word>\t1" for every word read from stdin
process.stdin.resume();
process.stdin.setEncoding('utf8');
process.stdin.on('data', function (chunk) {
  var word_pattern = /[^a-zA-Z]/g;
  // lowercase, turn every non-letter into a newline, then split into words
  var tempArray = chunk.toLowerCase().replace(word_pattern, '\n').split('\n');
  for (var i = 0; i < tempArray.length; i++) {
    if (tempArray[i] !== '')
      process.stdout.write(tempArray[i] + "\t" + 1 + "\n");
  }
});
process.stdin.on('end', function () {});
#!/usr/bin/env node
// reducer.js: sum up the counts per word from "<word>\t<count>" lines
process.stdin.resume();
process.stdin.setEncoding('utf8');
var words = {};
process.stdin.on('data', function (chunk) {
  var tempArray = chunk.split('\n');
  for (var i = 0; i < tempArray.length - 1; i++) {
    var w = tempArray[i].split('\t');
    if (w.length > 0) {
      if (typeof words[w[0]] !== 'undefined' && words[w[0]] !== null) {
        words[w[0]] += parseInt(w[1], 10);
      } else {
        words[w[0]] = parseInt(w[1], 10);
      }
    }
  }
});
process.stdin.on('end', function () {
  for (var key in words) {
    if (words.hasOwnProperty(key)) {
      process.stdout.write(key + "\t" + words[key] + "\n");
    }
  }
});
Sample execution:
% find shakespeare/ -type f -exec cat {} \; | ./mapper.js | sort | ./reducer.js
Don’t forget the `| sort |` between mapper and reducer: it simulates the sort-and-shuffle phase that Hadoop performs before the reducer starts.
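To see why that sort matters, here is a hypothetical mapper output for the phrase “to be or not to be”. After sorting, all lines for the same word are adjacent, which is the grouping a streaming reducer relies on (Hadoop guarantees this grouping per reducer; locally, `sort` provides it):

```shell
# Unsorted mapper output piped through sort: identical keys become adjacent,
# so a reducer can sum each word's counts in a single sequential pass.
printf 'to\t1\nbe\t1\nor\t1\nnot\t1\nto\t1\nbe\t1\n' | sort
```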
- Node.js + Hadoop Streaming
% hadoop jar /usr/local/lib/hadoop/hadoop-0.20.2/contrib/streaming/ha*-streaming.jar -file mapper.js -file reducer.js -mapper mapper.js -reducer reducer.js -input shakespeare -output count_js