Hadoop: Counting Words

As you may know, Hadoop is a distributed system for counting words. Of course it is not, but the “Word Count” program is a widely accepted example of MapReduce. To be honest, it is used so widely that many people feel the “Word Count” example is overused. Then again, it is a straightforward example of how MapReduce works. In this post I give some other examples of counting words. One of the examples is implemented with the Hadoop Streaming API and Node.js.

  1. Bash
    find shakespeare/ -type f -exec cat {} \; | tr -cs 'a-zA-Z' '\n' | tr 'A-Z' 'a-z' \
    | sort | uniq -c | sort -rn
  2. Node.js
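    mapper.js:

    #!/usr/bin/env node
    process.stdin.resume();
    process.stdin.setEncoding('utf8');

    var rest = '';  // partial word carried over from the previous chunk

    // Emit one "word<TAB>1" line per word.
    function emit(word) {
      if (word !== '')
        process.stdout.write(word + "\t" + 1 + "\n");
    }

    process.stdin.on('data', function (chunk) {
      // Split on runs of non-letters; a chunk may end in the middle of a word.
      var tempArray = (rest + chunk).toLowerCase().split(/[^a-z]+/);
      rest = tempArray.pop();
      for (var i = 0; i < tempArray.length; i++) {
        emit(tempArray[i]);
      }
    });
    process.stdin.on('end', function () { emit(rest); });
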
    reducer.js:

    #!/usr/bin/env node
    process.stdin.resume();
    process.stdin.setEncoding('utf8');

    var words = {};  // word -> running total
    var rest = '';   // incomplete last line of the previous chunk

    process.stdin.on('data', function (chunk) {
      var tempArray = (rest + chunk).split('\n');
      rest = tempArray.pop();
      for (var i = 0; i < tempArray.length; i++) {
        var w = tempArray[i].split('\t');
        if (w.length === 2) {
          if (words.hasOwnProperty(w[0])) {
            words[w[0]] += parseInt(w[1], 10);
          } else { words[w[0]] = parseInt(w[1], 10); }
        }
      }
    });

    process.stdin.on('end', function () {
      for (var key in words) {
        if (words.hasOwnProperty(key)) {
          process.stdout.write(key + "\t" + words[key] + "\n");
        }
      }
    });

    Sample execution:

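    % find shakespeare/ -type f -exec cat {} \; | ./mapper.js | sort | ./reducer.js

    Both scripts need to be executable (chmod +x mapper.js reducer.js). Don’t forget the sort in the middle (| sort |): it stands in for the shuffle and sort phase that Hadoop runs between the mappers and the reducers.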
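
To see why the sort matters, here is a minimal sketch of a reducer that depends on it. Because sorted input puts all lines for the same word next to each other, such a reducer only has to remember the current word and its running total instead of a table of every word. The name reduceSorted and the sample input are made up for illustration; this is not the reducer.js shown above.

```javascript
// Hypothetical helper, not part of the original scripts.
// Sums "word\tcount" lines, assuming lines for the same word are adjacent,
// which is what sorting the mapper output provides.
function reduceSorted(lines) {
  var out = [];
  var current = null;  // word currently being summed
  var total = 0;
  for (var i = 0; i < lines.length; i++) {
    var parts = lines[i].split('\t');
    if (parts.length !== 2) continue;  // skip blank or malformed lines
    if (parts[0] !== current) {
      // A new word starts: flush the previous one.
      if (current !== null) out.push(current + '\t' + total);
      current = parts[0];
      total = 0;
    }
    total += parseInt(parts[1], 10);
  }
  if (current !== null) out.push(current + '\t' + total);
  return out;
}

// Sample mapper output (illustrative data only).
var mapped = ['to\t1', 'be\t1', 'or\t1', 'not\t1', 'to\t1', 'be\t1'];
console.log(reduceSorted(mapped.slice().sort()).join('\n'));
```

Without the sort, the same function would emit "to" and "be" twice each, once for every separate run of lines, which is the local equivalent of skipping the shuffle and sort phase.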

  3. Node.js + Hadoop Streaming
    % hadoop jar /usr/local/lib/hadoop/hadoop-0.20.2/contrib/streaming/ha*-streaming.jar \
    -file mapper.js -file reducer.js \
    -mapper mapper.js -reducer reducer.js \
    -input shakespeare -output count_js
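
Unlike the Bash pipeline, the Node.js reducer does not order its result by frequency. The following is a rough sketch of that last step, assuming the reducer output has already been read into a string of word<TAB>count lines. The helper topWords and the sample data are hypothetical and only serve as illustration.

```javascript
// Hypothetical helper: returns the n most frequent words from
// "word\tcount" lines, most frequent first (similar to `sort -rn | head`).
function topWords(text, n) {
  var rows = [];
  var lines = text.split('\n');
  for (var i = 0; i < lines.length; i++) {
    var parts = lines[i].split('\t');
    if (parts.length !== 2) continue;  // skip blank or malformed lines
    rows.push({ word: parts[0], count: parseInt(parts[1], 10) });
  }
  rows.sort(function (a, b) {
    // Descending by count; ties are broken alphabetically.
    if (b.count !== a.count) return b.count - a.count;
    return a.word < b.word ? -1 : (a.word > b.word ? 1 : 0);
  });
  return rows.slice(0, n);
}

// Illustrative data only, not real counts.
console.log(topWords('the\t3\nand\t5\nof\t3\n', 2));
```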