Reliably Store Postfix Logs in S3 with Apache Flume and rsyslog

Flume is a distributed system to aggregate log files into the Hadoop Distributed File System (HDFS). It has a simple design of Events, Sources, Sinks, and Channels which can be connected into a complex multi-hop architecture.

While Flume is designed to be resilient “with tunable reliability mechanisms for fail-over and recovery” in this blog post we’ll also look at the reliable forwarding of rsyslog, which we are going to use to store postfix logs in Amazon S3.

Continue reading “Reliably Store Postfix Logs in S3 with Apache Flume and rsyslog” →

Using the Cassandra Bulk Loader with Hadoop BulkOutputFormat

For the purpose of bulk-loading external data into a cluster Cassandra 0.8 introduced the sstableloader. The sstableloader streams a set of sstables into a live cluster without simply copying every table to every node, but only transfer the relevant part of the data to each node by conforming to the replication strategy of the cluster.

Using this tool with Hadoop makes a lot of sense since this should put less strain on the cluster, than having multiple Mappers or Reducers communicate with the cluster. Although the P2P concept of Cassandra should assist in such an approach, simple tests have shown that the throughput could be increased by 20-25% using the BulkOutputFormat, which makes use of the sstableloader, in Hadoop [Improved Hadoop output in Cassandra 1.1].

This article describes the use of the Cassandra BulkOutputFormat with Hadoop and provides the source code of a simple example. The sample application implements the word count example – what else – with Cassandra 1.1.6 and Hadoop 0.20.205 .

Continue reading “Using the Cassandra Bulk Loader with Hadoop BulkOutputFormat” →

Hadoop: Counting Words

As you may know, Hadoop is a distributed System for counting words. Of course it is not, but the “Word Count” program is a widely accepted example of MapReduce. To be true it is so widely applied, that many people feel that the “Word Count” example is overused. Than again it is a straightforward example of how MapReduce works. In this post I give some other examples of counting words. One of the example is implemented with Hadoop Streaming API and Node.js.

Bash

find shakespeare/ -type f -exec cat {} ; | tr -c '^a-zA-Z' 'n' | tr 'A-Z' 'a-z' 
| sort | uniq -c | sort -rn

Node.js

#!/usr/bin/env node
process.stdin.resume();
process.stdin.setEncoding('utf8');

process.stdin.on('data', function (chunk) {
  word_pattern = /[^a-zA-Z]/g;
  var tempArray = chunk.toLowerCase().replace(word_pattern, 'n').split('n');
  for (var i = 0; i < tempArray.length; i++) {
    if(tempArray[i] != '')
      process.stdout.write(tempArray[i] +"t" + 1 + "n");
  }
});
process.stdin.on('end', function () {});

#!/usr/bin/env node
process.stdin.resume();
process.stdin.setEncoding('utf8');

var words = {};
process.stdin.on('data', function (chunk) {
  var tempArray = chunk.split('n');
  for(var i = 0; i < tempArray.length-1; i++){
    var w = tempArray[i].split('t');
    if(w.length > 0 ){
      if(typeof(words[w[0]]) !== 'undefined' && words[w[0]] !== null){
        words[w[0]] += parseInt(w[1]);
      }else{ words[w[0]] = parseInt(w[1]);}
    }
  }
});

process.stdin.on('end', function () {
  for (var key in words) {
    if (words.hasOwnProperty(key)) {
      process.stdout.write(key + "t" + words[key] +"n");
    }
  }
});

Sample execution:

% find shakespeare/ -type f -exec cat {} ; | mapper.js | sort | reducer.js

Don’t forget to sort and shuffle, which is the phase of Hadoop before the reducer starts (| sort | ).

Node.js + Hadoop Streaming

% hadoop jar /usr/local/lib/hadoop/hadoop-0.20.2/contrib/streaming/ha*-streaming.jar 
-file mapper.js -file reducer.js 
-mapper mapper.js -reducer reducer.js 
-input shakespeare -output count_js