The HttpFS gateway is the preferred way of accessing the Hadoop filesystem with HTTP clients like curl. It can also be used from the hadoop fs command line tool, ultimately replacing the hftp protocol. Unlike HDFS Proxy, HttpFS has full support for all file operations, with additional support for authentication. Given its stateless protocol, it is ideal for scaling out Hadoop filesystem access with HTTP clients.
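As a quick sketch, HttpFS exposes the same REST API as WebHDFS, by default on port 14000. The host and user names below are placeholders, not values from an actual cluster:

```shell
HTTPFS_HOST="httpfs-host.example.com"   # assumption: your gateway host
HTTPFS_PORT=14000                       # default HttpFS port
URL="http://${HTTPFS_HOST}:${HTTPFS_PORT}/webhdfs/v1/user/hdfs?op=LISTSTATUS&user.name=hdfs"

# List a directory through the gateway (WebHDFS-compatible REST API).
curl -s "$URL" || true  # network call; fails outside the cluster

# On a kerberized cluster, authenticate via SPNEGO after a kinit:
curl -s --negotiate -u : \
  "http://${HTTPFS_HOST}:${HTTPFS_PORT}/webhdfs/v1/user/hdfs?op=LISTSTATUS" || true

# The same gateway works from the hadoop fs CLI via the webhdfs scheme:
# hadoop fs -ls webhdfs://httpfs-host.example.com:14000/user/hdfs
```

Because the REST calls are stateless, any HTTP load balancer can be put in front of several HttpFS instances.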
In this post I would like to show how to install and set up an HttpFS gateway on a secure, kerberized cluster. By covering some troubleshooting topics, this post should also help you when you run into problems while installing the gateway. Continue reading “Installing HttpFS Gateway on a Kerberized Cluster”
Today I signed the Reactive Manifesto, which is prominently backed by developers from companies like Netflix, Typesafe, Twitter, and Oracle. It is my strong belief that the ever-growing scale of data processing needs a coherent approach towards event-driven architectures to meet today’s demands.
In the same area I see projects like Kafka, around which a new spinoff out of LinkedIn, Confluent, was recently announced. I also count Spark, backed by Databricks and currently seeing a lot of attention, as an example of a new generation of reactive applications.
These are some of the resources that got my attention:
- Reactive Manifesto
- Reactive Streams
- Advanced Reactive Programming with Akka and Scala
- Introducing Actors Akka Notes Part 1
- Clustering reactmq with Akka Cluster
- Replacing ZeroMQ with RTI Context DDS in an Actor Based System
- Evaluating Persistent Replicated Message Queues
- Reactive Queue with Akka Reactive Streams
- Making the Reactive Queue Durable with Akka Persistence
- Scala and the Akka Event Bus
With HDP 2.2 on the verge of release, it is a good idea to begin a deep dive into the new features in Hadoop with this webinar series:
YARN is changing the face of Big Data as we know it today. Breaking with by now well-established patterns like MapReduce, YARN gives clients the ability to run diverse distributed algorithms on one cluster under one resource management layer. In addition, Apache Slider brings the possibility to ‘slide’ existing long-running services onto the same cluster under the same resource provider.
With the upcoming release of HDP 2.2, HBase services run under the management of YARN. The same will be true for Storm, getting us one step closer to the vision of an Enterprise Hadoop Data Lake. One important aspect of this scenario is yet to come: YARN-796, aka YARN labels.
Recently the technical preview of Hortonworks Data Platform 2.2 was released with a Sandbox image for download, giving you the possibility to try out the concepts of tomorrow’s Big Data platform today with these tutorials. You can also try some of the virtual environments I’ve put together here: hdp22-n1-centos6-puppet or hdp22-n3-centos6-puppet.
We are experiencing the dawn of a new era in Hadoop, Hadoop v2. Are you YARN ready? Here are some resources to get you going:
After writing about provisioning Hadoop clusters with Vagrant, I started a collection of cluster setups using the HDP distribution. The examples use different versions, operating systems, Vagrant providers, and node sizes. With Ambari blueprints, different scenarios can be provisioned with a simple command. With this post I would like to share these scripts on GitHub. In addition, with the advent of HDP 2.2, two examples using the technical preview version of HDP were added to the repository: https://github.com/hkropp/vagrant-hdp
The naming convention for each environment is as follows:
The environments can be set up and run with one simple command:
vagrant up && ./install_blueprint.sh
As a prerequisite, VirtualBox and Vagrant need to be installed.
The master_blueprint.json contains possible Ambari blueprint components and configurations.
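For illustration, registering a blueprint and instantiating a cluster boils down to two calls against the Ambari REST API, roughly what a script like install_blueprint.sh would do. The host name, blueprint name, cluster name, template file name, and credentials below are assumptions for your environment:

```shell
AMBARI_HOST="ambari.example.com"   # assumption: your Ambari server
BP_NAME="hdp-multinode"            # assumption: must match the name in the blueprint JSON
BP_URL="http://${AMBARI_HOST}:8080/api/v1/blueprints/${BP_NAME}"

# Register the blueprint with Ambari (default admin/admin credentials assumed):
curl -s -u admin:admin -H "X-Requested-By: ambari" \
  -X POST -d @master_blueprint.json "$BP_URL" || true  # network call

# Instantiate a cluster from it, using a host-mapping template
# (cluster_template.json is a hypothetical file name):
curl -s -u admin:admin -H "X-Requested-By: ambari" \
  -X POST -d @cluster_template.json \
  "http://${AMBARI_HOST}:8080/api/v1/clusters/hdp-cluster" || true  # network call
```

The host-mapping template ties the host groups defined in the blueprint to the concrete Vagrant nodes, which is why the same blueprint can serve cluster layouts of different sizes.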
Examples on GitHub: https://github.com/hkropp/vagrant-hdp
The upcoming release of HDP 2.2 will contain some important forward-facing changes to the Hadoop platform. Together with partners, Hortonworks is shaping the future of Big Data in an open community. Looking at some of the key new features, we get a pretty clear picture of what the future of Hadoop is going to look like. The quickest way to get started now is to download the HDP Sandbox here. Continue reading “Tech. Preview: HDP 2.2”
Apache Argus, the Apache open source project with its comprehensive security offering for today’s Hadoop installations, is likely to become an important cornerstone of modern enterprise Big Data architectures. It is already quite sophisticated compared to other product offerings.
Key aspects of Argus are Administration, Authorization, and Audit Logging, covering most security demands. In the future we might even see Data Protection (encryption) as well.
Argus consists of four major components that, tied together, build a secure layer around your Hadoop installation. Within Argus it is the Administration Portal, a web application, that manages and accesses the Audit Server and the Policy Manager, two further important components of Apache Argus. On the client side, at Hadoop services like HiveServer2 or the NameNode, Argus installs specific agents that evaluate requests against the specified policies.
A key aspect of Argus is that the clients don’t have to query the Policy Server on every single call, but are updated at a certain interval. This improves scalability and also ensures that clients continue working even when the Policy Server is down.
Let’s go ahead and install the most recent version of Apache Argus using the HDP 2.1 Sandbox. By installing the Policy Manager and the Hive and HDFS agents, you should get a pretty good idea of how Argus operates, and a pretty solid environment to test specific use cases.
In this part we’ll only install the Argus Policy Manager, synced with our OpenLDAP installation for user and group management. We will use our kerberized HDP Sandbox throughout this post. Continue reading “Securing Your Datalake With Apache Argus – Part 1”