Using the Cassandra Bulk Loader with Hadoop BulkOutputFormat

For the purpose of bulk-loading external data into a cluster Cassandra 0.8 introduced the sstableloader. The sstableloader streams a set of sstables into a live cluster without simply copying every table to every node, but only transfer the relevant part of the data to each node by conforming to the replication strategy of the cluster.

Using this tool with Hadoop makes a lot of sense since this should put less strain on the cluster, than having multiple Mappers or Reducers communicate with the cluster. Although the P2P concept of Cassandra should assist in such an approach, simple tests have shown that the throughput could be increased by 20-25% using the BulkOutputFormat, which makes use of the sstableloader, in Hadoop [Improved Hadoop output in Cassandra 1.1].

This article describes the use of the Cassandra BulkOutputFormat with Hadoop and provides the source code of a simple example. The sample application implements the word count example – what else – with Cassandra 1.1.6 and Hadoop 0.20.205 .

Continue reading “Using the Cassandra Bulk Loader with Hadoop BulkOutputFormat”

Hadoop: Counting Words

As you may know, Hadoop is a distributed System for counting words. Of course it is not, but the “Word Count” program is a widely accepted example of MapReduce. To be true it is so widely applied, that many people feel that the “Word Count” example is overused. Than again it is a straightforward example of how MapReduce works. In this post I give some other examples of counting words. One of the example is implemented with Hadoop Streaming API and Node.js.

  1. Bash
  2. Node.js


    Sample execution:

    Don’t forget to sort and shuffle, which is the phase of Hadoop before the reducer starts ( | sort | ).
  3. Node.js + Hadoop Streaming

Using Self-Signed Certificates with Java and Maven

JAVA applications using JSSE (Java Secure Socket Extension) can’t connect to servers with self-signed or untrusted certificates by default. Maven for example is not able to download required dependencies from a nexus server, if that uses a self-signed certificate or the certificate authority is not recognized. If you try to connect to a server of that kind a security ValidatorException will be thrown: