For the purpose of bulk-loading external data into a cluster Cassandra 0.8 introduced the
sstableloader streams a set of
sstables into a live cluster without simply copying every table to every node, but only transfer the relevant part of the data to each node by conforming to the replication strategy of the cluster.
Using this tool with Hadoop makes a lot of sense since this should put less strain on the cluster, than having multiple Mappers or Reducers communicate with the cluster. Although the P2P concept of Cassandra should assist in such an approach, simple tests have shown that the throughput could be increased by 20-25% using the
BulkOutputFormat, which makes use of the
sstableloader, in Hadoop [Improved Hadoop output in Cassandra 1.1].
This article describes the use of the Cassandra
BulkOutputFormat with Hadoop and provides the source code of a simple example. The sample application implements the word count example – what else – with Cassandra 1.1.6 and Hadoop 0.20.205 .