Hive Streaming with Storm

With the release of Hive 0.13.1 and HCatalog, a new Streaming API was released as a Technical Preview to support continuous data ingestion into Hive tables. This API is intended to support streaming clients like Flume or Storm to better store data in Hive, which traditionally has been a batch oriented storage.

Based on the newly given ACID insert/update capabilities of Hive, the Streaming API is breaking down a stream of data into smaller batches which get committed in a transaction to the underlying storage. Once committed the data becomes immediately available for other queries.

Broadly speaking the API consists of two parts. One part is handling the transaction while the other is dealing with the underlying storage (HDFS). Transactions in Hive are handled by the the Metastore. Kerberos is supported from the beginning!

Some of the current limitations are:

  • Only delimited input data and JSON (strict syntax) are supported
  • Only ORC support
  • Hive table must be bucketed (unpartitioned tables are supported)

In this post I would like to demonstrate the use of a newly created Storm HiveBolt that makes use of the streaming API and is quite straightforward to use. The source of the here described example is provided at GitHub. To run this demo you would need a HDP 2.2 Sandbox, which can be downloaded for various virtualization environments here. Continue reading “Hive Streaming with Storm”

HDP Ansible Playbook Example

In my existing collection of automated install scripts for HDP I always try to extend it with further examples of different provisioners, providers, and settings. Recently I added with hdp22-n1-centos6-ansible an example Ansible environment for preparing a HDP 2.2 installation on one node.

Ansible differs from other provisioners like Puppet or Chef by a simplified approach in dependence on SSH. It behaves almost as a distributed shell putting little dependencies on existing hosts. Where for example Puppet makes strong assumptions about the current state of a system with one or multiple nodes, does Ansible more or less reflect a collection of tasks a system gone through to reach it’s current state. While some celebrate Ansible for it’s simplicity do others abandon it for it’s lack of strong integrity.

In this post I would like to share a sample Ansible Playbook to prepare a HDP 2.2 Amabri installation using Vagrant with Virtualbox. You can download and view the in this post discussed example here. Continue reading “HDP Ansible Playbook Example”