10 Resources About Storm
The event processing framework Apache Storm is the preferred approach for real-time Big Data. In use at large companies around the world, it proves its maturity at scale every day. This post collects some of the resources helpful in understanding what Storm is, and also includes some sources that highlight the special relationship Storm enjoys with Kafka. A minimal topology sketch follows the resource list.
- Storm: Distributed and Fault-Tolerant Real-time Computation
- Evaluating persistent, replicated message queues (updated w/ Kafka)
- Apache Storm vs. Spark Streaming
- Storm and Spark at Yahoo: Why Chose One Over the Other
- Common Topology Patterns
- Trident Tutorial
- Trident-ML
- Storm-R
- Storm-Pattern
- ZooKeeper Curator Client & Hint
- Hortonworks Storm as part of HDP
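To give a first impression of the programming model, here is a minimal word-count topology. This is a sketch using the classic backtype.storm API of that era; it reuses Storm's bundled TestWordSpout, and all topology and field names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class WordCountTopology {

    // A bolt that keeps a running count per word and emits (word, count) pairs.
    public static class CountBolt extends BaseBasicBolt {
        private final Map<String, Integer> counts = new HashMap<String, Integer>();

        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getString(0);
            Integer count = counts.get(word);
            count = (count == null) ? 1 : count + 1;
            counts.put(word, count);
            collector.emit(new Values(word, count));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // TestWordSpout ships with Storm and emits random words on the field "word".
        builder.setSpout("words", new TestWordSpout(), 1);
        builder.setBolt("counter", new CountBolt(), 2)
               .fieldsGrouping("words", new Fields("word"));

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count", new Config(), builder.createTopology());
        Utils.sleep(10000);
        cluster.shutdown();
    }
}
```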
Azure VNet for Your HDP Cluster
In a series of blog posts I demonstrated how to create a custom OS image for automatic provisioning of HDP with Vagrant on the Azure cloud. On GitHub I share the result of a first layout, alongside other provisioning scripts. Until now the setup was done without the proper network configuration required for communication between the individual components of the cluster.
With the new release of the vagrant-azure plugin it is possible to set up the cluster in a dedicated VNet. This is the last missing piece in the series of work I published to allow the automated provisioning of HDP on Azure. Unfortunately it is not quite complete, as the current Ruby SDK for Azure does not allow passing IP addresses to the machines. For now we therefore have to create the host entries by hand. It might be possible to set up a DNS or use Puppet to conduct the host mapping in an automated fashion, but I at least was not able to do so as part of this work.
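A minimal sketch of the relevant Vagrantfile part, with option names as documented for the vagrant-azure 1.x plugin; the VM, cloud service, and VNet names are placeholders of mine, not values from the original post:

```ruby
# Vagrantfile (excerpt) -- a sketch, assuming vagrant-azure 1.x option names
Vagrant.configure('2') do |config|
  config.vm.box = 'azure'

  config.vm.provider :azure do |azure, override|
    azure.vm_name                 = 'hdp-master-01'   # placeholder
    azure.cloud_service_name      = 'hdp-cluster'     # placeholder
    azure.vm_virtual_network_name = 'hdp-vnet'        # place the VM into the dedicated VNet
    azure.vm_location             = 'West Europe'

    override.ssh.username = 'vagrant'
  end
end
```

Continue reading “Azure VNet for Your HDP Cluster”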
Ambari Augeas Module and Configuration
Operating clusters, whether large or small, requires automation, and so does installation. Going repeatedly through installation instructions, typing the same commands over and over again, is really no fun. Worse, it's a waste of resources. The same holds true for keeping the configuration of 10 or 100 nodes in sync. A first step is to use a distributed shell; the next is to turn to tools like Puppet, Cobbler, Ansible, or Chef, which promise full automation of cluster operations.
As with most automation processes, these tools tend to make the common parts easy and the not-so-common steps hard. The fact that every setup is different requires concepts of adaptability that are easy to use. One such tool for adapting configuration files is Augeas. Although hard to understand at times, Augeas is easy to use, as it turns arbitrary configuration files into trees on which typical tree operations can be performed.
This post highlights the possibilities of using Augeas in a Puppet setup to install and configure an HDP Ambari cluster. Ambari agents are part of every node in a cluster, so when adding new nodes to a cluster in an automated fashion you want to be prepared.
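As a taste of the tree model, here is a hedged augtool sketch that points an Ambari agent at its server by editing ambari-agent.ini; the choice of lens and the hostname are assumptions of mine, not taken from the module:

```sh
# ambari-agent.ini is ini-style, so the generic ini lens (Puppet.lns) can parse it
augtool --noautoload <<'EOF'
set /augeas/load/AmbariAgent/lens Puppet.lns
set /augeas/load/AmbariAgent/incl /etc/ambari-agent/conf/ambari-agent.ini
load
set /files/etc/ambari-agent/conf/ambari-agent.ini/server/hostname ambari.example.com
save
EOF
```

Continue reading “Ambari Augeas Module and Configuration”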
Automated Ambari Install with Existing Database
Unless specified differently, Ambari server is installed together with a PostgreSQL database for its metadata and a MySQL database for the Hive Metastore. A Derby database is also installed for storing the scheduling information of Oozie workflows. This post covers the installation of Ambari server with an existing database. In support of an automated install, the configuration of the Hive Metastore and Oozie through Ambari Blueprints is also provided.
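For reference, a silent setup against an existing PostgreSQL instance might look roughly like the sketch below; the flags are those of ambari-server's silent setup as I recall them, and host and credentials are placeholders:

```sh
# point Ambari at an existing PostgreSQL instead of the bundled one
ambari-server setup -s \
  --database=postgres \
  --databasehost=db.example.com \
  --databaseport=5432 \
  --databasename=ambari \
  --databaseusername=ambari \
  --databasepassword=secret

# register the JDBC driver Hive needs for a MySQL-backed Metastore
ambari-server setup --jdbc-db=mysql \
  --jdbc-driver=/usr/share/java/mysql-connector-java.jar
```

Continue reading “Automated Ambari Install with Existing Database”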
Hive Streaming with Storm
With the release of Hive 0.13.1 and HCatalog, a new Streaming API was released as a Technical Preview to support continuous data ingestion into Hive tables. This API is intended to support streaming clients like Flume or Storm in storing data in Hive, which traditionally has been a batch-oriented store.
Based on the newly added ACID insert/update capabilities of Hive, the Streaming API breaks a stream of data down into smaller batches, which get committed in a transaction to the underlying storage. Once committed, the data becomes immediately available to other queries.
Broadly speaking, the API consists of two parts. One part handles the transactions while the other deals with the underlying storage (HDFS). Transactions in Hive are handled by the Metastore. Kerberos is supported from the beginning! A short sketch of the API follows the list of limitations below.
Some of the current limitations are:
- Only delimited input data and JSON (strict syntax) are supported
- Only ORC support
- The Hive table must be bucketed (unpartitioned tables are, however, supported)
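To make the two parts tangible, here is a minimal sketch of the core API from hive-hcatalog-streaming; the metastore URI, table, and fields are illustrative and assume an existing bucketed, ORC-backed table:

```java
import org.apache.hive.hcatalog.streaming.DelimitedInputWriter;
import org.apache.hive.hcatalog.streaming.HiveEndPoint;
import org.apache.hive.hcatalog.streaming.StreamingConnection;
import org.apache.hive.hcatalog.streaming.TransactionBatch;

public class StreamingSketch {
    public static void main(String[] args) throws Exception {
        // The endpoint names the metastore plus the target database/table.
        HiveEndPoint endPoint = new HiveEndPoint(
                "thrift://sandbox.hortonworks.com:9083", "default", "stock_prices", null);
        StreamingConnection connection = endPoint.newConnection(true); // create partition if needed

        // The writer handles the storage side: delimited records mapped onto columns.
        String[] fields = {"day", "symbol", "price"};
        DelimitedInputWriter writer = new DelimitedInputWriter(fields, ",", endPoint);

        // The transaction side: fetch a batch of transactions from the metastore.
        TransactionBatch txnBatch = connection.fetchTransactionBatch(10, writer);
        txnBatch.beginNextTransaction();
        txnBatch.write("2014-12-01,AAPL,115.07".getBytes());
        txnBatch.commit(); // data becomes visible to other queries once committed
        txnBatch.close();
        connection.close();
    }
}
```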
In this post I would like to demonstrate the use of the newly created Storm HiveBolt, which makes use of the Streaming API and is quite straightforward to use. The source of the example described here is provided on GitHub. To run this demo you need an HDP 2.2 Sandbox, which can be downloaded for various virtualization environments here.
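Wiring the HiveBolt (from the storm-hive module) into a topology might look roughly like this; it reuses Storm's TestWordSpout for brevity, so the column and table names are illustrative rather than those of the linked example:

```java
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

import org.apache.storm.hive.bolt.HiveBolt;
import org.apache.storm.hive.bolt.mapper.DelimitedRecordHiveMapper;
import org.apache.storm.hive.common.HiveOptions;

public class HiveBoltWiring {
    public static void main(String[] args) {
        // Map the tuple field "word" onto the Hive column of the same name.
        DelimitedRecordHiveMapper mapper = new DelimitedRecordHiveMapper()
                .withColumnFields(new Fields("word"));

        HiveOptions options = new HiveOptions(
                "thrift://sandbox.hortonworks.com:9083", "default", "words", mapper)
                .withTxnsPerBatch(2)   // transactions per batch fetched from the metastore
                .withBatchSize(100)    // tuples written per transaction
                .withIdleTimeout(10);  // seconds before idle connections are closed

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("word-spout", new TestWordSpout());
        builder.setBolt("hive-bolt", new HiveBolt(options), 1)
               .shuffleGrouping("word-spout");
    }
}
```

Continue reading “Hive Streaming with Storm”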
HDP Ansible Playbook Example
I continually try to extend my existing collection of automated install scripts for HDP with further examples of different provisioners, providers, and settings. Recently, with hdp22-n1-centos6-ansible, I added an example Ansible environment for preparing an HDP 2.2 installation on one node.
Ansible differs from other provisioners like Puppet or Chef through a simplified approach that relies on SSH. It behaves almost like a distributed shell, placing few dependencies on existing hosts. Where Puppet, for example, makes strong assumptions about the current state of a system of one or multiple nodes, Ansible more or less reflects the collection of tasks a system goes through to reach its current state. While some celebrate Ansible for its simplicity, others abandon it for its lack of strong integrity.
In this post I would like to share a sample Ansible Playbook to prepare an HDP 2.2 Ambari installation using Vagrant with VirtualBox. You can download and view the example discussed in this post here.
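A stripped-down playbook for this kind of node preparation might look like the following sketch; the task list and the repository URL are assumptions of mine, not the exact content of the linked example:

```yaml
---
# playbook.yml -- prepare a CentOS 6 node for an Ambari-driven HDP install (sketch)
- hosts: all
  sudo: yes
  tasks:
    - name: install ntp
      yum: name=ntp state=present

    - name: ensure ntpd is running
      service: name=ntpd state=started enabled=yes

    - name: stop and disable iptables
      service: name=iptables state=stopped enabled=no

    - name: download the Ambari repository definition
      get_url:
        url: http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.0.0/ambari.repo  # illustrative
        dest: /etc/yum.repos.d/ambari.repo

    - name: install ambari-server
      yum: name=ambari-server state=present
```

Continue reading “HDP Ansible Playbook Example”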
Sliding Apache Cassandra Onto YARN
With the most recent release of Hadoop (2.6) comes support for long-running applications on YARN. Apache Slider is a tool that supports you in creating, managing, and monitoring long-running applications, without necessarily changing anything about the way your application works. In a previous blog post I went through the different aspects of long-running applications that Slider tries to resolve. You might also consider watching this webinar about using Slider.
With the release of my Slider demo app, which uses Apache Cassandra and is available here, I would like to walk through some of the packaging steps required to make it run on a YARN cluster. For this example I use a three-node test cluster, which you can easily set up with this script using Vagrant and VirtualBox.
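The overall workflow follows Slider's usual package-then-create pattern; a sketch with illustrative names (the package and instance names below are not those of the demo app):

```sh
# upload the application package to HDFS so YARN can localize it
slider package --install --name CASSANDRA --package cassandra-app-package.zip

# create and start an application instance from the two templates
slider create cassandra1 --template appConfig.json --resources resources.json

# check on the running instance
slider status cassandra1
```

Continue reading “Sliding Apache Cassandra Onto YARN”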
Try Now: HDP 2.2 on Windows Azure
HDInsight, the Hadoop cloud offering from Windows Azure, is a great way to use Big-Data-as-a-Service solutions, but there is more. With the general availability of HDP 2.2 announced this week, it is a great opportunity to extend the existing HDP Vagrant collection with the Windows Azure provider. In this blog post I want to demonstrate the steps needed to quickly set up a six-node Hadoop cluster using the provided script. Except for preliminary setup steps, it only takes a little adjustment of the Vagrantfile and two commands to set up the whole cluster.
Our six-node cluster will consist of two master nodes, three data nodes, and one edge node with the Apache Knox gateway installed among other client libraries. Let's jump right in.
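Once the Vagrantfile is adjusted, the remaining steps plausibly boil down to something like this sketch, with the plugin install counting as one-time preparation:

```sh
# one-time preparation: install the Azure provider plugin
vagrant plugin install vagrant-azure

# bring up all nodes of the cluster in Azure
vagrant up --provider=azure
```

Continue reading “Try Now: HDP 2.2 on Windows Azure”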
Creating a HDP Ready CentOS Image for Azure
There already exist some collections of VM images in the Windows Azure gallery. Additionally, on VM Depot the community can share custom provisioned images. Looking for an HDP-ready image based on CentOS, I could not find one suitable for my needs. In this post I would like to describe how I created my first HDP-ready image based on CentOS for Windows Azure. The image will be created within Azure itself, as I don't have access to Hyper-V and my attempts to create a CentOS-based VHD with VirtualBox failed for unknown reasons.
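With the classic Azure tooling of the time, capturing a finished VM as a reusable image went roughly like this sketch; VM and image names are placeholders:

```sh
# inside the running CentOS VM: generalize it for capture
sudo waagent -deprovision+user -force

# from a workstation with the classic azure xplat CLI installed
azure vm shutdown hdp-centos-build
azure vm capture hdp-centos-build hdp-centos-image --delete
```

Continue reading “Creating a HDP Ready CentOS Image for Azure”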

