10 Resources About Storm

The event processing framework Apache Storm is the preferred approach for real-time Big Data. In use at large companies around the world, it proves its maturity every day at scale. This post collects some of the resources helpful in understanding what Storm is, and also includes some sources that highlight the special relationship it enjoys with Kafka.

  1. Storm: Distributed and Fault-Tolerant Real-time Computation
  2. Evaluating persistent, replicated message queues (updated w/ Kafka)
  3. Apache Storm vs. Spark Streaming
  4. Storm and Spark at Yahoo: Why Choose One Over the Other
  5. Common Topology Patterns
  6. Trident Tutorial
  7. Trident-ML
  8. Storm-R
  9. Storm-Pattern
  10. ZooKeeper Curator Client & Hint
  11. Hortonworks Storm as part of HDP

Continue reading “10 Resources About Storm”

Azure VNet for Your HDP Cluster

In a series of blog posts I demonstrated how to create a custom OS image for the automated provisioning of HDP with Vagrant on the Azure cloud. On GitHub I share the result of a first layout, along with other provisioning scripts. Until now, the setup lacked the proper network configuration required for the individual components of the cluster to communicate.

With the new release of the vagrant-azure plugin it will be possible to set up the cluster in a dedicated VNet. This is the last missing piece in the series of work I published to allow the automated provisioning of HDP on Azure. Unfortunately this is not quite true, as the current Ruby SDK for Azure does not allow passing IP addresses to the machines. We therefore currently have to create the host entries by hand, as sketched below. It could be possible to set up a DNS or use Puppet to conduct the host mapping in an automated fashion, but I at least was not able to do so as part of this work.

Continue reading “Azure VNet for Your HDP Cluster”
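To illustrate what that manual host mapping amounts to, here is a minimal sketch that renders the required /etc/hosts entries from a host-to-IP mapping. The hostnames and addresses are made-up examples; substitute whatever Azure actually assigned to your VMs.

```python
# Sketch: render /etc/hosts entries for the cluster nodes.
# The FQDNs and IP addresses below are hypothetical placeholders.
cluster = {
    "10.0.0.4": "master.hdp.example.com",
    "10.0.0.5": "worker1.hdp.example.com",
    "10.0.0.6": "worker2.hdp.example.com",
}

def hosts_entries(nodes):
    """Format one /etc/hosts line per node: IP, FQDN, short alias."""
    lines = []
    for ip, fqdn in sorted(nodes.items()):
        alias = fqdn.split(".")[0]
        lines.append(f"{ip}\t{fqdn}\t{alias}")
    return "\n".join(lines)

if __name__ == "__main__":
    # Append this output to /etc/hosts on every node of the cluster.
    print(hosts_entries(cluster))
```

The same mapping has to be present on every node, which is exactly why a DNS or a configuration management tool would be the better long-term answer.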

Ambari Augeas Module and Configuration

Operating clusters of any size, large or small, needs automation, and so does installation. Going repeatedly through installation instructions, typing the same commands over and over again, is really no fun. Worse, it’s a waste of resources. The same holds true for keeping the configuration of 10 or 100 nodes in sync. A first step is to use a distributed shell; the next is to turn to tools like Puppet, Cobblerd, Ansible, or Chef, which promise full automation of cluster operations.

As with most automation processes, these tools tend to make the common parts easy and the not-so-common steps hard. The fact that every setup is different requires adaptable concepts that are easy to use. One such tool for adapting configuration files is Augeas. Although hard to understand at times, Augeas is easy to use, as it turns arbitrary configuration files into trees on which typical tree operations can be performed.
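As a small illustration of that tree model, here is a sketch using the python-augeas bindings against the stock Hosts lens: /etc/hosts is read and edited through tree paths instead of text munging. The IP and hostname are examples, and saving requires write access to the file.

```python
import augeas

# Open Augeas; it parses known config files into a tree under /files.
aug = augeas.Augeas()

# Every line of /etc/hosts becomes a numbered subtree with
# ipaddr, canonical, and alias nodes (comments show up separately).
for entry in aug.match("/files/etc/hosts/*"):
    ip = aug.get(entry + "/ipaddr")
    name = aug.get(entry + "/canonical")
    if ip and name:
        print(ip, name)

# Editing is a tree operation too: append a new entry ...
aug.set("/files/etc/hosts/01/ipaddr", "10.0.0.4")
aug.set("/files/etc/hosts/01/canonical", "master.hdp.example.com")

# ... and only save() serializes the tree back into the file.
aug.save()
```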

This post highlights the possibilities of using Augeas in a Puppet setup to install and configure an HDP Ambari cluster. Ambari agents are part of every node in a cluster, so when adding new nodes to a cluster in an automated fashion you would want to be prepared.

Continue reading “Ambari Augeas Module and Configuration”
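Puppet’s augeas resource type works the same way under the hood. As a rough sketch of the kind of change such a setup automates, the following uses python-augeas to point an agent at its Ambari server. The file path, the server hostname, and the lens reuse (borrowing Puppet.lns for the INI-style agent file) are assumptions for illustration, not taken from the post.

```python
import augeas

# NO_MODL_AUTOLOAD: skip parsing every known file, we only need one.
aug = augeas.Augeas(flags=augeas.Augeas.NO_MODL_AUTOLOAD)

# ambari-agent.ini is INI-style; no dedicated lens ships for it, so we
# borrow the Puppet lens, which parses simple key=value sections
# (an assumption that holds for this file's format).
aug.transform("Puppet.lns", "/etc/ambari-agent/conf/ambari-agent.ini")
aug.load()

# Point the agent at the Ambari server (hostname is an example).
aug.set("/files/etc/ambari-agent/conf/ambari-agent.ini/server/hostname",
        "ambari.hdp.example.com")
aug.save()
```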

Automated Ambari Install with Existing Database

If not specified otherwise, Ambari server will be installed together with a PostgreSQL database for its metadata, along with a MySQL database for the Hive Metastore. A Derby database will also be installed for storing the scheduling information of Oozie workflows. This post covers the installation of Ambari server with an existing database. In support of an automated install with Ambari Blueprints, the configuration of the Hive Metastore and Oozie will also be provided.

Continue reading “Automated Ambari Install with Existing Database”
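As a rough sketch of where such settings end up, the following registers a blueprint whose hive-site points the Metastore at an existing MySQL database via Ambari’s REST API. The endpoint, credentials, blueprint name, and database host are placeholders, and the blueprint is trimmed to the relevant part.

```python
import json
import requests

# Hypothetical endpoint and credentials -- adjust for your cluster.
AMBARI = "http://ambari.hdp.example.com:8080/api/v1"
AUTH = ("admin", "admin")

# Trimmed blueprint: only the hive-site override for an existing
# MySQL Metastore is shown; host_groups would list the real layout.
blueprint = {
    "configurations": [{
        "hive-site": {
            "javax.jdo.option.ConnectionURL":
                "jdbc:mysql://db.hdp.example.com/hive",
            "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
            "javax.jdo.option.ConnectionUserName": "hive",
            "javax.jdo.option.ConnectionPassword": "secret",
        }
    }],
    "host_groups": [{
        "name": "master",
        "components": [{"name": "HIVE_METASTORE"}, {"name": "HIVE_SERVER"}],
        "cardinality": "1",
    }],
    "Blueprints": {"stack_name": "HDP", "stack_version": "2.2"},
}

# Ambari's REST API requires the X-Requested-By header on writes.
resp = requests.post(AMBARI + "/blueprints/hive-existing-db",
                     auth=AUTH,
                     headers={"X-Requested-By": "ambari"},
                     data=json.dumps(blueprint))
resp.raise_for_status()
```

A second POST, to /api/v1/clusters/&lt;name&gt; with a host mapping that references this blueprint, would then trigger the actual install.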