Next week Hadoop Summit 2015 starts in Brussels. A week packed with interesting talks, aligned with diverse community events to connect with amazing people from the field, makes for a fun Hadoop event ahead. The “Birds of a Feather Sessions” and the “Hadoop Crash Course” are great opportunities for beginners. Continue reading “My Agenda for Hadoop Summit 2015 in Brussels”
Apache Kafka: Queuing for Hadoop
Apache Kafka is a distributed system designed for streams. It is often categorized as a messaging system, and it serves a similar role, but it provides a fundamentally different abstraction. The key abstraction of Kafka to keep in mind is a structured commit log of events, with events being any kind of system-, user-, or machine-emitted data (a minimal producer sketch follows the list below). Kafka is built to be:
- Fault-tolerant
- High-throughput
- Horizontally scalable
- Capable of geographically distributed data streams and processing
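To make the commit-log abstraction more concrete, here is a minimal sketch of a producer appending events to a topic, assuming the Java producer API introduced with Kafka 0.8.2. The broker address, topic name, and record contents are hypothetical placeholders, not taken from the post.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object EventProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // Hypothetical broker address; replace with your cluster's bootstrap servers.
    props.put("bootstrap.servers", "broker1:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Each record is appended to the partitioned commit log of the "events" topic.
    producer.send(new ProducerRecord[String, String]("events", "user-42", "purchase:order-1001"))
    producer.close()
  }
}
```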
A constantly growing share of the data generated at today’s companies is event data. While there is a trend to combine machine-generated data under the umbrella term Internet of Things (IoT), it is crucial to understand that business itself is inherently event-driven.
A purchase, a customer claim, or a registration are just examples of such events. Business is interactive. When analyzing this data, time matters: most of it has its highest value when analyzed close to or even in real time.
Apache Kafka was created to solve two main problems that arise from the ever-increasing demand for stream data processing. First, designed for reliability, Kafka is capable of scaling with the growing volume of events passing through it. Second, Kafka lets various applications and platforms consume the same events, which helps to orchestrate today’s complex architectures by providing a central message hub for every system. Today, chances are that all that data will end up in Hadoop for further or even real-time analysis, making Kafka a queue to Hadoop. Continue reading “Apache Kafka: Queuing for Hadoop”
Spark Streaming – A Simple Example
Stream data processing has become an inherent part of a modern data architecture built on top of Hadoop. Big Data applications need to act on data being ingested at a high rate and volume in real time. Sensor data, logs, and other events likely have the most value when analyzed at the time they are emitted – in real time.
For over six years Apache Storm has matured at hundreds of global companies into the preferred stream processing engine on top of Hadoop. More recently, Spark Streaming was published, built around Spark‘s Resilient Distributed Datasets (RDDs). Constructed with the same concepts as Spark, an in-memory batch compute engine, Spark Streaming offers the clear advantage of bringing batch and real-time processing closer together, as ideally the same code base can be leveraged for both.
Spark Streaming is hence a so-called micro-batching framework that operates on timed intervals. It uses DStreams (Discretized Streams) that structure computation as small sets of short, stateless, and deterministic tasks. State is distributed and stored in fault-tolerant RDDs. A DStream can be built from various data sources such as Kafka, Flume, or HDFS and offers many of the same operations available for RDDs, plus operations typical for time-based processing such as sliding windows.
In this post we look at a fairly basic example of using Spark Streaming: we will listen to a server emitting text line by line. Continue reading “Spark Streaming – A Simple Example”
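As a preview of the kind of code the post walks through, here is a minimal sketch (not the post’s exact listing) of a Spark Streaming job that reads lines from a socket server and counts words over a sliding window. The host, port, batch interval, and window sizes are assumptions for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair operations on older Spark 1.x releases

object LineCount {
  def main(args: Array[String]): Unit = {
    // Local mode for testing; on a cluster the master is set by spark-submit.
    val conf = new SparkConf().setAppName("LineCount").setMaster("local[2]")
    // Micro-batches of 5 seconds.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Hypothetical host/port of the line-emitting server.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      // Count words over a 30-second window, sliding every 5 seconds.
      .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(5))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```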
Distcp between two HA Cluster
With HDFS High Availability two nodes can act as the NameNode of the system, but not at the same time. Only one of the nodes is the active NameNode at any point in time, while the other is in standby state. The standby node acts as a slave, preserving enough state to take over immediately when the active node dies. In that it differs from the pre-existing SecondaryNameNode, which was not able to take over immediately.
From a client perspective the most confusing question is how the active NameNode is discovered. How is HDFS High Availability configured? In this post we look at how to distribute data between two clusters running in HA mode, for example with distcp. Continue reading “Distcp between two HA Cluster”
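To illustrate how a client resolves the active NameNode, here is a minimal sketch of the client-side settings for a remote HA nameservice, set programmatically instead of in hdfs-site.xml. The nameservice ID and hostnames are hypothetical; distcp between two HA clusters relies on the same logical-URI mechanism once both nameservices are known to the client.

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HaClient {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Hypothetical nameservice "clusterB" with two NameNodes; normally defined in hdfs-site.xml.
    conf.set("dfs.nameservices", "clusterB")
    conf.set("dfs.ha.namenodes.clusterB", "nn1,nn2")
    conf.set("dfs.namenode.rpc-address.clusterB.nn1", "nn1.clusterb.example.com:8020")
    conf.set("dfs.namenode.rpc-address.clusterB.nn2", "nn2.clusterb.example.com:8020")
    // The failover proxy provider is what picks the currently active NameNode.
    conf.set("dfs.client.failover.proxy.provider.clusterB",
      "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")

    // The logical URI hdfs://clusterB is resolved to whichever NameNode is active.
    val fs = FileSystem.get(new URI("hdfs://clusterB"), conf)
    fs.listStatus(new Path("/")).foreach(status => println(status.getPath))
  }
}
```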
10 Resources About Storm
The event processing framework Apache Storm is the preferred approach for real-time Big Data. In use at large companies around the world, it proves its maturity every day at scale. This post collects some resources helpful in understanding what Storm is, and also some sources that highlight the special relationship Storm and Kafka enjoy. A minimal topology sketch follows the list.
- Storm: Distributed and Fault-Tolerant Real-time Computation
- Evaluating persistent, replicated message queues (updated w/ Kafka)
- Apache storm vs. Spark Streaming
- Storm and Spark at Yahoo: Why Chose One Over the Other
- Common Topology Patterns
- Trident Tutorial
- Trident-ML
- Storm-R
- Storm-Pattern
- ZooKeeper Curator Client & Hint
- Hortonworks Storm as part of HDP
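As a companion to these links, here is a minimal, hypothetical sketch of the spout/bolt wiring Storm is built around. It assumes a pre-1.0 Storm release, where the API lives under the backtype.storm packages, and uses the bundled TestWordSpout; component names and parallelism hints are placeholders.

```scala
import backtype.storm.{Config, LocalCluster}
import backtype.storm.testing.TestWordSpout
import backtype.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
import backtype.storm.topology.base.BaseBasicBolt
import backtype.storm.tuple.{Fields, Tuple, Values}

// A trivial bolt that upper-cases every word it receives.
class UpperCaseBolt extends BaseBasicBolt {
  override def execute(input: Tuple, collector: BasicOutputCollector): Unit =
    collector.emit(new Values(input.getString(0).toUpperCase))

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("word"))
}

object MinimalTopology {
  def main(args: Array[String]): Unit = {
    val builder = new TopologyBuilder
    builder.setSpout("words", new TestWordSpout, 1)
    builder.setBolt("upper", new UpperCaseBolt, 2).shuffleGrouping("words")

    // Run in a local in-process cluster for a few seconds, then shut down.
    val cluster = new LocalCluster
    cluster.submitTopology("minimal", new Config, builder.createTopology())
    Thread.sleep(10000)
    cluster.shutdown()
  }
}
```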
Azure VNet for Your HDP Cluster
In a series of blog posts I demonstrated how to create a custom OS image for automatic provisioning of HDP with Vagrant on the Azure cloud. On GitHub I share the result of a first layout among other provisioning scripts. Until now the setup was done without the proper network configuration required for the individual components of the cluster to communicate.
With the new release of the vagrant-azure plugin it is possible to set up the cluster in a dedicated VNet. This is the last missing piece in the series of work I published to allow automated provisioning of HDP in Azure. Unfortunately this is not quite true, as the current Ruby SDK for Azure does not allow passing IP addresses to the machines. We therefore currently have to create host entries by hand. It might be possible to set up a DNS or use Puppet to conduct the host mapping in an automated fashion, but I at least was not able to do so as part of this work. Continue reading “Azure VNet for Your HDP Cluster”
Ambari Augeas Module and Configuration
Operating clusters of any size, large or small, needs automation, and so does installation. Going repeatedly through installation instructions, typing the same commands over and over again, is really no fun. Worse, it is a waste of resources. The same holds true for keeping the configuration of 10 or 100 nodes in sync. A first step is to use a distributed shell; the next is to turn to tools like Puppet, Cobbler, Ansible, or Chef, which promise full automation of cluster operations.
As with most automation processes, these tools tend to make the common parts easy and the not-so-common steps hard. The fact that every setup is different calls for adaptability that is easy to use. One such tool for adapting configuration files is Augeas. Although hard to understand at times, Augeas is easy to use, as it turns arbitrary configuration files into trees on which typical tree operations can be performed.
This post highlights the possibilities of using Augeas in a Puppet setup to install and configure an HDP Ambari cluster. Ambari agents are part of every node in a cluster, so when adding new nodes to a cluster in an automated fashion you want to be prepared. Continue reading “Ambari Augeas Module and Configuration”
Automated Ambari Install with Existing Database
If not specified differently, the Ambari server is installed together with a PostgreSQL database for its metadata, along with a MySQL database for the Hive Metastore. A Derby database is also installed for storing the scheduling information of Oozie workflows. This post covers the installation of the Ambari server with an existing database. In support of an automated install with Ambari Blueprints, the configuration of the Hive Metastore and Oozie will also be provided. Continue reading “Automated Ambari Install with Existing Database”
Sliding Apache Cassandra Onto YARN
With the most recent release of Hadoop (2.6) comes support for long-running applications on YARN. Apache Slider is a tool that supports you in creating, managing, and monitoring long-running applications without necessarily changing anything about the way your application works. In a previous blog post I went through the different aspects of long-running applications that Slider tries to address. You might also consider watching this webinar about using Slider.
With the release of my Slider demo app, which uses Apache Cassandra, I would like to walk through some of the packaging steps required to make it run on a YARN cluster in this blog post. For this example I use a three-node test cluster, which you can easily set up with this script using Vagrant and VirtualBox. Continue reading “Sliding Apache Cassandra Onto YARN”

