Storm Serialization with Avro (using Kryo Serializer)

Working with complex data events can be a challenge when designing Storm topologies for real-time data processing. In such cases, emitting single values for multiple and varying event characteristics soon reveals its limitations. For message serialization Storm leverages the Kryo serialization framework, which is used by many other projects. Kryo keeps a registry of the serializers used for the corresponding Class types. Mappings in that registry can be overridden or added, which makes the framework extensible to diverse type serializations.

Avro, on the other hand, is a very popular “data serialization system” that bridges many different programming languages and tools. While the fact that data objects can be described in JSON makes it really easy to use, Avro is often chosen for its support of schema evolution. With schema evolution the same implementation (Storm topology) can read different versions of the same data event without adaptation. This makes it a very good fit for Storm as an intermediary between data ingestion points and data storage in today’s Enterprise Data Architectures.

Storm Enterprise Data Architecture

The example here does not provide complex event samples to illustrate that point, but it gives an end-to-end implementation of a Storm topology where events are sent to a Kafka queue as Avro objects and processed natively by a real-time processing topology. The example can be found here. It’s a simple Hive Streaming example where stock events are read from a CSV file and sent to Kafka. Stock events are a flat, non-complex data type, as already mentioned, but we’ll still use them to demo serialization with Avro. Continue reading “Storm Serialization with Avro (using Kryo Serializer)”

Install HDP with Red Hat Satellite

As part of the installation of HDP with Ambari, two repositories are generated with the URLs defined as user input during the first steps of the install wizard, and distributed to the cluster hosts. In cases where you are using Red Hat Satellite to manage your Linux infrastructure, you need to disable these generated repositories in order to leverage Red Hat Satellite. The same is also true for SUSE Manager (Spacewalk).

Prior to the install and prior to starting the Ambari server you need to disable the repositories by altering the template responsible for generating them.

Prior to Ambari 2.x you need to change the repo_suse_rhel.j2 template to disable the generated repositories. In that file, simply change enabled=1 to enabled=0. To find the template file, run $ find /var/lib/ambari-server -name repo_suse_rhel.j2 .
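The edit itself can be scripted. The following is only a minimal sketch, assuming the template contains a literal enabled=1 line; the helper name disable_repo_template is hypothetical, and a backup copy is kept next to the original so the change can be reverted.

```shell
#!/bin/bash
# Hypothetical helper: switch enabled=1 to enabled=0 in a repo template.
# Assumption: the template contains a line that reads exactly "enabled=1".
disable_repo_template() {
  local template="$1"
  # Keep a backup copy next to the original before rewriting it.
  cp "$template" "$template.bak"
  # Read from the backup and write the modified content to the template.
  sed 's/^enabled=1$/enabled=0/' "$template.bak" > "$template"
}
```

It could then be applied to the file located with find, for example: disable_repo_template "$(find /var/lib/ambari-server -name repo_suse_rhel.j2 | head -n 1)".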

Starting with Ambari 2.x the configuration for the repositories can be found in cluster-env.xml under /var/lib/ambari-server/resources/stacks/HDP/2.0.6/configuration. In that file look for <name>repo_suse_rhel_template</name> and, here too, change the value of enabled to 0.

Save the changes and start your install. Continue reading “Install HDP with Red Hat Satellite”

Storm Flux: Easy Streaming Deployment

With Flux for Apache Storm, deploying streaming topologies for real-time processing becomes less programmatic and more declarative. Using Flux for deployments makes it less likely that you will have to recompile your project just because you have reconfigured or rearranged your topology. It leverages YAML, a human-readable serialization format, to describe a topology as a whole. You might still need to write some classes, but by taking advantage of existing, generic implementations this becomes less likely.

While Flux can also be used with an existing topology, for this post we’ll take the Hive-Streaming example you can find here (blog post) to create the required topology from scratch using Flux. For experiments and demo purposes you can use the following Vagrant setup to run an HDP cluster locally. Continue reading “Storm Flux: Easy Streaming Deployment”

JPMML Example Random Forest

The Predictive Model Markup Language (PMML) developed by the Data Mining Group is a standardized XML-based representation of mining models to be used and shared across languages or tools. The standardized definition allows a classification model trained with R to be used with Storm for example. Many projects related to Big Data have some support for PMML, which is often implemented by JPMML. Continue reading “JPMML Example Random Forest”

A Java Agent Example (-javaagent)

Since Java 5, developers have been able to define so-called pre-main hooks to manipulate the execution of a Java program at runtime with Java agents. An agent that is part of the classpath is triggered before the main method is executed and can therefore be used to either filter calls to, or even manipulate, the underlying Java code. One tool for code manipulation is Javassist. Apache Ranger, for example, uses both Java agents and Javassist to override the authorization mechanism of components of the Hadoop stack. Together with Ranger Stacks, this could also be used to secure existing code unchanged at runtime.

In this post we are going to look at a very basic example of using Java agents to manipulate existing code. Continue reading “A Java Agent Example (-javaagent)”

Building HDP on Docker

Docker is a great tool that automates the deployment of software across a Linux operating system. While the fundamental idea behind Docker is to stack specialized software together to form a complex system, there is no particular rule for how big or small the software for a container can or should be. Running the complete HDP stack in a single container is just as possible as running each HDP service in its own container.

Docker allows you to run applications inside containers. Running an application inside a container takes a single command: docker run. Containers are based on images that define software packages and configurations. hkropp/hdp-basic is such an image, in which the HDP services are running. The image was built using an Ambari blueprint orchestrated by a Dockerfile. The hostname was specified as n1.hdp throughout the build process and hence also needs to be specified when running the image. The Dockerfile for this image is located here. This post describes how to build HDP on top of Docker.

Prerequisite Setup

Before getting started a Docker environment needs to be installed. A quick way to get started is Boot2Docker. Boot2Docker is a VirtualBox image based on Tiny Core Linux with Docker installed. It can be used with Mac OS X or Windows. Other ways to install Docker can be found here.

Boot2Docker

Once installed, Boot2Docker can be used via the command-line tool boot2docker. With it we can initialize the VM, boot it up, and prepare our shell for docker.

# getting help
$ boot2docker
Usage: boot2docker [<options>] {help|init|up|ssh|save|down|poweroff|reset|restart|config|status|info|ip|shellinit|delete|download|upgrade|version} [<args>]

# init a VM with 8GB RAM and 8 CPUs
$ boot2docker init --memory=8192 --cpus=8

# boot up the image
$ boot2docker up

# shutdown the vm
$ boot2docker down

# set up the shell (shellinit prints the export statements, eval applies them)
$ eval "$(boot2docker shellinit)"

# delete the vm completely (an init is required before using it again)
$ boot2docker delete

# test running
$ docker version
Client version: 1.7.0
Client API version: 1.19
Go version (client): go1.4.2
Git commit (client): 0baf609
OS/Arch (client): darwin/amd64
Server version: 1.7.1
Server API version: 1.19
Go version (server): go1.4.2
Git commit (server): 786b29d
OS/Arch (server): linux/amd64

Running hdp-basic

With the Docker environment set up, the image can be run like this:

$ docker run -d \
-p 8080:8080 \
-h n1.hdp \
hkropp/hdp-basic:0.1 \
/start-server

Unable to find image 'hkropp/hdp-basic:0.1' locally
0.1: Pulling from hkropp/hdp-basic

If the image is not already available locally, this will fetch it from Docker Hub. After that the image is run in daemon mode, as the -d flag indicates. The -p flag tells Docker to expose this port to the host VM. With this, Ambari can be accessed using the address returned by $ boot2docker ip and port 8080: http://$(boot2docker ip):8080. The hostname is set to n1.hdp because the image was configured with this hostname. By executing the /start-server script at boot time, the Ambari server is started together with all installed services.
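Since the services take a while to come up after the container starts, it can help to poll the Ambari URL from the host before opening it. The following is only a sketch: the function names ambari_url and wait_for_ambari are illustrative, and it assumes curl is installed on the host and that Ambari answers plain HTTP requests on the exposed port with a successful status once it is ready.

```shell
#!/bin/bash
# Build the Ambari URL from a host address and an optional port (default 8080).
ambari_url() {
  local host="$1" port="${2:-8080}"
  printf 'http://%s:%s\n' "$host" "$port"
}

# Poll the URL until it answers or the attempts run out.
# Returns 0 as soon as curl gets a successful response, 1 otherwise.
wait_for_ambari() {
  local url="$1" attempts="${2:-60}" delay="${3:-5}" i
  for ((i = 1; i <= attempts; i++)); do
    if curl --silent --fail --output /dev/null "$url"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}
```

With Boot2Docker this could be called as wait_for_ambari "$(ambari_url "$(boot2docker ip)")".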

The Dockerfile

This image was built using this Dockerfile, while the installation of HDP was done using Ambari Shell with Blueprints. A helpful feature of Ambari Shell is that a blueprint install can be executed in blocking mode, holding up further processing until the install has finished (--exitOnFinish true). From the install-cluster.sh script:

java -jar /tmp/ambari-shell.jar --ambari.host=$HOST << EOF
blueprint add --file /tmp/blueprint.json
cluster build --blueprint hdp-basic
cluster assign --hostGroup host_group_1 --host $HOST
cluster create --exitOnFinish true
EOF

The image is based on a centos:6.6 image. Throughout the build a consistent hostname is used for the configuration and installation. Doing this with Docker builds is actually not very easy to achieve. By design, Docker tries to make the context a container can run in as unrestrictive as possible. Assigning a fixed hostname to an image restricts this context. In addition, every build step creates a new image with a new hostname. Setting the hostname before each step requires root privileges, which are not given. To work around this, the ENV command was used to set HOSTNAME, and to make the hostname resolvable, a script that adds it to the /etc/hosts file was executed before any command that required it.

Part of the Dockerfile:

# OS
FROM centos:6.6

# Hostname Help
ENV HOSTNAME n1.hdp
ADD set_host.sh /tmp/

...

RUN /tmp/set_host.sh && /tmp/install-cluster.sh

Part of the set_host.sh:

#!/bin/bash

echo $(head -1 /etc/hosts | cut -f1) n1.hdp >> /etc/hosts
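Because the script appends a line on every run, repeated invocations add duplicate entries. A hedged variant that only appends when the alias is missing could look like the sketch below. The function name add_host_alias is hypothetical, and the hosts file is passed as a parameter so the function can be tried on a copy; like the original, it assumes the first tab-separated field of the first line holds the container's IP address.

```shell
#!/bin/bash
# Hypothetical variant of set_host.sh: append the alias only if it is missing.
add_host_alias() {
  local hosts_file="$1" alias_name="$2" ip
  # Same idea as the original script: take the first tab-separated field
  # of the first line, which is assumed to be the container's IP address.
  ip=$(head -n 1 "$hosts_file" | cut -f1)
  # Skip the append when the file already mentions the alias as a whole word.
  if ! grep -q -F -w -- "$alias_name" "$hosts_file"; then
    printf '%s\t%s\n' "$ip" "$alias_name" >> "$hosts_file"
  fi
}
```

Inside the build it would be called as add_host_alias /etc/hosts n1.hdp.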

The Ambari agents support dynamic host configuration by defining a script.

Dockerfile:

# Setup networking for Ambari agent/server
ADD hostname.sh /etc/ambari-agent/conf/hostname.sh
#RUN sed -i "s/hostname=.*/hostname=n1.hdp/" /etc/ambari-agent/conf/ambari-agent.ini
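# The brackets are escaped so that [agent] matches the section header literally
# instead of being read as a character class.
RUN sed -i "/\[agent\]/ a public_hostname_script=/etc/ambari-agent/conf/hostname.sh" /etc/ambari-agent/conf/ambari-agent.ini
RUN sed -i "/\[agent\]/ a hostname_script=/etc/ambari-agent/conf/hostname.sh" /etc/ambari-agent/conf/ambari-agent.ini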
RUN sed -i "s/agent.task.timeout=900/agent.task.timeout=2000/" /etc/ambari-server/conf/ambari.properties

hostname.sh:

#!/bin/bash

# echo $(hostname -f) # for dynamic host name
echo "n1.hdp"

Starting HDP

start-server is the script that is executed during startup of the container. It starts the Ambari server and agent, and then uses the Ambari Shell again to start all installed HDP services.

#!/bin/bash

while [ -z "$(netstat -tulpn | grep 8080)" ]; do
  ambari-server start
  ambari-agent start
  sleep 5
done

sleep 5

java -jar /tmp/ambari-shell.jar --ambari.host=n1.hdp << EOF
services start
EOF

while true; do
  sleep 3
  tail -f /var/log/ambari-server/ambari-server.log
done

Further Readings

Install Apache Zeppelin via REST & Ambari

The Ambari server offers a comprehensive REST API to install a complete HDP cluster or manage any part of it. In this post we are going to explore the possibilities of installing a new service. With Ambari it is fairly easy to define custom services for management. Just as YARN is a general-purpose execution engine, Ambari can be seen as the general-purpose management service. For this example we are using the Apache Zeppelin service provided here. More general documentation on how to install a new service on an existing cluster can be found here. Continue reading “Install Apache Zeppelin via REST & Ambari”

Upgrade Docker to Master on OS X

Docker is a fast-moving project that enjoys a lot of popularity among developers across all branches. With this wide support the Docker ecosystem is evolving almost every day, ranging from Deis as a PaaS platform to cluster management in CoreOS and Kubernetes. Even Microsoft is considering Docker support for the next version of Microsoft Server. In this post I would like to demonstrate how to upgrade to a master release of Docker running on Mac OS X with Boot2Docker, which comes in handy when trying to keep up with the latest development or to work around a current bug that is already fixed in a new release. The notes given here will likely also be useful in other environments. Continue reading “Upgrade Docker to Master on OS X”

Installing Ranger with Ambari Blueprints

The new release of HDP 2.3 comes with Ambari 2.1, which brings, among other improvements, the provisioning and management of Apache Ranger. Along with new agents for centralized authorization management, Ranger brings a new KMS key storage for HDFS encryption. HDP components in Ambari can be installed and configured through blueprints, which are described in JSON notation. Continue reading “Installing Ranger with Ambari Blueprints”

Services and State with Ambari REST API

The Ambari management tool for Hadoop offers, among other handy tools, a comprehensive REST API for cluster administration. Logically a cluster is divided into hosts, services and service components. While the UI might not always support every scenario you need, the REST API can usually be used to achieve it, for example moving a master component of a service from one host to another.
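As a small illustration of driving a service through the REST API, the sketch below builds the request body for a service state change and sends it with curl. The function names are hypothetical, and the base URL, cluster name, service name and credentials are placeholders rather than values from this post. It relies on the Ambari convention that the state INSTALLED corresponds to a stopped service and STARTED to a running one.

```shell
#!/bin/bash
# Build the JSON body that asks Ambari to move a service to a target state.
# In the Ambari API, INSTALLED means stopped and STARTED means running.
service_state_body() {
  local context="$1" state="$2"
  printf '{"RequestInfo":{"context":"%s"},"Body":{"ServiceInfo":{"state":"%s"}}}\n' \
    "$context" "$state"
}

# Send the request. Base URL, cluster, service and credentials are
# placeholders for illustration only.
set_service_state() {
  local base_url="$1" cluster="$2" service="$3" state="$4"
  curl --silent --user "${AMBARI_USER:-admin}:${AMBARI_PASSWORD:-admin}" \
    -H 'X-Requested-By: ambari' -X PUT \
    -d "$(service_state_body "Set $service to $state" "$state")" \
    "$base_url/api/v1/clusters/$cluster/services/$service"
}
```

A call such as set_service_state http://n1.hdp:8080 mycluster HDFS INSTALLED would then request a stop of HDFS; the cluster name mycluster is made up for this example.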

In this post we are going to look a little closer at the way the Ambari API can be used to manage Hadoop services. At the end of this post you will find a list of all currently supported Hadoop services with all the master, slave and client components that can be managed and administered within your HDP stack. This post also covers the possible states and state transitions a component can have, which can be useful when facing problems like Host config is in invalid state. Continue reading “Services and State with Ambari REST API”