Running PySpark with Virtualenv

Controlling the environment of an application is vital for it’s functionality and stability. Especially in a distributed environment it is important for developers to have control over the version of dependencies. In such an scenario it’s a critical task to ensure possible conflicting requirements of multiple applications are not disturbing each other.

That is why frameworks like YARN ensure that each application is executed in a self-contained environment – typically in a Linux (Java) Container or Docker Container – that is controlled by the developer. In this post we show what this means for Python environments being used by Spark. Continue reading “Running PySpark with Virtualenv”

HDP Repo with Nginx

Environments dedicated for a HDP install without connection to the internet require a dedicated HDP repository all nodes have access to. While such a setup can differ slightly depending on the connection, if they have temporary or no internet access, in any case they need a file service holding a copy of the HDP repo. Most enterprises have a dedicated infrastructure in place based on Aptly or Satellite. This post describes the setup of an Nginx host serving as a HDP repository host. Continue reading “HDP Repo with Nginx”

A Secure HDFS Client Example

It takes about 3 lines of Java code to write a simple HDFS client that can further be used to upload, read or list files. Here is an example:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

This file system API gives the developer a generic interface to (any supported) file system depending on the protocol being use, in this case hdfs. This is enough to alter data on the Hadoop Distributed Filesystem, for example to list all the files under the root folder:

FileStatus[] fsStatus = fs.listStatus(new Path("/"));
for(int i = 0; i < fsStatus.length; i++){

For a secured environment this is not enough, because you would need to consider these further aspects:

  1. A secure protocol
  2. Authentication with Kerberos
  3. Impersonation (proxy user), if designed as a service

What we discuss here for a sample HDFS client can in variance also be applied to other Hadoop clients.

Continue reading “A Secure HDFS Client Example”

Connecting Tomcat to a Kerberized HDP Cluster

At some point you might require to connect your dashboard, data ingestion service or similar to a secured and kerberized HDP cluster. Most Java based webcontainers do support Kerberos for both client and server side communication. Kerberos does require very thoughtful configuration but rewards it’s users with an almost completely transparent authentication implementation that simply works. Steps described in this post should enable you to connect your application with a secured HDP cluster. For further support read the links listed at the end of this writing. A sample project is provided on github for hands-on exercises. Continue reading “Connecting Tomcat to a Kerberized HDP Cluster”

Kafka Security with Kerberos

Apache Kafka developed as a durable and fast messaging queue handling real-time data feeds originally did not come with any security approach. Similar to Hadoop Kafka at the beginning was expected to be used in a trusted environment focusing on functionality instead of compliance. With the ever growing popularity and the widespread use of Kafka the community recently picked up traction around a complete security design including authentication with Kerberos and SSL, encryption, and authorization. Judging by the details of the security proposal found here the complete security measures will be included with the 0.9 release of Apache Kafka.

The releases of HDP 2.3 already today support a secure implementation of Kafka with authentication and authorization. Especially the integration with the security framework Apache Ranger this becomes a comprehensive security solution for any Hadoop deployment with real-time data demands. In this post we by example look at how working with a kerberized Kafka broker is different from before. Here working with the known shell tools and a custom Java producer. Continue reading “Kafka Security with Kerberos”

Building HDP on Docker

Docker is a great tool that automates the deployment of software across a Linux operating system. While the fundamental idea behind Docker is to stack specialized software together to form a complex system, there is no particular rule of how big or small the software for a container can or should be. Running the complete HDP stack in a single container can be achieved as well as running each service of HDP in it’s own container.

Docker allows you to run applications inside containers. Running an application inside a container takes a single command: docker run. Containers are based off of images defining software packages and configurations. hkropp/hdp-basic is such an image in which the HDP services are running. The image was build using Ambari blueprint orchastrated by a Dockerfile. The hostname was specified to be n1.hdp throughout the build process and hence needs also to be specified when running it. The Dockerfile for this image is located here. This posts describes how to build HDP on top of Docker.

Prerequisite Setup

Before getting started a Docker environment needs to be installed. A quick way to get started is Boot2Docker. Boot2Docker is a VirtualBox image based on Tiny Core Linux with Docker installed. It can be used with Mac OS X or Windows. Other ways to install Docker can be found here.


Once installed Boot2Docker can be used via command line tool boot2docker. With it we can initialize the VM, boot it up, and prepare our shell for docker.

# getting help
$ boot2docker
Usage: boot2docker [<options>] {help|init|up|ssh|save|down|poweroff|reset|restart|config|status|info|ip|shellinit|delete|download|upgrade|version} [<args>]

# init a VM with 8GB RAM and 8 CPUs
$ boot2docker init --memory=8192 --cpus=8

# boot up the image
$ boot2docker up

# shutdown the vm
$ boot2docker down

# setup the shell
$ boot2docker shellinit

# delete the vm completely (to use again an init required)
$ boot2docker delete

# test running
$ docker version
Client version: 1.7.0
Client API version: 1.19
Go version (client): go1.4.2
Git commit (client): 0baf609
OS/Arch (client): darwin/amd64
Server version: 1.7.1
Server API version: 1.19
Go version (server): go1.4.2
Git commit (server): 786b29d
OS/Arch (server): linux/amd64

Running hdp-basic

With the Docker environment setup the image can be run like this:

$ docker run -d 
-p 8080:8080 
-h n1.hdp 

Unable to find image 'hkropp/hdp-basic:0.1' locally
0.1: Pulling from hkropp/hdp-basic

If not already installed locally this will fetch the image from Docker Hub. After that the image is run in daemon mode as the -d  flag indicates. The -p flag lets Docker know to expose this port to the host VM. With this Ambari can be accessed using the $ boot2docker ip  and port 8080 – http://$(boot2docker ip):8080 The hostname is set to be n1.hdp because the image was configured with this hostname. By executing the /start-server script at boot time the Ambari server is started together with all installed services.

The Dockerfile

Building this image was achieved using this Dockerfile, while the installation of HDP was done using Ambari Shell with Blueprints. Helpful about Ambari Shell is the fact that an blueprint install can be executed blocking further process until the install has finished (–exitOnFinish true). From the script:

java -jar /tmp/ambari-shell.jar$HOST << EOF
blueprint add --file /tmp/blueprint.json
cluster build --blueprint hdp-basic
cluster assign --hostGroup host_group_1 --host $HOST
cluster create --exitOnFinish true

The image is based from a centos:6.6 image. Throughout the build a consistent hostname is being used for the configuration and installation. Doing this with Docker builds is actually not very easy to achieve. By design Docker tries to make the context a container can run in as less restrictive as possible. Assigning a fixed host name to an image is restricting these context. In addition every build step creates a new image with a new host name. Setting the host name before each step requires root privileges which are not given. To work around this the ENV command was used to set the HOSTNAME and to make it resolvable before any command that required the hostname a script was executed to set it as part of the /etc/hosts file.

Part of the Dockerfile:

# OS
FROM centos:6.6

# Hostname Help
ADD /tmp/


RUN /tmp/ && /tmp/

Part of the


echo $(head -1 /etc/hosts | cut -f1) n1.hdp >> /etc/hosts

The Ambari agents support dynamic host configuration by defining a script.


# Setup networking for Ambari agent/server
ADD /etc/ambari-agent/conf/
#RUN sed -i "s/hostname=.*/hostname=n1.hdp/" /etc/ambari-agent/conf/ambari-agent.ini
RUN sed -i "/[agent]/ a public_hostname_script=/etc/ambari-agent/conf/" /etc/ambari-agent/conf/ambari-agent.ini
RUN sed -i "/[agent]/ a hostname_script=/etc/ambari-agent/conf/" /etc/ambari-agent/conf/ambari-agent.ini
RUN sed -i "s/agent.task.timeout=900/agent.task.timeout=2000/" /etc/ambari-server/conf/


# echo $(hostname -f) # for dynamic host name
echo "n1.hdp"

Starting HDP

start-server is the script that is executed during startup of the container. Here the Ambari server and agent are started. The Ambari Shell is again being used to start up the all installed HDP services.


while [ -z "$(netstat -tulpn | grep 8080)" ]; do
  ambari-server start
  ambari-agent start
  sleep 5

sleep 5

java -jar /tmp/ambari-shell.jar << EOF
services start

while true; do
  sleep 3
  tail -f /var/log/ambari-server/ambari-server.log

Further Readings

HiveSink for Flume

With the most recent release of HDP (v2.2.4) Hive Streaming is shipped as technical preview. It can for example be used with Storm to ingest streaming data collected from Kafka as demonstrated here. But it also still has some serious limitations and in case of Storm a major bug. Nevertheless Hive Streaming is likely to become the tool of choice when it comes to streamline data ingestion to Hadoop. So it is worth to explore already today.

Flume’s upcoming release 1.6 will contain a HiveSink capable of leveraging Hive Streaming for data ingestion. In the following post we will use it as a replacement for the HDFS sink used in a previous post here. Other then replacing the HDFS sink with a HiveSink none of the previous setup will change, except for Hive table schema which needs to be adjusted as part of the requirements that currently exist around Hive Streaming. So let’s get started by looking into these restrictions. Continue reading “HiveSink for Flume”