Kafka Security with Kerberos

Apache Kafka developed as a durable and fast messaging queue handling real-time data feeds originally did not come with any security approach. Similar to Hadoop Kafka at the beginning was expected to be used in a trusted environment focusing on functionality instead of compliance. With the ever growing popularity and the widespread use of Kafka the community recently picked up traction around a complete security design including authentication with Kerberos and SSL, encryption, and authorization. Judging by the details of the security proposal found here the complete security measures will be included with the 0.9 release of Apache Kafka.

The releases of HDP 2.3 already today support a secure implementation of Kafka with authentication and authorization. Especially the integration with the security framework Apache Ranger this becomes a comprehensive security solution for any Hadoop deployment with real-time data demands. In this post we by example look at how working with a kerberized Kafka broker is different from before. Here working with the known shell tools and a custom Java producer. Continue reading “Kafka Security with Kerberos” →

Building HDP on Docker

Docker is a great tool that automates the deployment of software across a Linux operating system. While the fundamental idea behind Docker is to stack specialized software together to form a complex system, there is no particular rule of how big or small the software for a container can or should be. Running the complete HDP stack in a single container can be achieved as well as running each service of HDP in it’s own container.

Docker allows you to run applications inside containers. Running an application inside a container takes a single command: docker run. Containers are based off of images defining software packages and configurations. hkropp/hdp-basic is such an image in which the HDP services are running. The image was build using Ambari blueprint orchastrated by a Dockerfile. The hostname was specified to be n1.hdp throughout the build process and hence needs also to be specified when running it. The Dockerfile for this image is located here. This posts describes how to build HDP on top of Docker.

Prerequisite Setup

Before getting started a Docker environment needs to be installed. A quick way to get started is Boot2Docker. Boot2Docker is a VirtualBox image based on Tiny Core Linux with Docker installed. It can be used with Mac OS X or Windows. Other ways to install Docker can be found here.

Boot2Docker

Once installed Boot2Docker can be used via command line tool boot2docker. With it we can initialize the VM, boot it up, and prepare our shell for docker.

# getting help
$ boot2docker
Usage: boot2docker [<options>] {help|init|up|ssh|save|down|poweroff|reset|restart|config|status|info|ip|shellinit|delete|download|upgrade|version} [<args>]

# init a VM with 8GB RAM and 8 CPUs
$ boot2docker init --memory=8192 --cpus=8

# boot up the image
$ boot2docker up

# shutdown the vm
$ boot2docker down

# setup the shell
$ boot2docker shellinit

# delete the vm completely (to use again an init required)
$ boot2docker delete

# test running
$ docker version
Client version: 1.7.0
Client API version: 1.19
Go version (client): go1.4.2
Git commit (client): 0baf609
OS/Arch (client): darwin/amd64
Server version: 1.7.1
Server API version: 1.19
Go version (server): go1.4.2
Git commit (server): 786b29d
OS/Arch (server): linux/amd64

Running hdp-basic

With the Docker environment setup the image can be run like this:

$ docker run -d 
-p 8080:8080 
-h n1.hdp 
hkropp/hdp-basic:0.1 
/start-server 

Unable to find image 'hkropp/hdp-basic:0.1' locally
0.1: Pulling from hkropp/hdp-basic

If not already installed locally this will fetch the image from Docker Hub. After that the image is run in daemon mode as the -d flag indicates. The -p flag lets Docker know to expose this port to the host VM. With this Ambari can be accessed using the $ boot2docker ip and port 8080 – http://$(boot2docker ip):8080 The hostname is set to be n1.hdp because the image was configured with this hostname. By executing the /start-server script at boot time the Ambari server is started together with all installed services.

The Dockerfile

Building this image was achieved using this Dockerfile, while the installation of HDP was done using Ambari Shell with Blueprints. Helpful about Ambari Shell is the fact that an blueprint install can be executed blocking further process until the install has finished (–exitOnFinish true). From the install-cluster.sh script:

java -jar /tmp/ambari-shell.jar --ambari.host=$HOST << EOF
blueprint add --file /tmp/blueprint.json
cluster build --blueprint hdp-basic
cluster assign --hostGroup host_group_1 --host $HOST
cluster create --exitOnFinish true
EOF

The image is based from a centos:6.6 image. Throughout the build a consistent hostname is being used for the configuration and installation. Doing this with Docker builds is actually not very easy to achieve. By design Docker tries to make the context a container can run in as less restrictive as possible. Assigning a fixed host name to an image is restricting these context. In addition every build step creates a new image with a new host name. Setting the host name before each step requires root privileges which are not given. To work around this the ENV command was used to set the HOSTNAME and to make it resolvable before any command that required the hostname a script was executed to set it as part of the /etc/hosts file.

Part of the Dockerfile:

# OS
FROM centos:6.6

# Hostname Help
ENV HOSTNAME n1.hdp
ADD set_host.sh /tmp/

...

RUN /tmp/set_host.sh && /tmp/install-cluster.sh

Part of the set_host.sh:

#!/bin/bash

echo $(head -1 /etc/hosts | cut -f1) n1.hdp >> /etc/hosts

The Ambari agents support dynamic host configuration by defining a script.

Dockerfile:

# Setup networking for Ambari agent/server
ADD hostname.sh /etc/ambari-agent/conf/hostname.sh
#RUN sed -i "s/hostname=.*/hostname=n1.hdp/" /etc/ambari-agent/conf/ambari-agent.ini
RUN sed -i "/[agent]/ a public_hostname_script=/etc/ambari-agent/conf/hostname.sh" /etc/ambari-agent/conf/ambari-agent.ini
RUN sed -i "/[agent]/ a hostname_script=/etc/ambari-agent/conf/hostname.sh" /etc/ambari-agent/conf/ambari-agent.ini
RUN sed -i "s/agent.task.timeout=900/agent.task.timeout=2000/" /etc/ambari-server/conf/ambari.properties

hostname.sh:

#!/bin/bash

# echo $(hostname -f) # for dynamic host name
echo "n1.hdp"

Starting HDP

start-server is the script that is executed during startup of the container. Here the Ambari server and agent are started. The Ambari Shell is again being used to start up the all installed HDP services.

#!/bin/bash

while [ -z "$(netstat -tulpn | grep 8080)" ]; do
  ambari-server start
  ambari-agent start
  sleep 5
done

sleep 5

java -jar /tmp/ambari-shell.jar --ambari.host=n1.hdp << EOF
services start
EOF

while true; do
  sleep 3
  tail -f /var/log/ambari-server/ambari-server.log
done

HiveSink for Flume

With the most recent release of HDP (v2.2.4) Hive Streaming is shipped as technical preview. It can for example be used with Storm to ingest streaming data collected from Kafka as demonstrated here. But it also still has some serious limitations and in case of Storm a major bug. Nevertheless Hive Streaming is likely to become the tool of choice when it comes to streamline data ingestion to Hadoop. So it is worth to explore already today.

Flume’s upcoming release 1.6 will contain a HiveSink capable of leveraging Hive Streaming for data ingestion. In the following post we will use it as a replacement for the HDFS sink used in a previous post here. Other then replacing the HDFS sink with a HiveSink none of the previous setup will change, except for Hive table schema which needs to be adjusted as part of the requirements that currently exist around Hive Streaming. So let’s get started by looking into these restrictions. Continue reading “HiveSink for Flume” →

Distcp between two HA Cluster

With HDFS High Availability two nodes can act as a NameNode to the system, but not at the same time. Only one of the nodes acts as the active NameNode at any point in time while the other node is in a standby state. The standby node only acts as a slave node preserving enough state to immediately take over when the active node dies. In that it differs from the before existing SecondaryNamenode which was not able to immediately take over.

From a client perspective most confusing is the fact how the active NameNode is discovered? How is HDFS High Availability configured? In this post we look at how for example to distribute data between two clusters in HA mode. Continue reading “Distcp between two HA Cluster” →

Hive Streaming with Storm

With the release of Hive 0.13.1 and HCatalog, a new Streaming API was released as a Technical Preview to support continuous data ingestion into Hive tables. This API is intended to support streaming clients like Flume or Storm to better store data in Hive, which traditionally has been a batch oriented storage.

Based on the newly given ACID insert/update capabilities of Hive, the Streaming API is breaking down a stream of data into smaller batches which get committed in a transaction to the underlying storage. Once committed the data becomes immediately available for other queries.

Broadly speaking the API consists of two parts. One part is handling the transaction while the other is dealing with the underlying storage (HDFS). Transactions in Hive are handled by the the Metastore. Kerberos is supported from the beginning!

Some of the current limitations are:

Only delimited input data and JSON (strict syntax) are supported
Only ORC support
Hive table must be bucketed (unpartitioned tables are supported)

In this post I would like to demonstrate the use of a newly created Storm HiveBolt that makes use of the streaming API and is quite straightforward to use. The source of the here described example is provided at GitHub. To run this demo you would need a HDP 2.2 Sandbox, which can be downloaded for various virtualization environments here. Continue reading “Hive Streaming with Storm” →

HDP Ansible Playbook Example

In my existing collection of automated install scripts for HDP I always try to extend it with further examples of different provisioners, providers, and settings. Recently I added with hdp22-n1-centos6-ansible an example Ansible environment for preparing a HDP 2.2 installation on one node.

Ansible differs from other provisioners like Puppet or Chef by a simplified approach in dependence on SSH. It behaves almost as a distributed shell putting little dependencies on existing hosts. Where for example Puppet makes strong assumptions about the current state of a system with one or multiple nodes, does Ansible more or less reflect a collection of tasks a system gone through to reach it’s current state. While some celebrate Ansible for it’s simplicity do others abandon it for it’s lack of strong integrity.

In this post I would like to share a sample Ansible Playbook to prepare a HDP 2.2 Amabri installation using Vagrant with Virtualbox. You can download and view the in this post discussed example here. Continue reading “HDP Ansible Playbook Example” →

Try Now: HDP 2.2 on Windows Azure

HDInsight the Hadoop cloud offering from Windows Azure is a great way to use BigData as a service solutions, but there is more. With the general availability of HDP 2.2 announced this week it is great opportunity to extend the existing HDP Vagrant collection with the Windows Azure provider. In this blog post I want to demonstrate the needed steps to quickly setup a 6 node Hadoop cluster using the provided script. Except for preliminary setup steps it only takes a little adjustment of the Vagrantfile and two commands to setup the whole cluster.

Our 5 node cluster will consist of two master nodes, three data nodes, and one edge node with the Apache Knox gateway installed among other client libraries. Let’s jump in right now. Continue reading “Try Now: HDP 2.2 on Windows Azure” →

Installing HttpFS Gateway on a Kerberized Cluster

HttpFS gateway is the preferred way of accessing the Hadoop filesystem using HTTP clients like curl. Additionally it can be used from from the hadoop fs command line tool ultimately being a replacement for the hftp protocol. HttpFS, unlike HDFS Proxy, has full support for all file operations with additional support for authentication. Given it’s stateless protocol it is ideal to scale out Hadoop filesystem access using HTTP clients.

In this post I would like to show how to install and setup a HttpFS gateway on a secure and kerberized cluster. By providing some troubleshooting topics, this post should also help you, when running into problems while installing the gateway. Continue reading “Installing HttpFS Gateway on a Kerberized Cluster” →

Securing Your Datalake With Apache Argus – Part 1

Apache Argus, the Apache open source project, with it’s comprehensive security offering for today’s Hadoop installations is likely to become an important cornerstone of modern enterprise BigData architectures. It’s by today already quite sophisticate compared to other product offerings.

Key aspects of Argus are the Administration, Authorization, and Audit Logging covering most security demands. In the future we might even see Data Protection (encryption) as well.

Argus consists of four major components that tied together build a secure layer around your Hadoop installation. Within Argus it is the Administration Portal, a web application, that is capable of managing and accessing the Audit Server and Policy Manager, also two important components of Apache Argus. At the client side or a the Hadoop services like the HiveServer2 or the NameNode Argus installs specific agents that encapsulate requests based on the policies specified.

A key aspect of Argus is, that the clients don’t have to request the Policy Server on every single client call, but are updated in a certain interval. This improves the scalability and also ensures that clients continue working even when the Policy Server is down.

Let’s go ahead an install a most recent version of Apache Argus using the HDP Sandbox 2.1. By installing the Policy Manager, Hive, and HDFS Agent you should have a pretty good idea of how Argus operates and a pretty solid environment to test specific use cases.

In this part we’ll only install the Policy Manager of Argus synced together with our OpenLdap installation for user and group management. We will use our kerberized HDP Sandbox throughout this post. Continue reading “Securing Your Datalake With Apache Argus – Part 1” →

Kerberized Hadoop Cluster – A Sandbox Example

The groundwork of any secure system installation is a strong authentication. It is the process of verifying the identity of a user by comparing known factors. Factors can be:

Shared Knowledge
A password or the answer to a question. It’s the most common and not seldom the only factor used by computer systems today.
Biometric Attributes
For example fingerprints or iris pattern
Items One Possess
A Smart Card or phone. Phone is probably one of the most common factors in use today aside a shared knowledge.

A system that takes more than one factor into account for authentication is also know as a multi-factor authentication system. Knowing the identity of a user up to a specific certainty can not be overestimated.

All other components of a save environment, like Authorization, Audit, Data Protection, and Administration, heavily rely on a strong authentication. Authorization or Auditing only make sense if the identity of a user can not be compromised. In Hadoop today there exist solution for nearly all aspects of enterprise grade security layers, especially with the event of Apache Argus. Continue reading “Kerberized Hadoop Cluster – A Sandbox Example” →

Category: Hadoop