Browsing HDP Public Repo with s3cmd

From time to time it can be very useful to search for an HDP repository release directly in the public repo, especially if you are looking for a recent development or technical preview version. This also comes in handy if you need to create an offline repository for your company intranet.

The HDP repositories are available through Amazon’s S3 storage layer. A convenient tool for browsing them is s3cmd.

Once downloaded, s3cmd can easily be installed with Python:

# requires python-setuptools
$ cd ~/Downloads/
$ tar xfz s3cmd-1.5.2.tar.gz
$ cd s3cmd-1.5.2
$ more INSTALL   # read the install guide
$ sudo python setup.py install

Browsing the HDP repo:

$ s3cmd ls s3://

Listing a directory:

$ s3cmd ls s3://
DIR   s3://
DIR   s3://
DIR   s3://
DIR   s3://
DIR   s3://
DIR   s3://
2013-07-09 00:06  0  s3://

Wildcards can be used for filtering:

$ s3cmd ls s3://*
DIR   s3://
DIR   s3://
DIR   s3://
DIR   s3://
DIR   s3://
DIR   s3://
DIR   s3://
DIR   s3://
DIR   s3://

For help:

$ s3cmd --help
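
For the offline repository use case mentioned at the beginning, s3cmd's sync command can mirror a repo path to a local directory. The following is only a minimal sketch under assumptions: the bucket path and target directory are hypothetical placeholders (the actual repo path is not spelled out here), and the function name mirror_repo is made up for illustration. By default it only previews the transfer using s3cmd's --dry-run option.

```shell
#!/bin/sh
# Command to run; overridable so the sketch can be tried without touching S3.
S3CMD="${S3CMD:-s3cmd}"

# mirror_repo SRC DEST [run]
#   SRC  - S3 prefix of the repo (hypothetical placeholder in the usage below)
#   DEST - local directory that receives the mirror
#   run  - pass "run" to actually transfer; anything else only previews
mirror_repo() {
    src="$1"
    dest="$2"
    mode="${3:-preview}"
    mkdir -p "$dest" || return 1
    if [ "$mode" = "run" ]; then
        "$S3CMD" sync "$src" "$dest"
    else
        # --dry-run lists what would be downloaded without downloading it
        "$S3CMD" sync --dry-run "$src" "$dest"
    fi
}
```

A preview followed by the real transfer would then be mirror_repo s3://example-bucket/path/to/repo/ ./hdp-mirror/ and the same call with run appended.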


Automated Blueprint Install with Ambari Shell

Ambari Shell is an interactive command line tool to administer Ambari-managed HDP clusters. It supports all functionality provided by the UI of the Ambari web application. Written as a Java application based on a Groovy REST client, it further provides tab completion and context-aware commands. In a previous post we already discussed various contexts like service and state while using REST calls to alter them. Ambari Shell is a convenient tool for managing most of the complex aspects discussed there.

It can also be used for automated cluster installs based on Ambari Blueprints. While it is fairly simple to do a blueprint-based install with two curl requests, Ambari Shell has the advantage of monitoring the process. In scripted setups, and with provisioning tools like Puppet, Chef, or Ansible, this makes it possible to schedule setup steps after the cluster install has completed. Executing a cluster install with --exitOnFinish true will halt the execution of the script until the install has finished.

An example of this is used as part of this Dockerfile, which relies on a parameterized script. The example below is used as part of a Puppet install triggered with Vagrant:



# Feed the install commands to Ambari Shell; --exitOnFinish true blocks until the install is done
java -jar /vagrant/bin/ambari-shell.jar << EOF
blueprint add --file /vagrant/blueprints/${blueprint_name}/blueprint.json
cluster build --blueprint ${blueprint_name}
cluster assign --hostGroup node_1 --host one.hdp
cluster create --exitOnFinish true
EOF

sleep 60
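
For comparison, the two curl requests mentioned above can be sketched roughly as follows. This is an illustration under assumptions, not a tested setup: the Ambari host, the admin:admin credentials, the file names, and the function names are hypothetical. The first request registers the blueprint, the second creates the cluster from a host mapping. The second request returns as soon as the install has been accepted, so a script still has to poll for completion itself, which is what --exitOnFinish true saves you.

```shell
#!/bin/sh
CURL="${CURL:-curl}"
# Hypothetical Ambari server and credentials; adjust to your environment.
AMBARI_URL="${AMBARI_URL:-http://ambari.example.com:8080}"
AMBARI_AUTH="${AMBARI_AUTH:-admin:admin}"

# ambari_post API_PATH JSON_FILE
# POSTs the JSON file to $AMBARI_URL/api/v1/API_PATH.
ambari_post() {
    "$CURL" --fail --silent --show-error \
        -u "$AMBARI_AUTH" \
        -H "X-Requested-By: ambari" \
        -X POST -d "@$2" \
        "$AMBARI_URL/api/v1/$1"
}

# blueprint_install BLUEPRINT_NAME BLUEPRINT_JSON CLUSTER_NAME HOSTMAPPING_JSON
blueprint_install() {
    # 1st request: register the blueprint under the given name
    ambari_post "blueprints/$1" "$2" &&
    # 2nd request: create the cluster from the host mapping (starts the install)
    ambari_post "clusters/$3" "$4"
}
```

A call could look like blueprint_install demo blueprint.json demo-cluster hostmapping.json, where hostmapping.json references the blueprint and assigns hosts to its host groups.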


Kafka Security with Kerberos

Apache Kafka, developed as a durable and fast messaging queue for handling real-time data feeds, originally did not come with any security approach. Similar to Hadoop, Kafka was initially expected to be used in a trusted environment, focusing on functionality instead of compliance. With the ever-growing popularity and widespread use of Kafka, the community recently gained traction around a complete security design including authentication with Kerberos and SSL, encryption, and authorization. Judging by the details of the security proposal found here, the complete security measures will be included in the 0.9 release of Apache Kafka.

HDP 2.3 already supports a secure implementation of Kafka with authentication and authorization. Especially through the integration with the security framework Apache Ranger, this becomes a comprehensive security solution for any Hadoop deployment with real-time data demands. In this post we look by example at how working with a kerberized Kafka broker differs from before, using the known shell tools and a custom Java producer.

10 Resources for Deep Dive Into Apache Flink

Around 2009 the Stratosphere research project started at the TU Berlin; a few years later it was set to become the Apache Flink project. Often compared with Apache Spark, Apache Flink additionally offers pipelining (inter-operator parallelism) to better suit incremental data processing, making it more suitable for stream processing. In total, the Stratosphere project aimed to provide the following contributions to Big Data processing, most of which can be found in Flink today:

1 – High-level, declarative language for data analysis
2 – “In situ” data analysis for external data sources
3 – Richer set of primitives than MapReduce
4 – UDFs as first class citizens
5 – Query optimization
6 – Support for iterative processing
7 – Execution engine (Nephele) with external memory query processing

The Architecture of Stratosphere:

The Stratosphere software stack

This post contains 10 resources highlighting the foundations Apache Flink is built on today.

Distcp Between Kerberized and Non-Kerberized Clusters

The standard tool for copying data between two clusters is probably distcp. It can also be used to keep the data of two clusters in sync; the update is an asynchronous process using a fairly basic update strategy. Distcp is a simple tool, but some edge cases can get complicated. The distributed copy between two HA clusters is one such case. It is also important to know that, since the RPC versions used by HDFS can differ between clusters, it is always a good idea to read the data from the source system over an HTTP-based protocol like hftp (which is read-only) or webhdfs. The source URL could look like this: hftp://source.cluster/users/me. WebHDFS works as well, because it does not use RPC either.

Another corner case is the need to copy data between a secure and a non-secure cluster. Such a copy should always be triggered from the secure cluster, where the user running the job holds a valid Kerberos ticket to authenticate against it. On its own this would still yield an exception, as the system complains about a missing fallback mechanism. To make this work, set ipc.client.fallback-to-simple-auth-allowed to true in the core-site.xml of the secure cluster.


What is left to do is to make sure the user has the required rights on both systems to read and write the data.
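
Put together, a copy from the non-secure into the secure cluster could be sketched as below. This is a sketch under assumptions, not a verified recipe: the target URL hdfs://secure.cluster/users/me and the function name are hypothetical, and instead of editing core-site.xml the property is passed for this one job through Hadoop's generic -D option. The command is meant to be run on the secure cluster after obtaining a ticket with kinit.

```shell
#!/bin/sh
HADOOP="${HADOOP:-hadoop}"

# copy_from_insecure SRC_URL DEST_URL
#   SRC_URL  - HTTP-based URL on the non-secure source cluster (hftp or webhdfs)
#   DEST_URL - HDFS URL on the secure cluster (hypothetical in the usage below)
copy_from_insecure() {
    # Allow the Kerberos-enabled client to fall back to simple authentication
    # when it talks to the non-secure cluster.
    "$HADOOP" distcp \
        -D ipc.client.fallback-to-simple-auth-allowed=true \
        "$1" "$2"
}
```

After a successful kinit, the call would be copy_from_insecure hftp://source.cluster/users/me hdfs://secure.cluster/users/me.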

Storm Serialization with Avro (using Kryo Serializer)

Working with complex data events can be a challenge when designing Storm topologies for real-time data processing. In such cases, emitting single values for multiple and varying event characteristics soon reveals its limitations. For message serialization Storm leverages the Kryo serialization framework, which is used by many other projects. Kryo keeps a registry of serializers for the corresponding Class types. Mappings in that registry can be overridden or added, making the framework extensible to diverse type serializations.

Avro, on the other hand, is a very popular “data serialization system” that bridges many different programming languages and tools. While the fact that data objects can be described in JSON makes it really easy to use, Avro is often used for its support of schema evolution. With schema evolution, the same implementation (Storm topology) can read different versions of the same data event without adaptation. This makes it a very good fit for Storm as an intermediary between data ingestion points and data storage in today’s Enterprise Data Architectures.

Storm Enterprise Data Architecture

The example here does not provide complex event samples to illustrate that point, but it gives an end-to-end implementation of a Storm topology where events are sent to a Kafka queue as Avro objects and processed natively by a real-time processing topology. The example can be found here. It is a simple Hive Streaming example where stock events are read from a CSV file and sent to Kafka. As already mentioned, stock events are a flat, non-complex data type, but we’ll still use them to demo serialization with Avro.

Install HDP with Red Hat Satellite

As part of the installation of HDP with Ambari, two repositories are generated from the URLs entered during the first steps of the install wizard and distributed to the cluster hosts. If you are using Red Hat Satellite to manage your Linux infrastructure, you need to disable these generated repositories in order to leverage Red Hat Satellite. The same is true for SUSE Manager (Spacewalk).

Prior to the install and prior to starting the Ambari server you need to disable the repositories by altering the template responsible for generating them.

Prior to Ambari 2.x you need to change the repo_suse_rhel.j2 template to disable the generated repositories. In that file simply change enabled=1 to enabled=0. To find the template file run $ find /var/lib/ambari-server -name repo_suse_rhel.j2 .
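
The find and edit steps can be combined in a small script. The following is a minimal sketch, assuming the template contains a plain enabled=1 line; the function name and the .bak backup suffix are made up for illustration. Run it before starting the Ambari server.

```shell
#!/bin/sh
# disable_generated_repos [SEARCH_ROOT]
# Sets enabled=0 in every repo_suse_rhel.j2 below SEARCH_ROOT
# (default: /var/lib/ambari-server) and keeps a .bak copy of the original.
disable_generated_repos() {
    search_root="${1:-/var/lib/ambari-server}"
    find "$search_root" -name repo_suse_rhel.j2 | while IFS= read -r template; do
        cp -p "$template" "$template.bak" || exit 1
        # Rewrite from the backup so the edit works without sed -i
        sed 's/^enabled=1[[:space:]]*$/enabled=0/' "$template.bak" > "$template"
    done
}
```

Calling disable_generated_repos without arguments searches /var/lib/ambari-server, the same location used by the find command above.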

Starting with Ambari 2.x the configuration for the repositories can be found in cluster-env.xml under /var/lib/ambari-server/resources/stacks/HDP/2.0.6/configuration. In that file look for <name>repo_suse_rhel_template</name> and, here as well, change the value of enabled to 0.

Save the changes and start your install.

Storm Flux: Easy Streaming Deployment

With Flux for Apache Storm, deploying streaming topologies for real-time processing becomes less programmatic and more declarative. Using Flux for deployments makes it less likely that you will have to recompile your project just because you have reconfigured or rearranged your topology. It leverages YAML, a human-readable serialization format, to describe a topology as a whole. You might still need to write some classes, but by taking advantage of existing, generic implementations this becomes less likely.

While Flux can also be used with an existing topology, for this post we’ll take the Hive-Streaming example you can find here (blog post) to create the required topology from scratch using Flux. For experiments and demo purposes you can use the following Vagrant setup to run an HDP cluster locally.
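
To give an idea of what a declarative deployment looks like on the command line, here is a minimal sketch. It assumes a topology jar that bundles Flux and a YAML topology definition; the file names and the function name are hypothetical. The Flux main class takes the YAML file as its argument, together with --local or --remote to choose between local mode and cluster submission.

```shell
#!/bin/sh
STORM="${STORM:-storm}"

# deploy_flux_topology JAR YAML [local|remote]
#   JAR  - topology jar including the Flux classes (hypothetical name below)
#   YAML - Flux topology definition
#   mode - "remote" (default) submits to the cluster, "local" runs in local mode
deploy_flux_topology() {
    "$STORM" jar "$1" org.apache.storm.flux.Flux "--${3:-remote}" "$2"
}
```

Changing the wiring of the topology then means editing the YAML file and calling deploy_flux_topology my-topology.jar topology.yaml again, without rebuilding the jar.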

JPMML Example Random Forest

The Predictive Model Markup Language (PMML) developed by the Data Mining Group is a standardized XML-based representation of mining models to be used and shared across languages or tools. The standardized definition allows, for example, a classification model trained with R to be used with Storm. Many projects related to Big Data have some support for PMML, which is often implemented by JPMML.

A Java Agent Example (-javaagent)

Since Java 5, developers have been able to define so-called pre-main hooks to manipulate the execution of a Java program at runtime with Java agents. An agent on the classpath is triggered before the execution of the main method and can therefore be used to either filter calls to, or even manipulate, the underlying Java code. A tool for code manipulation is Javassist. Apache Ranger, for example, uses both Java agents and Javassist to override the authorization mechanism of components of the Hadoop stack. Together with Ranger Stacks, this could also be used to secure existing code unchanged at runtime.

In this post we are going to look at a very basic example of using Java agents to manipulate existing code.
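
On the command line, an agent is attached with the -javaagent option that gives this post its title. The sketch below only shows the launch side and rests on assumptions: the jar names and the function name are hypothetical, and the agent jar is expected to name its agent class in a Premain-Class manifest entry so that the JVM can call its premain method before main.

```shell
#!/bin/sh
JAVA="${JAVA:-java}"

# run_with_agent AGENT_JAR APP_JAR [APP_ARGS...]
# Starts the application with the agent attached; the JVM invokes the
# agent's premain method before the application's main method.
run_with_agent() {
    agent_jar="$1"
    app_jar="$2"
    shift 2
    "$JAVA" "-javaagent:$agent_jar" -jar "$app_jar" "$@"
}
```

A call such as run_with_agent agent.jar app.jar leaves app.jar itself unchanged, which is what makes agents attractive for altering existing code.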