Creating a HDP Ready CentOS Image for Azure

The Windows Azure gallery already contains some collections of VM images, and on VM Depot the community can share custom provisioned images. Looking for an HDP-ready image based on CentOS, I could not find one suitable for my needs. In this post I would like to describe how I created my first HDP-ready image based on CentOS for Windows Azure. The image is created within Azure itself, as I don’t have access to Hyper-V and all my attempts to create a CentOS-based VHD with VirtualBox failed for unknown reasons.


Installing HttpFS Gateway on a Kerberized Cluster

The HttpFS gateway is the preferred way of accessing the Hadoop filesystem using HTTP clients like curl. It can also be used from the hadoop fs command line tool, which ultimately makes it a replacement for the hftp protocol. HttpFS, unlike HDFS Proxy, has full support for all file operations, with additional support for authentication. Given its stateless protocol, it is ideal for scaling out Hadoop filesystem access using HTTP clients.

In this post I would like to show how to install and set up an HttpFS gateway on a secure, kerberized cluster. By covering some troubleshooting topics, this post should also help you when you run into problems while installing the gateway.
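
To give an idea of what HTTP access through the gateway looks like, here is a minimal sketch that builds the REST URL for a file system operation. It assumes the gateway listens on the HttpFS default port 14000 and exposes the WebHDFS-compatible `/webhdfs/v1` path. The helper name `httpfs_url` and the host `httpfs.example.com` are made up for illustration and are not part of HttpFS.

```python
from urllib.parse import quote, urlencode


def httpfs_url(host, path, op, port=14000, **params):
    """Build an HttpFS REST URL for one file system operation.

    host and path are placeholders supplied by the caller; port 14000 is
    assumed to be the gateway's default and may differ in your setup.
    """
    # The operation goes into the query string; extra parameters follow it.
    query = urlencode({"op": op, **params})
    # quote() escapes special characters but keeps the '/' separators.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical gateway host, used only for illustration.
print(httpfs_url("httpfs.example.com", "/user/alice", "LISTSTATUS"))
# prints http://httpfs.example.com:14000/webhdfs/v1/user/alice?op=LISTSTATUS
```

On a kerberized cluster such a URL is typically requested with SPNEGO authentication, for example with curl's `--negotiate -u :` options after obtaining a ticket with kinit.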

10 Resources to Become Reactive

Today I signed the Reactive Manifesto, which is prominently backed by developers from companies like Netflix, Typesafe, Twitter, or Oracle. It is my strong belief that the ever-growing volume of data to be processed needs a coherent approach towards event-driven architectures to meet today’s demands.

In the same area I see projects like Kafka, around which Confluent, a new spinoff out of LinkedIn, was recently announced. I also count Spark, backed by Databricks and currently seeing a lot of attention, as an example of a new generation of reactive applications.

These are some of the resources that got my attention:

  1. Reactive Manifesto
  2. Reactive Streams
  3. Advanced Reactive Programming with Akka and Scala
  4. Introducing Actors Akka Notes Part 1
  5. Clustering reactmq with Akka Cluster
  6. Replacing ZeroMQ with RTI Context DDS in an Actor Based System
  7. Evaluating Persistent Replicated Message Queues
  8. Reactive Queue with Akka Reactive Streams
  9. Making the Reactive Queue Durable with Akka Persistence
  10. Scala and the Akka Event Bus

Discover HDP 2.2 Webinar Series

With HDP 2.2 about to be released, it is a good idea to begin a deep dive into the new features in Hadoop with this webinar series:

  • Discover HDP 2.2: Data Storage Innovations in Hadoop Distributed File System (HDFS)

    (Replay / Slides)

  • Discover HDP 2.2: Learn What’s New in YARN: Reliability, Scheduling and Isolation

    (Register Now – Thursday, November 20, 2014)

  • Discover HDP 2.2: Apache Storm and Apache Kafka for Stream Data Processing

    (Replay / Slides)

  • Discover HDP 2.2: Apache HBase with YARN and Slider for Fast, NoSQL Data Access


  • Discover HDP 2.2: Using Apache Ambari to Manage Hadoop Clusters


  • Discover HDP 2.2: Apache Falcon for Hadoop Data Governance

    (Replay / Slides)

  • Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and

    (Replay / Slides)

  • Discover HDP 2.2: Comprehensive Hadoop Security with Apache Ranger and Apache Knox

    (Replay / Slides)

YARN Ready! Are You?

YARN is changing the face of Big Data as we know it today. Breaking with by now well-established patterns like MapReduce, YARN gives clients the ability to run diverse distributed algorithms on one cluster under one resource management. In addition, Apache Slider brings the possibility to ‘slide’ existing long-running services onto the same cluster with the same resource provider.

With the upcoming release of HDP 2.2, HBase services are running under the management of YARN. The same will be true for Storm, getting us one step closer to the vision of an Enterprise Hadoop Data Lake. One important aspect in this scenario is yet to come: YARN-796, aka YARN Labels.

Recently the technical preview of Hortonworks Data Platform 2.2 was released with a Sandbox image for download, giving you the possibility to try out the concepts of tomorrow’s Big Data platform today with these tutorials. You can also try some of the virtual environments I’ve put together here: hdp22-n1-centos6-puppet or hdp22-n3-centos6-puppet.

We are experiencing the dawn of a new era in Hadoop, Hadoop v2. Are you YARN ready? Here are some resources to get you going: