Provisioning a Cluster Using Vagrant and OpenStack

Vagrant has become very popular for provisioning virtual machines for development. Usually it’s used in combination with VirtualBox on a local machine. But Vagrant supports multiple other visualization providers, in fact one can build a custom provider as needed. If the local machine is not sufficient for the needs of development moving to the cloud seems like a reasonable thing to do using AWS, Rackspace, or OpenStack. Continue reading “Provisioning a Cluster Using Vagrant and OpenStack”


Apache Knox: A Hadoop Bastion

Lately a lot of effort went into making Hadoop setups more secure for enterprise ready installations. With Apache Knox comes a connecting strap for your cluster that acts like a bastion server shielding direct access to your nodes. Knox is stateless and can therefor easily scale horizontally with the obvious limitation of also just supporting stateless protocols. Knox provides the following functionality:

  1. Authentication
    Users and groups can be managed using LDAP or Active Directory
  2. Federation/SSO
    Knox uses HTTP header based identity federation
  3. Authorization
    Authorization is mainly supported on service level through access control lists (ACL)
  4. Auditing
    Access through Knox is audited for

Here we are going to explore the necessary steps for a Knox setup. In this setup the authentication process is going through a LDAP directory service running on the same node as Knox while separated from the Hadoop cluster. Knox comes with an embedded Apache Directory for demo purposes. You can also read here on how to setup a secure OpenLDAP. Knox LDAP service can be started like this:

bin/ start

Here we are going to explorer necessary steps to setup Apache Know for your environment. Continue reading “Apache Knox: A Hadoop Bastion”

OpenLDAP Setup with CA Signed Certificate on CentOS

A central directory service is a common fragment of Enterprise IT infrastructures. Frequently companies organize their complete user management through a directory service, giving them the comfort of SSO. This makes it a requirement for services shared by corporate users to seamlessly integrate with the authentication service. The integration of a directory service – may it be an OpenLDAP, Apache Directory Server, or Active Directory – is one of the most common cornerstones of a Hadoop installation.

In up coming posts I am going to highlight some of the necessary steps for a dependable integration of Hadoop in today’s secure enterprise infrastructures including a demonstration of Apache Argus. As a preliminary step we are going to revisit some basic principals in this post that comprises a secure PKI, and a central OpenLDAP directory service. The knowledge of this is going to be presumed going forward. In this post CentOS is used as the operation system. Continue reading “OpenLDAP Setup with CA Signed Certificate on CentOS”

10+ Resources For A Deep Dive Into Spark

Spark, initially an amplab project, is widely seen as the next top compute model for distributed processing. It elaborates strongly on the actor model provided through Akka. Some already argue it is going to replace everything there is – namely MapReduce. While this is hardly going to be the case, without any doubt Spark will become a core asset of modern data architectures. Here you’ll find a collection of 10 resources eligible for a deep dive into Spark:

  1. Spark: Cluster Computing with Working Sets
    One of the first publications about Spark from 2010.
  2. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
    A publication about one of the core concepts of Spark, the resilient distributed dataset (RDD).
  3. Disk-Locality in Datacenter Computing Considered Irrelevant
    Current distributed approaches are mainly centered around the concept of data (disk) locality. Especially MapReduce is based on this concept. The authors of this publication argue for a shift away from disk-locality to memory-locality in today’s distributed environments.
  4. GraphX: A Resilient Distributed Graph System on Spark
    A very promising use case apart from ML is the use of Spark for large scale graph analysis.
  5. Spark at Twitter – Seattle Spark Meetup, April 2014
    Twitter shares some of their viewpoints and the lessons they have learned.
  6. MLlib is a Spark Implementation of some common machine learning algorithms
  7. Discretized Streams: Fault-Tolerant Streaming Computation at Scale
    Reactive Akka Streams
  8. Shark makes Hive faster and more powerful
  9. Running Spark on YARN
    YARN (amazon book) as the operating system of tomorrows data architectures was particularly designed for different compute models as Spark.
  10. Spark SQL unifies access to structured data
  11. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data
  12. Spark Packages
  13. Managing Data Transfers in Computer Clusters with
    Orchestra (default broadcast mechanism)