OpenLDAP Setup with CA Signed Certificate on CentOS

A central directory service is a common component of enterprise IT infrastructures. Companies frequently organize their complete user management through a directory service, giving them the convenience of single sign-on (SSO). This makes it a requirement for services shared by corporate users to integrate seamlessly with the authentication service. The integration of a directory service – be it OpenLDAP, Apache Directory Server, or Active Directory – is one of the most common cornerstones of a Hadoop installation.

In upcoming posts I am going to highlight some of the necessary steps for a dependable integration of Hadoop into today’s secure enterprise infrastructures, including a demonstration of Apache Argus. As a preliminary step, this post revisits some basic principles: a secure PKI and a central OpenLDAP directory service. Knowledge of these will be presumed going forward. In this post CentOS is used as the operating system. Continue reading “OpenLDAP Setup with CA Signed Certificate on CentOS”

10+ Resources For A Deep Dive Into Spark

Spark, initially an AMPLab project, is widely seen as the next top compute model for distributed processing. Internally it builds on the actor model provided by Akka. Some already argue it is going to replace everything there is – namely MapReduce. While this is hardly going to be the case, Spark will without a doubt become a core asset of modern data architectures. Here you’ll find a collection of 10+ resources suitable for a deep dive into Spark:

  1. Spark: Cluster Computing with Working Sets
    One of the first publications about Spark from 2010.
  2. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
    A publication about one of the core concepts of Spark, the resilient distributed dataset (RDD) – a short PySpark sketch follows the list.
  3. Disk-Locality in Datacenter Computing Considered Irrelevant
    Current distributed approaches are mainly centered around the concept of data (disk) locality. MapReduce in particular is based on this concept. The authors of this publication argue for a shift away from disk-locality towards memory-locality in today’s distributed environments.
  4. GraphX: A Resilient Distributed Graph System on Spark
    A very promising use case apart from ML is the use of Spark for large-scale graph analysis.
  5. Spark at Twitter – Seattle Spark Meetup, April 2014
    Twitter shares some of their viewpoints and the lessons they have learned.
  6. MLlib is a Spark implementation of some common machine learning algorithms
  7. Discretized Streams: Fault-Tolerant Streaming Computation at Scale
    See also: Reactive Akka Streams
  8. Shark makes Hive faster and more powerful
  9. Running Spark on YARN
    YARN (Amazon book) as the operating system of tomorrow’s data architectures was designed specifically to support compute models other than MapReduce, such as Spark.
  10. Spark SQL unifies access to structured data
  11. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data
  12. Spark Packages
  13. Managing Data Transfers in Computer Clusters with Orchestra (default broadcast mechanism)
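To make the RDD concept from item 2 concrete, here is a minimal PySpark sketch – a word count over a placeholder input file. The file path and application name are assumptions; this merely illustrates the RDD API, it is not code from any of the listed resources.

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")

# An RDD is a lazily evaluated, partitioned, fault-tolerant collection;
# its lineage (the chain of transformations below) lets Spark recompute
# lost partitions. "input.txt" is a placeholder path.
lines = sc.textFile("input.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# cache() keeps the RDD in memory across actions -- the key idea behind
# Spark's speed for iterative workloads.
counts.cache()
print(counts.take(10))
sc.stop()
```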

Sliding Applications onto YARN

Along with a Hadoop cluster installation usually come some well-established services that are part of certain use cases. Rarely is it possible to fully satisfy complex use cases by applying MapReduce alone. There could be ElasticSearch for search or a Cassandra cluster for indexing. These and other complementary components of a Hadoop cluster, like HBase, Storm, or Hive, bring the burden of additional complexity when it comes to cluster planning, management, or monitoring. Think, for example, of the memory planning for a DataNode that also runs Cassandra: you would have to decide upfront how much of the available memory to allocate to each. And what happens when you add or remove Cassandra nodes in the cluster?

YARN was designed to manage different sets of workloads on a Hadoop setup besides MapReduce. So with modern Hadoop installations the way to deal with the above challenges is to port the needed services to YARN. Some of the common services have been or are being ported to YARN in a YARN-Ready program led by Hortonworks. As porting existing services to YARN can be quite challenging by itself, Apache Slider (incubating) was developed to let YARN support long-running services without requiring any changes to them. Apache Slider’s promise is to run these applications inside YARN unchanged.
Continue reading “Sliding Applications onto YARN”

Provisioning a HDP Dev Cluster with Vagrant

Setting up a production or development Hadoop cluster used to be much more tedious than it is today with tools like Puppet, Chef, and Vagrant. Additionally, the Hadoop community has kept investing in ease of deployment, listening to the demands of experienced system administrators. The latest of these investments is Ambari Blueprints.

With Ambari Blueprints, DevOps engineers can configure an automated setup of individual components on each node across a cluster. This setup can then be reused to replicate it onto different clusters for development, integration, or production.
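As a rough, hypothetical sketch of what this looks like against Ambari’s REST API: a blueprint is registered first, then a cluster is instantiated from it. Host names, credentials, and the (heavily trimmed) blueprint content below are placeholders, not the configuration used in this post.

```python
import json
import requests

AMBARI = "http://ambari.example.com:8080/api/v1"   # placeholder host
AUTH = ("admin", "admin")                # Ambari's default credentials
HEADERS = {"X-Requested-By": "ambari"}   # header required by Ambari

# A heavily trimmed blueprint: one host group with two master components.
blueprint = {
    "Blueprints": {"stack_name": "HDP", "stack_version": "2.1"},
    "host_groups": [{
        "name": "master",
        "cardinality": "1",
        "components": [{"name": "NAMENODE"}, {"name": "RESOURCEMANAGER"}],
    }],
}

# Maps the blueprint's host groups to concrete hosts.
cluster = {
    "blueprint": "dev-blueprint",
    "host_groups": [{"name": "master",
                     "hosts": [{"fqdn": "node1.example.com"}]}],
}

# Step 1: register the blueprint. Step 2: create a cluster from it.
requests.post(AMBARI + "/blueprints/dev-blueprint",
              data=json.dumps(blueprint), auth=AUTH, headers=HEADERS)
requests.post(AMBARI + "/clusters/dev",
              data=json.dumps(cluster), auth=AUTH, headers=HEADERS)
```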

In this post we are going to set up a three-node HDP 2.1 cluster for development on a local machine using Vagrant and Ambari.
Most of what is presented here builds on previous work published by various authors, which is referenced at the end of this post. Continue reading “Provisioning a HDP Dev Cluster with Vagrant”

Hadoop Security: 10 Resources To Get You Started

As Hadoop moves into the center of today’s enterprise data architectures, security becomes a critical requirement. This can be witnessed in the most recent acquisitions by leading Hadoop vendors, and also in the numerous security-centered projects that have been launched or have recently been gaining traction.

Here are 10 resources to get you started on the topic:

  1. Hadoop Security Design (2009 White Paper)
  2. Hadoop Security Design? – Just Add Kerberos? Really? (Black Hat 2010)
  3. Hadoop Poses a Big Data Security Risk: 10 Reasons Why
  4. Apache Knox – A gateway for Hadoop clusters
  5. Apache Argus
  6. Project Rhino
  7. Protegrity Big Data Protector
  8. Dataguise for Hadoop
  9. Secure JDBC and ODBC Clients’ Access to HiveServer2
  10. InfoSphere Optim Data Masking


Responsive D3.js Modules with AngularJS

Trying to build a modular web application for data visualization using D3.js can be quite daunting. Certainly D3 offers event listeners, but arranging them in reusable modules to meet the requirements of today’s interactive applications seems tedious. In such a scenario, AngularJS can be of great help in creating responsive visualizations for the web. By using AngularJS directives, nothing but web standards like HTML, CSS, and SVG is needed to build powerful data-driven applications.

To demonstrate the possibilities of integrating D3.js as a directive into an AngularJS analytics dashboard, we are going to plot some access statistics for this blog, which I exported from an analytics tool beforehand. Continue reading “Responsive D3.js Modules with AngularJS”

Iron Blogger: In for a Perfect Game

Since this year’s re:publica 2014 conference I have been part of the Iron Blogger community here in Munich. Like many good decisions this was a spontaneous move, motivated by a talk given by a former co-worker at the conference. His topic was winning back the Internet by decentralizing content – a core concept of the Internet – away from centralized providers. By relying on third-party products like Facebook, Tumblr, Twitter, and G+ we lose the autonomy over the content we publish.

While this is a genuine intent, it was not the driving force behind my motivation to become an Iron Blogger. All it took to get me involved was a side note in which he mentioned that he had recently started to write more after joining the Iron Blogger movement. This caught my attention, and I was curious to find out more, as I had long wanted to publish more posts myself. It took me no more than a couple of minutes to decide that I wanted to try this. Here I would like to share my experience so far and my newly set goal – In for a Perfect Game. Continue reading “Iron Blogger: In for a Perfect Game”

Training Multiple SVM Classifiers with Apache Pig

Inspired by Twitter’s publication about “Large Scale Machine Learning”, I turned to Pig when it came to implementing an SVM classifier for record linkage. Searching for different solutions, I also came across a presentation by the Huffington Post using a similar approach to training multiple SVM models. The overall idea is to use Hadoop to train multiple models with different parameters at the same time, selecting the best model for the actual classification. There are some limitations to this approach, which I’ll try to address at the end of this post, but first let me describe my approach to training multiple SVM classifiers with Pig.

Disclaimer: This post does not describe the process of training one model in parallel, but rather training multiple models at the same time on multiple machines.
Continue reading “Training Multiple SVM Classifiers with Apache Pig”
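The post implements this with Pig on a Hadoop cluster; as a minimal single-machine sketch of the same idea, here is a hypothetical Python version using scikit-learn: one SVM is trained per parameter combination, in parallel, and the best-scoring model is kept. The dataset, parameter grid, and process count are assumptions for illustration only.

```python
from itertools import product
from multiprocessing import Pool

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A stand-in dataset; the post works on record-linkage features instead.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=42)

def train_and_score(params):
    """Train one SVM for a single (C, gamma) combination."""
    C, gamma = params
    model = SVC(C=C, gamma=gamma).fit(X_train, y_train)
    return model.score(X_test, y_test), params

if __name__ == "__main__":
    grid = list(product([0.1, 1.0, 10.0], [0.001, 0.01, 0.1]))
    # Each combination is independent -- on Hadoop these would be
    # parallel tasks; here they are merely worker processes.
    with Pool(4) as pool:
        results = pool.map(train_and_score, grid)
    best_score, best_params = max(results)
    print("best (C, gamma): %s, accuracy: %.3f" % (best_params, best_score))
```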

Python Virtualenv with Hadoop Streaming

If you are using Python with Hadoop Streaming a lot, you might know the trouble of keeping all nodes up to date with the required packages. A nice way to work around this is to use a Virtualenv for each streaming project. Besides removing the hurdle of keeping all nodes in sync with the necessary libraries, another advantage of using Virtualenv is the possibility to seamlessly try different versions and setups within the same project.

In this example we are going to create a Python job that counts the n-grams of hotel names in relation to the country the hotel is located in. Besides the use of a Virtualenv in which we install NLTK, we are going to cover the use of Avro as an input for a Python streaming job, as well as secondary sorting with the use of KeyFieldBasedPartitioner and KeyFieldBasedComparator. Continue reading “Python Virtualenv with Hadoop Streaming”
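To give a flavor of the streaming part, here is a hypothetical, simplified mapper: it emits character trigrams of hotel names keyed by country, assuming plain tab-separated input of the form country<TAB>hotel name. The actual post reads Avro and adds secondary sorting on top of this.

```python
#!/usr/bin/env python
# Simplified streaming mapper: character trigrams of hotel names, keyed
# by country. The input format is an assumption (tab-separated text);
# the post's job reads Avro instead.
import sys

from nltk.util import ngrams  # NLTK lives in the shipped virtualenv

N = 3

for line in sys.stdin:
    try:
        country, name = line.rstrip("\n").split("\t", 1)
    except ValueError:
        continue  # skip malformed lines
    for gram in ngrams(name.lower(), N):
        # Emit "country \t ngram \t 1"; the reducer sums the ones.
        print("%s\t%s\t%d" % (country, "".join(gram), 1))
```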

Using Hive from R with JDBC

RHadoop is probably one of the best ways to take advantage of Hadoop from R, making use of Hadoop’s streaming capabilities. Another possibility to make R work with Big Data in general is the use of SQL, for example through a JDBC connector. For Hive, such a possibility exists with the HiveServer2 JDBC client. In combination with UDFs this has the potential to be quite a powerful approach that leverages the best of the two worlds. In this post I would like to demonstrate the preliminary steps necessary to make R and Hive work together.

If you have the Hortonworks Sandbox set up, you should be able to simply follow along as you read. If not, you should be able to adapt the steps where appropriate. First we’ll have to install R on a machine with access to Hive. By default this means the machine should be able to reach port 10000 (or 10001 in HTTP transport mode), where HiveServer2 listens. Next, after setting up all required packages, we are going to query a sample Hive table from R.
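Just to illustrate the connection parameters involved – the post itself does this from R, so the following is merely a hypothetical Python equivalent using the jaydebeapi package with the same HiveServer2 JDBC driver. The host, the sample_07 table, and the jar locations are assumptions for your environment.

```python
import jaydebeapi

# Connect to HiveServer2 through its JDBC driver; the jar paths below
# are placeholders and depend on your Hive installation.
conn = jaydebeapi.connect(
    "org.apache.hive.jdbc.HiveDriver",
    "jdbc:hive2://localhost:10000/default",
    ["hive", ""],  # user, password
    ["/usr/lib/hive/lib/hive-jdbc-standalone.jar",   # placeholder path
     "/usr/lib/hadoop/hadoop-common.jar"],           # placeholder path
)

cursor = conn.cursor()
cursor.execute("SELECT code, description FROM sample_07 LIMIT 10")
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()
```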

Continue reading “Using Hive from R with JDBC”