Vagrant has become very popular for provisioning virtual machines for development. Usually it is used in combination with VirtualBox on a local machine, but Vagrant supports multiple other virtualization providers; in fact, one can build a custom provider as needed. If the local machine is not sufficient for the needs of development, moving to the cloud with AWS, Rackspace, or OpenStack is a reasonable next step. Continue reading “Provisioning a Cluster Using Vagrant and OpenStack”
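For a taste of what that looks like, switching providers is usually just a matter of installing a plugin and selecting it at boot time. A minimal sketch, assuming the community vagrant-openstack-provider plugin (adjust to whichever provider plugin you actually use):
# install the third-party OpenStack provider plugin
vagrant plugin install vagrant-openstack-provider
# boot the machines defined in the Vagrantfile against OpenStack instead of VirtualBox
vagrant up --provider=openstack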
Apache Knox: A Hadoop Bastion
Lately a lot of effort has gone into making Hadoop setups more secure for enterprise-ready installations. Apache Knox adds a gateway to your cluster that acts like a bastion server, shielding direct access to your nodes. Knox is stateless and can therefore easily scale horizontally, with the obvious limitation of only supporting stateless protocols. Knox provides the following functionality:
- Authentication
  Users and groups can be managed using LDAP or Active Directory.
- Federation/SSO
  Knox uses HTTP header based identity federation.
- Authorization
  Authorization is mainly supported on the service level through access control lists (ACLs).
- Auditing
  Access through Knox is audited.
Here we are going to explore the necessary steps for a Knox setup. In this setup authentication is handled by an LDAP directory service running on the same node as Knox but separate from the Hadoop cluster. Knox comes with an embedded Apache Directory server for demo purposes. You can also read here how to set up a secure OpenLDAP. The Knox demo LDAP service can be started like this:
cd {KNOX_HOME}
bin/ldap.sh start
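Once Knox and the demo LDAP are running, requests can be routed through the gateway. A quick smoke test might look like this (a sketch assuming the default sandbox topology and the guest user shipped with the demo LDAP):
# list the HDFS root directory through the Knox gateway (-k skips certificate validation for the demo self-signed cert)
curl -iku guest:guest-password "https://localhost:8443/gateway/sandbox/webhdfs/v1/?op=LISTSTATUS"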
Here we are going to explore the necessary steps to set up Apache Knox for your environment. Continue reading “Apache Knox: A Hadoop Bastion”
OpenLDAP Setup with CA Signed Certificate on CentOS
A central directory service is a common component of enterprise IT infrastructures. Frequently, companies organize their complete user management through a directory service, giving them the comfort of SSO. This makes it a requirement for services shared by corporate users to integrate seamlessly with the authentication service. The integration of a directory service – be it OpenLDAP, Apache Directory Server, or Active Directory – is one of the most common cornerstones of a Hadoop installation.
In upcoming posts I am going to highlight some of the necessary steps for a dependable integration of Hadoop into today’s secure enterprise infrastructures, including a demonstration of Apache Argus. As a preliminary step, this post revisits some basic principles: a secure PKI and a central OpenLDAP directory service. This knowledge will be presumed going forward. CentOS is used as the operating system. Continue reading “OpenLDAP Setup with CA Signed Certificate on CentOS”
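To give a rough idea of what is covered, the TLS part of the setup boils down to having the CA sign a certificate for the LDAP host and pointing slapd at the resulting files. A minimal sketch, with host names, paths, and the CA key pair as placeholders:
# create a private key and a certificate signing request for the LDAP host
openssl genrsa -out ldap.example.com.key 2048
openssl req -new -key ldap.example.com.key -subj "/CN=ldap.example.com/O=Example/C=US" -out ldap.example.com.csr
# sign the request with the (local) CA
openssl x509 -req -in ldap.example.com.csr -CA ca.crt -CAkey ca.key -CAcreateserial -days 365 -out ldap.example.com.crt
# point slapd (cn=config) at the CA and server certificates
cat <<'EOF' | ldapmodify -Y EXTERNAL -H ldapi:///
dn: cn=config
changetype: modify
replace: olcTLSCACertificateFile
olcTLSCACertificateFile: /etc/openldap/certs/ca.crt
-
replace: olcTLSCertificateFile
olcTLSCertificateFile: /etc/openldap/certs/ldap.example.com.crt
-
replace: olcTLSCertificateKeyFile
olcTLSCertificateKeyFile: /etc/openldap/certs/ldap.example.com.key
EOF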
10+ Resources For A Deep Dive Into Spark
Spark, initially an AMPLab project, is widely seen as the next top compute model for distributed processing. It builds heavily on the actor model provided through Akka. Some already argue it is going to replace everything there is – namely MapReduce. While this is hardly going to be the case, without any doubt Spark will become a core asset of modern data architectures. Here you’ll find a collection of 10+ resources for a deep dive into Spark:
- Spark: Cluster Computing with Working Sets
  One of the first publications about Spark, from 2010.
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
  A publication about one of the core concepts of Spark, the resilient distributed dataset (RDD).
- Disk-Locality in Datacenter Computing Considered Irrelevant
  Current distributed approaches are mainly centered around the concept of data (disk) locality; MapReduce in particular is based on it. The authors of this publication argue for a shift away from disk locality towards memory locality in today’s distributed environments.
- GraphX: A Resilient Distributed Graph System on Spark
  A very promising use case apart from ML is the use of Spark for large-scale graph analysis.
- Spark at Twitter – Seattle Spark Meetup, April 2014
  Twitter shares some of their viewpoints and the lessons they have learned.
- MLlib
  A Spark implementation of some common machine learning algorithms.
- Discretized Streams: Fault-Tolerant Streaming Computation at Scale
  See also Reactive Akka Streams.
- Shark
  Shark makes Hive faster and more powerful.
- Running Spark on YARN
  YARN (Amazon book), designed as the operating system of tomorrow’s data architectures, supports compute models beyond MapReduce, such as Spark (see the sketch after this list).
- Spark SQL
  Spark SQL unifies access to structured data.
- BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data
- Spark Packages
- Managing Data Transfers in Computer Clusters with Orchestra (default broadcast mechanism)
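As referenced above for running Spark on YARN, here is a hedged sketch of submitting the bundled SparkPi example to a YARN cluster (Spark 1.x style; paths and version numbers are placeholders for whatever distribution is installed):
export HADOOP_CONF_DIR=/etc/hadoop/conf
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --num-executors 3 \
  --executor-memory 2g \
  lib/spark-examples-1.1.0-hadoop2.4.0.jar 10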
Sliding Applications onto YARN
Along with a Hadoop cluster installation usually come some well-established services that are part of certain use cases; rarely is it possible to fully satisfy complex use cases by applying MapReduce alone. There could be Elasticsearch for search or a Cassandra cluster for indexing. These and other complementary components of a Hadoop cluster, like HBase, Storm, or Hive, bring the burden of additional complexity when it comes to cluster planning, management, or monitoring. Think, for example, of the memory planning for a DataNode that also runs Cassandra: you would have to decide upfront how much of the available memory to allocate to each. And what happens as you remove Cassandra nodes from the cluster or add new ones?
YARN was designed to manage different sets of workloads on a Hadoop setup besides MapReduce. So with modern Hadoop installations the way to deal with the above challenges is to port the needed services to YARN. Some of the common services have been or are being ported to YARN in a YARN-Ready program led by Hortonworks. As porting existing services to YARN can be quite challenging by itself, Apache Slider (incubating) was developed to let YARN manage long-running services without requiring any changes to them. Apache Slider’s promise is to run these applications inside YARN unchanged.
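To illustrate what that means in practice, here is a minimal sketch of the Slider CLI, assuming a hypothetical HBase application package with appConfig.json and resources.json definitions:
# deploy a packaged application as a Slider application instance on YARN
slider create hbase1 --template appConfig.json --resources resources.json
# grow or shrink a component without touching the installed bits
slider flex hbase1 --component HBASE_REGIONSERVER 5
# stop and later resume the whole application instance
slider stop hbase1
slider start hbase1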
Continue reading “Sliding Applications onto YARN”
Provisioning a HDP Dev Cluster with Vagrant
Setting up a production or development Hadoop cluster used to be much more tedious than it is today with tools like Puppet, Chef, and Vagrant. Additionally, the Hadoop community kept investing in the ease of deployments, listening to the demands of experienced system administrators. The latest of these investments is Ambari Blueprints.
With Ambari Blueprints, dev-ops are able to configure an automated setup of individual components on each node across a cluster. This can then be reused to replicate the setup on different clusters for development, integration, or production.
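To give an impression of the workflow, a blueprint and a matching host mapping are simply posted to the Ambari REST API. A minimal sketch, with host name, credentials, and file names as placeholders:
# register the blueprint with the Ambari server
curl -u admin:admin -H "X-Requested-By: ambari" -X POST -d @hdp-blueprint.json http://ambari.example.com:8080/api/v1/blueprints/hdp-dev
# instantiate a cluster from the blueprint using a host mapping (cluster creation template)
curl -u admin:admin -H "X-Requested-By: ambari" -X POST -d @hostmapping.json http://ambari.example.com:8080/api/v1/clusters/hdp-dev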
In this post we are going to set up a three-node HDP 2.1 cluster for development on a local machine using Vagrant and Ambari.
Most of what is presented here builds on previous work published by various authors, which is referenced at the end of this post. Continue reading “Provisioning a HDP Dev Cluster with Vagrant”
Hadoop Security: 10 Resources To Get You Started
As Hadoop moves into the center of today’s enterprise data architecture, security becomes a critical requirement. This can be witnessed by the most recent acquisitions of leading Hadoop vendors and also by the numerous security-focused projects that have been launched or have gained traction recently.
Here are 10 resources to get you started on the topic:
- Hadoop Security Design (2009 White Paper)
- Hadoop Security Design? – Just Add Kerberos? Really? (Black Hat 2010)
- Hadoop Poses a Big Data Security Risk: 10 Reasons Why
- Apache Knox – A gateway for Hadoop clusters
- Apache Argus
- Project Rhino
- Protegrity Big Data Protector
- Dataguise for Hadoop
- Secure JDBC and ODBC Clients’ Access to HiveServer2
- InfoSphere Optim Data Masking
Further Readings
Responsive D3.js Modules with AngularJS
Trying to build a modular web application for data visualization using D3.js can be quite daunting. Certainly D3 offers event listeners, but arranging them in reusable modules for the requirements of today’s interactive applications seems tedious. In such a scenario AngularJS can be of great help in creating responsive visualizations for the web. By using AngularJS directives, nothing but web standards like HTML, CSS, and SVG is needed to build powerful data-driven applications.
To demonstrate the possibilities of integrating D3.js as a directive in an AngularJS analytics dashboard, we are going to plot some access statistics of this blog, which I exported from an analytics tool beforehand. Continue reading “Responsive D3.js Modules with AngularJS”
Iron Blogger: In for a Perfect Game
Since this year’s re:publica 2014 conference I have been part of the Iron Blogger community here in Munich. Like many good decisions this was a spontaneous move, motivated by a talk a former co-worker gave at the conference. His topic was winning back the Internet by decentralizing content – a core concept of the Internet – away from centralized providers. By relying on third-party products like Facebook, Tumblr, Twitter, and G+ we lose the autonomy over the content we publish.
While this is a genuine intent, it was not the driving force behind my motivation to become an Iron Blogger. All it took to get me involved was a side note of him mentioning that he had started to write more since joining the Iron Blogger movement. This caught my attention, and I was curious to find out more, as I had long wanted to publish more posts myself. It took me no more than a couple of minutes to decide that I wanted to try this. Here I would like to share my experience so far and my newly set goal – In for a Perfect Game. Continue reading “Iron Blogger: In for a Perfect Game”
Training Multiple SVM Classifiers with Apache Pig
Inspired by Twitter‘s publication about “Large Scale Machine Learning”, I turned to Pig when it came to implementing an SVM classifier for record linkage. Searching for different solutions, I also came across a presentation by the Huffington Post using a similar approach to training multiple SVM models. The overall idea is to use Hadoop to train multiple models with different parameters at the same time and to select the best model for the actual classification. There are some limitations to this approach, which I’ll try to address at the end of this post, but first let me describe my approach to training multiple SVM classifiers with Pig.
Disclaimer: This post does not describe the process of training one model in parallel, but rather training multiple models at the same time on multiple machines.
Continue reading “Training Multiple SVM Classifiers with Apache Pig”