Along with a Hadoop cluster installation usually come some well-established services that are part of certain use cases. Rarely is it possible to fully satisfy complex use cases by applying MapReduce alone. There could be Elasticsearch for search or a Cassandra cluster for indexing. These and other complementary components of a Hadoop cluster, like HBase, Storm, or Hive, bring the burden of additional complexity when it comes to cluster planning, management, and monitoring. Think, for example, of the memory planning for a DataNode that also runs Cassandra. You would have to decide upfront how much of the available memory to allocate to each. And what happens when you add or remove Cassandra nodes in the cluster?
YARN was designed to manage different sets of workloads on a Hadoop setup besides MapReduce. So with modern Hadoop installations, the solution to the above challenges is to port the needed services to YARN. Some of the common services have been or are being ported to YARN in a YARN-Ready program led by Hortonworks. As porting existing services to YARN can be quite challenging on its own, Apache Slider (incubating) was developed to let YARN host long-running services without requiring any changes to them. Apache Slider’s promise is to run these applications inside YARN unchanged.
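Slider describes how much of the cluster’s resources each component of such a service may claim in a resources.json file. The sketch below is a rough illustration of that idea for a hypothetical two-role service; the exact keys and values depend on the Slider version and the packaged application, so treat the role names and numbers here as placeholders.

```json
{
  "schema": "http://example.org/specification/v2.0.0",
  "components": {
    "MASTER": {
      "yarn.component.instances": "1",
      "yarn.memory": "1024",
      "yarn.vcores": "1"
    },
    "WORKER": {
      "yarn.component.instances": "3",
      "yarn.memory": "2048",
      "yarn.vcores": "1"
    }
  }
}
```

Because YARN owns these allocations, growing or shrinking a role becomes a matter of changing an instance count instead of re-planning memory on every node by hand.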
Continue reading “Sliding Applications onto YARN” →
Setting up a production or development Hadoop cluster used to be much more tedious than it is today with tools like Puppet, Chef, and Vagrant. Additionally, the Hadoop community has kept investing in the ease of deployments, listening to the demands of experienced system administrators. The latest of these investments is Ambari Blueprints.
With Ambari Blueprints, operators can configure an automated setup of individual components on each node across a cluster. This configuration can then be reused to replicate the setup on different clusters for development, integration, or production.
In this post we are going to set up a three-node HDP 2.1 cluster for development on a local machine using Vagrant and Ambari.
Most of what will be presented here builds on previous work published by various authors, which is referenced at the end of this post. Continue reading “Provisioning a HDP Dev Cluster with Vagrant” →
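A blueprint itself is a JSON document that declares which components land in which host group, plus the stack to install. A minimal sketch for a small dev cluster might look like the following; the host group layout, cardinalities, and blueprint name are illustrative, not taken from the post.

```json
{
  "host_groups": [
    {
      "name": "master",
      "components": [
        { "name": "NAMENODE" },
        { "name": "RESOURCEMANAGER" }
      ],
      "cardinality": "1"
    },
    {
      "name": "worker",
      "components": [
        { "name": "DATANODE" },
        { "name": "NODEMANAGER" }
      ],
      "cardinality": "2"
    }
  ],
  "Blueprints": {
    "blueprint_name": "hdp-dev",
    "stack_name": "HDP",
    "stack_version": "2.1"
  }
}
```

Posting a blueprint like this to the Ambari REST API, together with a mapping of host groups to concrete hosts, is what triggers the automated installation.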
As Hadoop moves into the center of today’s enterprise data architecture, security becomes a critical requirement. This can be witnessed in the recent acquisitions by leading Hadoop vendors, and also in the numerous security-centered projects that have been launched or are gaining traction.
Here are 10 resources to get you started on the topic:
- Hadoop Security Design (2009 White Paper)
- Hadoop Security Design? – Just Add Kerberos? Really? (Black Hat 2010)
- Hadoop Poses a Big Data Security Risk: 10 Reasons Why
- Apache Knox – A gateway for Hadoop clusters
- Apache Argus
- Project Rhino
- Protegrity Big Data Protector
- Dataguise for Hadoop
- Secure JDBC and ODBC Clients’ Access to HiveServer2
- InfoSphere Optim Data Masking
Trying to build a modular web application for data visualization with D3.js can be quite daunting. Certainly D3 offers event listeners, but arranging them in reusable modules that meet the requirements of today’s interactive applications seems tedious. In such a scenario AngularJS can be of great help in creating responsive visualizations for the web. Using AngularJS directives, nothing but web standards like HTML, CSS, and SVG is needed to build powerful data-driven applications.
To demonstrate the possibilities of integrating D3.js as a directive in an AngularJS analytics dashboard, we are going to plot some access statistics of this blog, which I’ve exported from an analytics tool beforehand. Continue reading “Responsive D3.js Modules with AngularJS” →
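The reusable-module idea behind such directives can be sketched in plain JavaScript. The `barChart` closure below follows the getter/setter configuration pattern popular in the D3 community; it stands in for what a directive’s link function would call and, to stay self-contained, computes bar geometry instead of rendering SVG. The name `barChart` and the data shape are illustrative, not from the post.

```javascript
// A reusable chart module in the closure/getter-setter style common in D3 code.
// Rendering is stubbed out: instead of drawing SVG, the module returns the
// computed bar widths so the logic stays testable without a browser.
function barChart() {
  let width = 720; // default pixel width, overridable via the accessor below

  function chart(data) {
    // Scale each value linearly so the largest bar spans the full width.
    const max = Math.max(...data);
    return data.map(value => ({ value, barWidth: (value / max) * width }));
  }

  // The accessor doubles as getter (no argument) and setter (returns chart
  // to allow method chaining, e.g. barChart().width(960)).
  chart.width = function (value) {
    if (value === undefined) return width;
    width = value;
    return chart;
  };

  return chart;
}

// Inside an AngularJS directive, a scope watcher would re-invoke such a
// module whenever the bound data changes, keeping D3 code out of controllers.
const bars = barChart().width(100)([1, 2, 4]);
```

Wrapping this module in a directive then reduces the markup to a custom element or attribute, with the chart reacting to scope changes like any other Angular binding.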
Since this year’s re:publica 2014 conference I have been part of the Iron Blogger community here in Munich. Like many good decisions this was a spontaneous move, motivated by a talk a former co-worker gave at the conference. His topic was winning back the Internet by taking content – a core concept of the Internet – back from centralized providers and decentralizing it. By relying on third-party products like Facebook, Tumblr, Twitter, and G+ we lose autonomy over the content we publish.
While this is a genuine intent, it was not the driving force behind my motivation to become an Iron Blogger. All it took to get me involved was a side note of him mentioning that he had recently started to write more, after joining the Iron Blogger movement. This caught my attention and I was curious to find out more, as I had long wanted to publish more posts myself. It took me no more than a couple of minutes to decide that I wanted to try this. Here I would like to share my experience so far and my newly set goal – In for a Perfect Game. Continue reading “Iron Blogger: In for a Perfect Game” →