Collection of HDP Vagrant Scripts

After writing about provisioning a Hadoop cluster with Vagrant, I started a collection of cluster setups using the HDP distribution. The examples use different versions, operating systems, Vagrant providers, and node sizes. With Ambari blueprints, different scenarios can be provisioned with a simple command. With this post I would like to share these scripts on GitHub. In addition, with the advent of HDP 2.2, two examples using the technical preview version of HDP were added to the repository: https://github.com/hkropp/vagrant-hdp

The naming convention for each environment is as follows:

{vagrant_provider}/{distribution}-{node_size}-{os}-{provisioner}
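For example, a hypothetical environment following this scheme could be named as shown below (the actual directories in the repository may differ):

virtualbox/hdp22-singlenode-centos6-ambari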

Each environment can be brought up and provisioned with one simple command:

vagrant up && ./install_blueprint.sh

As prerequisites, VirtualBox and Vagrant need to be installed.

The master_blueprint.json file contains the possible Ambari blueprint components and configurations.
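Under the hood, installing a cluster from a blueprint comes down to two calls against the Ambari REST API. As a rough sketch of what such an install script might do (host, credentials, and file names below are assumptions for illustration, not necessarily what install_blueprint.sh actually uses):

# register the blueprint with Ambari (default admin/admin credentials assumed)
curl -u admin:admin -H "X-Requested-By: ambari" -X POST -d @master_blueprint.json http://localhost:8080/api/v1/blueprints/hdp-blueprint

# create the cluster from the blueprint using a host mapping template
curl -u admin:admin -H "X-Requested-By: ambari" -X POST -d @cluster_template.json http://localhost:8080/api/v1/clusters/hdp-cluster

Ambari then provisions the requested components on the mapped hosts and reports the installation progress through its API and web UI.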

Examples on GitHub: https://github.com/hkropp/vagrant-hdp

Tech. Preview: HDP 2.2

The upcoming release of HDP 2.2 will contain some important forward-facing changes to the Hadoop platform. Together with its partners, Hortonworks is shaping the future of Big Data in an open community. Looking at some of the key new features gives a pretty clear picture of what the future of Hadoop is going to look like. The quickest way to get started now is to download the HDP Sandbox. Continue reading “Tech. Preview: HDP 2.2”

Securing Your Datalake With Apache Argus – Part 1

Apache Argus, an Apache open source project with a comprehensive security offering for today’s Hadoop installations, is likely to become an important cornerstone of modern enterprise Big Data architectures. It is already quite sophisticated compared to other product offerings.

Key aspects of Argus are Administration, Authorization, and Audit Logging, covering most security demands. In the future we might even see Data Protection (encryption) as well.

(Figure: Argus, a Comprehensive Approach)

Argus consists of four major components that, tied together, build a secure layer around your Hadoop installation. The Administration Portal, a web application, manages and accesses the Audit Server and the Policy Manager, two further core components of Apache Argus. On the client side, at Hadoop services like HiveServer2 or the NameNode, Argus installs specific agents that intercept requests and enforce the specified policies.

(Figure: Argus Architecture Overview)

A key aspect of Argus is that the clients do not have to query the Policy Manager on every single call, but instead refresh their policies at a certain interval. This improves scalability and also ensures that clients continue working even when the Policy Manager is down.
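To illustrate the pull model only (the endpoint, port, and interval below are made up for the sketch and do not reflect the actual agent implementation), it roughly corresponds to:

while true; do
  # fetch the latest policies; on failure keep enforcing the locally cached ones
  curl -sf http://policy-manager:6080/service/policies -o /tmp/policy_cache.json || echo "Policy Manager unreachable - using cached policies"
  sleep 30
done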

Let’s go ahead and install a recent version of Apache Argus using the HDP Sandbox 2.1. By installing the Policy Manager and the Hive and HDFS agents, you should get a pretty good idea of how Argus operates and a solid environment to test specific use cases.

In this part we will only install the Argus Policy Manager, synced with our OpenLDAP installation for user and group management. We will use our kerberized HDP Sandbox throughout this post. Continue reading “Securing Your Datalake With Apache Argus – Part 1”

Kerberized Hadoop Cluster – A Sandbox Example

The groundwork of any secure system installation is strong authentication: the process of verifying the identity of a user by checking known factors. Factors can be:

  1. Shared Knowledge
    A password or the answer to a question. It is the most common, and often the only, factor used by computer systems today.
  2. Biometric Attributes
    For example, fingerprints or an iris pattern.
  3. Items One Possesses
    A smart card or a phone. The phone is probably one of the most common factors in use today aside from shared knowledge.

A system that takes more than one factor into account for authentication is also known as a multi-factor authentication system. The value of knowing the identity of a user with a high degree of certainty cannot be overestimated.
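In a Hadoop cluster this strong authentication is provided by Kerberos: a user proves knowledge of a password (or possession of a keytab) to the KDC and receives a ticket that subsequent service calls present. As a minimal sketch (the principal and realm are hypothetical):

kinit alice@EXAMPLE.COM    # obtain a ticket-granting ticket from the KDC
klist                      # verify the ticket is in the credential cache
hdfs dfs -ls /             # Hadoop clients now authenticate with the cached ticket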

All other components of a secure environment, like Authorization, Audit, Data Protection, and Administration, heavily rely on strong authentication. Authorization or auditing only makes sense if the identity of a user cannot be compromised. In Hadoop today there exist solutions for nearly all aspects of an enterprise-grade security layer, especially with the advent of Apache Argus. Continue reading “Kerberized Hadoop Cluster – A Sandbox Example”