Provisioning an HDP Dev Cluster with Vagrant

Setting up a production or development Hadoop cluster used to be much more tedious than it is today with tools like Puppet, Chef, and Vagrant. In addition, the Hadoop community has kept investing in ease of deployment, listening to the demands of experienced system administrators. The latest of these investments is Ambari Blueprints.

With Ambari Blueprints, operators can configure an automated setup of the individual components on each node across a cluster. That configuration can then be reused to replicate the setup on other clusters for development, integration, or production.

In this post we are going to set up a three-node HDP 2.1 cluster for development on a local machine using Vagrant and Ambari.
Most of what is presented here builds on previous work published by various authors, which is referenced at the end of this post.

HDP Setup with Vagrant

Vagrant lets you easily set up virtual environments, completely described in code. Although Vagrant uses Ruby, prior knowledge of the language is not required. Spinning up your virtual environment is as easy as running vagrant init  and vagrant up  from your command line. Vagrant runs your setup on VirtualBox, VMware, or any other supported provider.

The central components of a Vagrant setup are a Vagrantfile and a Box. While a Vagrantfile describes the individual setup in the Ruby language, a Box is a Vagrant package of such a system, including the bare (or pre-installed) image of the underlying operating system. Boxes can be published and shared. A resource for finding a Box you need is, for example, Vagrant Cloud.

Here we are using a CentOS 6.5 box with pre-installed Puppet, provided by Puppet Labs:

config.vm.box = "puppetlabs/centos-6.5-64-puppet"
config.vm.box_url = "http://developer.nrel.gov/downloads/vagrant-boxes/CentOS-6.4-x86_64-v20130731.box"

For provisioning, Vagrant can be used with Shell, Chef, or Puppet, among others; they can even be combined. In the setup described here we are going to use Puppet as our provisioning system to set up HDP 2.1. Each node has its own Puppet manifest:

one.vm.provision "puppet" do |puppet|
  puppet.manifests_path = "manifest"
  puppet.module_path = "modules"
  puppet.manifest_file = "one.pp"
end

For further details about Vagrant and how to install it on your system, please refer to the official Vagrant documentation.

For the purposes of this example we want a three-node CentOS cluster, which we achieve with Vagrant's multi-machine setup. On node one we install the Ambari server, which requires a slightly different Puppet script, as you will see later. We also forward the Ambari port (8080) from the guest system to the host. This is the complete Vagrantfile used:

# Vagrantfile API/syntax version. Don't touch unless you know what you're doing!
VAGRANTFILE_API_VERSION = "2"

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|

  config.vm.box = "puppetlabs/centos-6.5-64-puppet"
  config.vm.box_url = "http://developer.nrel.gov/downloads/vagrant-boxes/CentOS-6.4-x86_64-v20130731.box"
  
  config.vm.synced_folder "ssh", "/root/.ssh"

  config.vm.define :one do |one| 
    one.vm.hostname = "one.cluster"
    one.vm.network :private_network, ip: "192.168.0.101"
    one.vm.provider :virtualbox do |vb|
      vb.customize ["modifyvm", :id, "--memory", 2048]
    end

    one.vm.network "forwarded_port", guest: 8080, host: 8080

    one.vm.provision "puppet" do |puppet|
      puppet.manifests_path = "manifest"
      puppet.module_path = "modules"
      puppet.manifest_file = "one.pp"
    end
    
    one.vm.provision "shell" do |s|
      s.inline = "sudo chmod 600 /root/.ssh"
    end
  end

  config.vm.define :two do |two| 
    two.vm.hostname = "two.cluster"
    two.vm.network :private_network, ip: "192.168.0.102"
    two.vm.provider :virtualbox do |vb|
      vb.customize ["modifyvm", :id, "--memory", 2048]
    end

    two.vm.provision "puppet" do |puppet|
      puppet.manifests_path = "manifest"
      puppet.module_path = "modules"
      puppet.manifest_file = "two.pp"
    end
    
    two.vm.provision "shell" do |s|
      s.inline = "sudo chmod 600 /root/.ssh"
    end
  end

  config.vm.define :three do |three| 
    three.vm.hostname = "three.cluster"
    three.vm.network :private_network, ip: "192.168.0.103"
    three.vm.provider :virtualbox do |vb|
      vb.customize ["modifyvm", :id, "--memory", 2048]
    end

    three.vm.provision "puppet" do |puppet|
      puppet.manifests_path = "manifest"
      puppet.module_path = "modules"
      puppet.manifest_file = "three.pp"
    end
  end

end

Provisioning HDP 2.1

Installing an HDP cluster using Ambari can be achieved by following this documentation step by step. Here we want to automate the whole process. According to the documentation, we first need to install the ntp  service, disable iptables  as it might interfere with our services, and finally install Apache Ambari. During this setup we have to make sure networking is configured correctly: the hosts need to be able to discover each other, either through a proper DNS setup or through the /etc/hosts  file. As we need to apply this to each host separately, we place these steps into separate Puppet modules that can be reused. The Puppet modules used here are interfering_services, ntp, and etchosts.

The interfering_services  Puppet Module

Here we disable iptables and PackageKit, both of which can interfere with the installation.

class interfering_services {
  # Disable PackageKit (it can hold the yum lock during installs)
  file { 'packageKit':
    path    => "/etc/yum/pluginconf.d/refresh-packagekit.conf",
    ensure  => "present",
    replace => true,
    content => "enabled=0"
  }

  # Stop iptables
  exec { "stop_ip_tables":
    path    => ["/bin/", "/sbin/", "/usr/bin/", "/usr/sbin/"],
    command => "service iptables stop"
  }

  # Stop ip6tables
  exec { "stop_ip_tables6":
    path    => ["/bin/", "/sbin/", "/usr/bin/", "/usr/sbin/"],
    command => "service ip6tables stop"
  }
}

The ntp  Puppet Module

This module installs the ntp time service on each node and makes sure it is running.

class ntp {
  package { 'ntp':
    name   => "ntp",
    ensure => present
  }

  service { 'ntp-services':
    name   => "ntpd",
    ensure => running,
    require => Package[ntp] 
  }
}

The etchosts  Puppet Module

# Ensure that the machines in the cluster can find each other without DNS
class etchosts ($ownhostname) {
host { 'host_one':
    name         => 'one.cluster',
    host_aliases => ['one'],
    ip           => '192.168.0.101',
  }

  host { 'host_two':
    name         => 'two.cluster',
    host_aliases => ['two'],
    ip           => '192.168.0.102',
  }

  host { 'host_three':
    name         => 'three.cluster',
    host_aliases => ['three'],
    ip           => '192.168.0.103',
  }

  file { 'agent_hostname':
    path    => "/etc/hostname",
    ensure  => present,
    replace => true,
    content => "${ownhostname}", # own hostname
    owner   => 1546
  }

  file { 'agent_sysconfig':
    path    => "/etc/sysconfig/network",
    ensure  => present,
    replace => true,
content => "NETWORKING=yes\nHOSTNAME=${ownhostname}\n" # own hostname
  }
}
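For illustration, the mapping that the three host  resources above encode can be sketched in a few lines of Python. The rendered layout below is an assumption about the resulting /etc/hosts  entries, not output captured from Puppet:

```python
# Host map mirroring the three Puppet host resources above:
# canonical name -> (IP, aliases)
hosts = {
    "one.cluster": ("192.168.0.101", ["one"]),
    "two.cluster": ("192.168.0.102", ["two"]),
    "three.cluster": ("192.168.0.103", ["three"]),
}

def render_hosts(entries):
    """Render host entries as /etc/hosts lines: IP, canonical name, aliases."""
    lines = []
    for name, (ip, aliases) in entries.items():
        lines.append("\t".join([ip, name] + aliases))
    return "\n".join(lines) + "\n"

print(render_hosts(hosts))
```

With this mapping in place on every node, the hosts can resolve each other even without DNS, which is what Ambari requires for agent registration.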

Installing Ambari

All three nodes will run an Ambari agent, while node one additionally runs the Ambari server. Here again we are going to use Puppet modules to provision the Ambari server and agents onto the nodes.

The ambari_server  Puppet Module

First we need to set up the repository from which we want to install Ambari. Here we download the repo file from public-repo-1.hortonworks.com  into the yum repository list, then install the Ambari server package and run the setup.

class ambari_server ($ownhostname) {
  Exec {
    path => ["/bin/", "/sbin/", "/usr/bin/", "/usr/sbin/"] }

  # Ambari Repo
  exec { 'get-ambari-server-repo':
    command => "wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.6.1/ambari.repo",
    cwd     => '/etc/yum.repos.d/',
    creates => '/etc/yum.repos.d/ambari.repo',
    user    => root
  }

  # Ambari Server
  package { 'ambari-server':
    ensure  => present,
    require => Exec['get-ambari-server-repo']
  }

  exec { 'ambari-setup':
    command => "ambari-server setup -s",
    user    => root,
    require => Package['ambari-server']
  }

  service { 'ambari-server':
    ensure  => running,
    start   => "ambari-server start",
    status  => "ambari-server status",
    require => [Package['ambari-server'], Exec['ambari-setup']]
  }
}

The ambari_agent  Puppet Module

As with the server, for the agent we first need to set up the repository before installing and starting the agent itself. Note that the file_line  resource used below comes from the puppetlabs-stdlib  module.

class ambari_agent ($ownhostname, $serverhostname) {
  Exec {
    path => ["/bin/", "/sbin/", "/usr/bin/", "/usr/sbin/"] }


  # Ambari Repo
  exec { 'get-ambari-agent-repo':
    command => "wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.6.1/ambari.repo",
    cwd     => '/etc/yum.repos.d/',
    creates => '/etc/yum.repos.d/ambari.repo',
    user    => root
  }

  # Ambari Agent
  package { 'ambari-agent':
    ensure  => present,
    require => Exec['get-ambari-agent-repo']
  }

  # Point the agent at the Ambari server
  file_line { 'ambari-agent-ini-hostname':
    ensure  => present,
    path    => '/etc/ambari-agent/conf/ambari-agent.ini',
    line    => "hostname=${serverhostname}", # server host name
    match   => '^hostname=',
    require => Package['ambari-agent']
  }

  exec { 'hostname':
    command => "hostname ${ownhostname}", # own host name
    user    => root
  }

  exec { 'ambari-agent-start':
    command => "ambari-agent start",
    user    => root,
    require => [Package['ambari-agent'], Exec['hostname'], File_line['ambari-agent-ini-hostname']],
    onlyif  => 'ambari-agent status | grep "not running"'
  }
}
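The replace-or-append behaviour of file_line  can be illustrated with a short Python sketch. The sample ini content here is a simplified assumption of what /etc/ambari-agent/conf/ambari-agent.ini  contains, not a verbatim copy:

```python
import re

def set_ini_hostname(text, server):
    """Replace the hostname= line, or append one if absent --
    roughly the behaviour of the file_line resource above."""
    pattern = re.compile(r"^hostname=.*$", re.MULTILINE)
    line = "hostname=%s" % server
    if pattern.search(text):
        return pattern.sub(line, text)
    return text.rstrip("\n") + "\n" + line + "\n"

# Simplified, assumed ini content for illustration
ini = "[server]\nhostname=localhost\nurl_port=8440\n"
print(set_ini_hostname(ini, "one.cluster"))
```

Either way, after the run the agent configuration points at the Ambari server host, which is what allows the agents on all three nodes to register with the server on node one.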

Having set up these modules, we can now easily reference them from the provisioning scripts of nodes one, two, and three. The scripts for nodes two and three are almost identical.

Puppet script node one:

include interfering_services

# Install and enable ntp
include ntp

# Ensure that servers can find themselves even in absence of dns
class { 'etchosts':
  ownhostname => 'one.cluster'
}

# Install and enable ambari server
class { 'ambari_server':
  ownhostname => 'one.cluster'
}

# Install and enable ambari agent
class { 'ambari_agent':
  ownhostname    => 'one.cluster',
  serverhostname => 'one.cluster'
}

# Establish ordering
Class['interfering_services'] -> Class['ntp'] -> Class['etchosts'] -> Class['ambari_server'] -> Class['ambari_agent']

Puppet script for nodes two and three (shown for node two; node three uses three.cluster  as its own hostname):

include interfering_services

# Install and enable ntp
include ntp

# Ensure that servers can find themselves even in absence of dns
class { 'etchosts':
  ownhostname => 'two.cluster' # 'three.cluster' on node three
}


class { 'ambari_agent':
  serverhostname => "one.cluster",
  ownhostname    => "two.cluster"
}

# Establish ordering
Class['interfering_services'] -> Class['ntp'] -> Class['etchosts'] -> Class['ambari_agent']

From here we are already able to provision the complete Hadoop cluster using Ambari's guided installation process. Just point your browser to localhost:8080  and log in with admin  as both user name and password.

An even better way is to use Ambari Blueprints to provision the complete cluster automatically.
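To give an idea of what that involves, here is a minimal Python sketch of a blueprint document for a small cluster like ours. The host-group layout and component names are illustrative assumptions, not a configuration taken from or tested in this post:

```python
import json

# Illustrative Ambari blueprint: one master host group and one
# worker host group. The component selection here is an assumption.
blueprint = {
    "Blueprints": {"stack_name": "HDP", "stack_version": "2.1"},
    "host_groups": [
        {"name": "master", "cardinality": "1",
         "components": [{"name": "NAMENODE"}, {"name": "RESOURCEMANAGER"}]},
        {"name": "worker", "cardinality": "2",
         "components": [{"name": "DATANODE"}, {"name": "NODEMANAGER"}]},
    ],
}

print(json.dumps(blueprint, indent=2))
```

A blueprint like this is registered with the Ambari REST API (a POST to the blueprints endpoint), after which a separate cluster-creation template maps each host group to the actual hosts, letting Ambari install the whole cluster without clicking through the wizard.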

Further Readings
