Sliding Apache Cassandra Onto YARN

The most recent release of Hadoop (2.6) adds support for long-running applications on YARN. Apache Slider is a tool that helps you create, manage, and monitor long-running applications without necessarily changing anything about the way your application works. In a previous blog post I went through the different aspects of long-running applications that Slider tries to address. You might also consider watching this webinar about using Slider.

I have started to release my Slider demo app, which uses Apache Cassandra, here. In this blog post I would like to walk through some of the packaging steps required to make it run on a YARN cluster. For this example I use a three-node test cluster, which you can easily set up with this script using Vagrant and VirtualBox.

Slider App Packages

Slider itself already comes with some useful application examples you can use as a blueprint for whatever application you want to package. These examples range from the very basic case of running Memcached to more complex implementations like Storm. In HDP 2.2 Apache Slider is used to deploy Storm and HBase on a YARN cluster, which is also why you find them as examples alongside Slider's source code.

The general layout of a Slider application package can be seen below:

.
|-- appConfig.json
|-- configuration
|   `-- global.xml
|-- metainfo.xml
|-- package
|   |-- files
|   |   `-- apache-cassandra-1.2.19.tar.gz
|   |-- scripts
|   |   |-- __init__.py
|   |   |-- cassandra.py
|   |   `-- params.py
|   `-- templates
|       |-- cassandra-env.sh.2j
|       |-- cassandra.yaml.2j
|       `-- log4j-server.properties.2j
`-- resources.json

5 directories, 11 files

The package directory holds the application-specific files. Everything outside of it configures general meta information about the app. A general description of your application is contained in metainfo.xml. Here we have specified two components: CASSANDRA_ONE and CASSANDRA_TWO. Both of these components use the cassandra.py script to manage the deployed service.

The resource requirements of your application in a YARN context go into the resources.json file. Each component gets a unique yarn.role.priority assigned to it.
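As a rough illustration of the uniqueness requirement, the following Python sketch builds a resources.json-like structure and checks that no two components share a priority. Only the component names and yarn.role.priority come from this post; the remaining keys are assumptions modeled on the examples bundled with Slider, and the instance counts and memory sizes are placeholder values, not the ones used in the demo app.

```python
import json

# Hypothetical resource settings; keys other than yarn.role.priority and
# all numbers are assumptions for illustration.
resources = {
    "schema": "http://example.org/specification/v2.0.0",
    "metadata": {},
    "global": {},
    "components": {
        "slider-appmaster": {},
        "CASSANDRA_ONE": {
            "yarn.role.priority": "1",
            "yarn.component.instances": "1",
            "yarn.memory": "512",
        },
        "CASSANDRA_TWO": {
            "yarn.role.priority": "2",
            "yarn.component.instances": "1",
            "yarn.memory": "512",
        },
    },
}


def duplicate_priorities(spec):
    """Return the priorities that are assigned to more than one component."""
    seen, duplicates = set(), set()
    for settings in spec["components"].values():
        priority = settings.get("yarn.role.priority")
        if priority is None:
            continue  # entries without a priority are skipped
        if priority in seen:
            duplicates.add(priority)
        seen.add(priority)
    return duplicates


# Serialize the structure the way it would be written to resources.json.
resources_json = json.dumps(resources, indent=2)
```

An empty result from duplicate_priorities(resources) means every component has its own priority.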

Central configuration for your application run by Slider is provided in appConfig.json. Parameters defined here are later also available in the scripts used to start, stop, and configure the given service. Within this context you also have access to variables exposed by Slider, like ${THIS_HOST}, which returns the host the specific component of your application is deployed on. application.def is the name of the zip package deployed in HDFS. You will see later what this means when we deploy Cassandra onto the YARN cluster.

Your application and all the necessary dependencies need to be included as a tarball under files. In this case I downloaded Apache Cassandra and placed it directly under the files folder.

Deploying Your App Package

Slider will place your application dependencies on each host where it plans to install one or more of your components. After that, the given script will be invoked to install, configure, start, stop, and query the status of the given component. Here each component is managed by cassandra.py, a Python script located under scripts.
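The lifecycle of such a script can be pictured as a class that exposes one method per command. The sketch below is a self-contained stand-in: the LifecycleScript base class, its execute dispatcher, and the CassandraSketch class are hypothetical and only mimic the idea. The real cassandra.py builds on Slider's resource_management library instead.

```python
class LifecycleScript(object):
    """Hypothetical stand-in for the base class a Slider script would extend."""

    COMMANDS = ("install", "configure", "start", "stop", "status")

    def execute(self, command):
        # Dispatch a command name to the method of the same name.
        if command not in self.COMMANDS:
            raise ValueError("unknown command: %s" % command)
        return getattr(self, command)()


class CassandraSketch(LifecycleScript):
    """Records which lifecycle steps ran instead of touching a real node."""

    def __init__(self):
        self.installed = False
        self.running = False
        self.log = []

    def install(self):
        # A real script would unpack the tarball from package/files here.
        self.installed = True
        self.log.append("install")

    def configure(self):
        # A real script would render the configuration templates here.
        self.log.append("configure")

    def start(self):
        # Configure first so the service starts with current settings.
        self.configure()
        self.running = True
        self.log.append("start")

    def stop(self):
        self.running = False
        self.log.append("stop")

    def status(self):
        self.log.append("status")
        return "running" if self.running else "stopped"


node = CassandraSketch()
for command in ("install", "start", "status"):
    node.execute(command)
```

Keeping one method per command makes it easy to see which step is responsible for which side effect on the host.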

Any non-trivial application script will probably need some information about the current environment, for example to adjust specific configuration files. For one, Slider gives you the resource_management library, which helps you access the provided configuration parameters. For example, you can read any configuration given in appConfig.json using:

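from resource_management import *  # library shipped with the Slider agent

# Read the configuration Slider passes to this command
config = Script.get_config()
java64_home = config['hostLevelParams']['java_home']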
....
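To make the lookup concrete without a running Slider agent, the sketch below uses a plain dictionary in place of the object returned by Script.get_config(). Only the hostLevelParams/java_home path comes from the snippet above. The configurations/global section, the cluster_name key, the lookup helper, and all values are assumptions for illustration.

```python
# Stand-in for the result of Script.get_config(); all values are made up.
config = {
    "hostLevelParams": {"java_home": "/usr/lib/jvm/java"},
    "configurations": {
        "global": {"cluster_name": "Test Cluster"},  # assumed section name
    },
}


def lookup(cfg, path, default=None):
    """Walk a nested dictionary along path, returning default if a key is missing."""
    node = cfg
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


java64_home = lookup(config, ["hostLevelParams", "java_home"])
cluster_name = lookup(config, ["configurations", "global", "cluster_name"])
missing = lookup(config, ["configurations", "global", "no_such_key"], default="n/a")
```

A helper like this avoids a KeyError when an optional parameter is absent from appConfig.json.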

In addition, Slider gives you multiple ways to change XML, YAML, properties files, or any other configuration file. If you provide configuration files as templates, Slider lets you parameterize them from the script. Here is an example of configuring cassandra.yaml to our needs:

TemplateConfig(
    format("{conf_dir}/cassandra.yaml"),
    owner = params.cassandra_user
)

Within the template we have access to all the given parameters:

# Cassandra storage config YAML
cluster_name: '{{cluster_name}}'

....
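Conceptually, rendering such a template means replacing each {{name}} placeholder with the matching parameter. The sketch below imitates that with a regular expression. It is not how Slider renders templates internally; the render function is a name chosen here and the parameter value is invented.

```python
import re

# Matches placeholders such as {{cluster_name}} or {{ cluster_name }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, params):
    """Replace every {{name}} placeholder with params[name]."""
    def substitute(match):
        name = match.group(1)
        if name not in params:
            raise KeyError("no value for template parameter: %s" % name)
        return str(params[name])
    return PLACEHOLDER.sub(substitute, template)


template = "# Cassandra storage config YAML\ncluster_name: '{{cluster_name}}'\n"
rendered = render(template, {"cluster_name": "Test Cluster"})
```

Failing loudly on a missing parameter is a deliberate choice in this sketch, since a half-rendered cassandra.yaml would be harder to debug than an error at configure time.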

Run On YARN

The demo application needs to be packaged as a zip file from within the directory structure. Navigate to the root directory of the app and issue the following command:

$ zip -r slider-cassandra-1.2.19_0.1.zip *
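If you would rather build the archive from a script, Python's standard zipfile module can produce the same kind of package. This is a sketch: build_package and its arguments are names chosen here, and paths are stored relative to the app root so that metainfo.xml sits at the top level of the archive.

```python
import os
import zipfile


def build_package(app_root, zip_path):
    """Zip everything under app_root, storing paths relative to app_root."""
    zip_abs = os.path.abspath(zip_path)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for directory, _subdirs, files in os.walk(app_root):
            for name in sorted(files):
                full_path = os.path.join(directory, name)
                if os.path.abspath(full_path) == zip_abs:
                    continue  # do not add the archive to itself
                archive.write(full_path, os.path.relpath(full_path, app_root))
    return zip_path
```

From the root directory of the app, a call such as build_package(".", "slider-cassandra-1.2.19_0.1.zip") would create the package, including the contents of all subdirectories.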

Throughout this example we will deploy the app using the existing hive user for convenience. After zipping the project, we first need to copy it to the virtual machine before uploading it into HDFS.

$ scp -i ~/.vagrant.d/insecure_private_key slider-cassandra-1.2.19_0.1.zip vagrant@192.168.33.102:
$ vagrant ssh three
$ chmod 755 .
$ sudo su
$ su -l hive -c "hdfs dfs -copyFromLocal /home/vagrant/slider-cassandra-1.2.19_0.1.zip /user/hive"

We can then use the Slider client to start the application on YARN:

$ unzip slider-cassandra-1.2.19_0.1.zip
$ su -l hive -c "slider create cl1 --template /home/vagrant/appConfig.json --resources /home/vagrant/resources.json"

This should start your application. In the ResourceManager (RM) UI you should then be able to see it running:

Figure: Slider example app being deployed on YARN

Demo App on GitHub: https://github.com/hkropp/slider-cassandra
