YARN Secure Container

In a restricted setup, YARN executes the tasks of computation frameworks like Spark in a secured Linux or Windows container. The tasks are executed in the local context of the user submitting the application, not in the local context of the yarn user or some other system user. This brings certain constraints for the system setup.

How is YARN actually able to impersonate the calling user at the local OS level? This post aims to give some background information to help answer such questions about secure containers. Only Linux systems are considered here, not Windows.


Sliding Apache Cassandra Onto YARN

With the most recent release of Hadoop (2.6) comes support for long-running applications on YARN. Apache Slider is a tool that supports you in creating, managing, and monitoring long-running applications, without necessarily changing anything about the way your application works. In a previous blog post I tried to go through the different aspects of long-running applications that Slider tries to address. You might also consider watching this webinar about using Slider.

Having started to release my Slider demo app, which uses Apache Cassandra, here, I would like to use this blog post to walk through some of the packaging steps required to make it run on a YARN cluster. For this example I use a three-node test cluster, which you can easily set up with this script using Vagrant and VirtualBox.

YARN Ready! Are You?

YARN is changing the face of Big Data as we know it today. Breaking with by-now well-established patterns like MapReduce, YARN gives clients the ability to run diverse distributed algorithms on one cluster under one resource management. In addition, Apache Slider makes it possible to ‘slide’ existing long-running services onto the same cluster with the same resource provider.

With the upcoming release of HDP 2.2, HBase services are running under the management of YARN. The same will be true for Storm, getting us one step closer to the vision of an Enterprise Hadoop Data Lake. One important aspect in this scenario is yet to come: YARN-796, aka YARN Labels.

Recently the technical preview of Hortonworks Data Platform 2.2 was released with a Sandbox image for download, giving you the possibility to try out the concepts of tomorrow’s Big Data platform today with these tutorials. You can also try some of the virtual environments I’ve put together here: hdp22-n1-centos6-puppet or hdp22-n3-centos6-puppet.

We are experiencing the dawn of a new era in Hadoop, Hadoop v2. Are you YARN ready? Here are some resources to get you going:

Apache Hadoop YARN: Moving Beyond MapReduce and Batch Processing with Apache Hadoop 2 (Amazon)
Simple YARN Application (Github)
YARN Word Count Example: The Distributed Shell (Github)
YARN Ready Webinars:
- Integrating to YARN natively (part 1) ( video / slides )
- Integrating to YARN using Slider (part 2) ( video / slides )
- Integrating to YARN with Tez (part 3) ( video / slides )
- Using Ambari for Management ( video / slides )
- Developing Applications on Hadoop with Scalding ( video / slides )
- Using Spark to Integrate to YARN ( video / slides )

Sliding Applications onto YARN

Along with a Hadoop cluster installation usually come some well-established services which are part of certain use cases. Rarely is it possible to fully satisfy complex use cases by only applying MapReduce. There could be ElasticSearch for search or a Cassandra cluster for indexing. These and other complementary components of a Hadoop cluster, like HBase, Storm, or Hive, bring the burden of additional complexity when it comes to cluster planning, management, or monitoring. Think, for example, of the memory planning of a Datanode that is also running Cassandra. You would have to choose upfront how much of the available memory you allocate to each. And think of what happens as you add or remove Cassandra nodes in the cluster.
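To make the memory planning problem a bit more tangible, here is a minimal Python sketch of such an upfront, static split. The function name `static_memory_split`, the parameter names, and all numbers are made up for illustration; they are assumptions and not taken from any Hadoop or Cassandra tooling.

```python
def static_memory_split(total_gb, os_reserved_gb, cassandra_share):
    """Split a node's memory between Cassandra and Hadoop up front.

    Hypothetical helper for illustration only: all names and numbers
    are assumptions, not part of any Hadoop or Cassandra API.
    """
    if not 0.0 <= cassandra_share <= 1.0:
        raise ValueError("cassandra_share must be between 0 and 1")
    usable_gb = total_gb - os_reserved_gb
    if usable_gb <= 0:
        raise ValueError("no memory left after the OS reservation")
    # The share is fixed at planning time and does not follow the actual load.
    cassandra_gb = usable_gb * cassandra_share
    hadoop_gb = usable_gb - cassandra_gb
    return {"cassandra_gb": cassandra_gb, "hadoop_gb": hadoop_gb}


# Example node: 64 GB in total, 8 GB kept for the OS, a quarter for Cassandra.
plan = static_memory_split(total_gb=64, os_reserved_gb=8, cassandra_share=0.25)
print(plan)  # {'cassandra_gb': 14.0, 'hadoop_gb': 42.0}
```

The point of the sketch is that the split has to be fixed by hand for every node and revisited whenever a service is added to or removed from it.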

YARN was designed to manage different sets of workloads on a Hadoop setup besides MapReduce. So with modern Hadoop installations, the solution to the above challenges is to port the needed services to YARN. Some of the common services have been or are being ported to YARN in a YARN-Ready program led by Hortonworks. As porting existing services to YARN can be quite challenging on its own, Apache Slider (incubating) was developed to support long-running services on YARN without the requirement to make any changes. Apache Slider’s promise is to run these applications inside YARN unchanged.