TensorFlow on YARN Using Slider

If you look at the way TensorFlow distributes it’s calculation across a cluster of processes, you will quickly ask how to schedule resources as part of a training workflow on large scale infrastructure. Many have turned to Spark as a resource manager for TrndorFlow, At the beginning quite a lot of folks have answered this question by wrapping an additional computational framework around TensorFlow, degrading the former to a distribution framework. Examples of such approaches can be found here and here. Both of them turn to Spark, which just like TensorFlow, is a computational distributed framework turning a set of statements into a DAG of execution. While this certainly would works a more straight forward approach would be to turn to a cluster managers like Mesos, Kubernetes, or namely YARN to distribute the workloads of a DeepLearning networking. Such an approach is also the suggested solution you would find in the TensorFlow documentation:

Note: Manually specifying these cluster specifications can be tedious, especially for large clusters. We are working on tools for launching tasks programmatically, e.g. using a cluster manager like Kubernetes. If there are particular cluster managers for which you’d like to see support, please raise a GitHub issue.

Turning to YARN for these kind of workloads admittedly does not come natural for most developers. For most developers YARN is closely linked to MR, Spark and
their like and still not really seen as a general purpose cluster management system. Additionally it is often criticized for it’s tedious API.

But YARN has been successfully used for various data applications all coexisting on the a shared infrastructure together with a vast amount of data. With TensorFlow on YARN you can start to leverage deep learning on top of your already existing data and workflows. Several on going efforts to support TensorFlow with YARN document the demand for a consolidated solution within the community.

YARN itself is undergoing some active development shaping it’s future as part of Hadoop 3.0. Noteworthy in the context of TensorFlow is the support for Docker containers and a simplified YARN API (Hadoop/YARN Assembly). In a recently published blog post here, Wangda Tan and Vinod Kumar Vavilapalli give a climbs of what TensorFlow on YARN will look like with these up coming changes to YARN.

This post uses Apache Slider, which will likely be discontinued in the future in favour of YARN Assembly, on top of a stable HDP distribution for training of a simple TensorFlow example. Part of this is available as of a Slider packages contributed through SLIDER-xxxx. The aim here is not to train a sophisticated model, but to demonstrate the feasibility of distributing TensorFlow on YARN today.

Resources needed for a distributed training workflow are a Parameters Services (PS) and Worker (worker). Additionally there is one special worker, necessary for example to execute all boot scraping steps required by all works just once, the Chiefworker (chiefworker). That is why the resource configuration for Slider will look similar like this:

Distributing TensorFlow requires each process participating in the training to have a complete specification of the cluster available. With YARN we will not know at the beginning on which host or port each of the processes will run. A key aspect of the distribution therefor will be to wait until all the resources have been allocated in the cluster.

Slider will help us with that as the allocated resources for each service can be queried via a REST API. In the above snipped, which will result in waiting for the complete allocation of containers, the function get_launched_instances() fetches the launched instances of parameters services and workers with three simple REST calls:

Once all the resources are available we can continue like with any other distributed scenario with TensorFlow. On all hosts the same script will be executed given the list of workers, a list of parameter servers, the index, and run mode.

Which training script we are executing is an application configuration parameter making it possible to distribute different kinds of training with the same Slider skeleton. The script is specified in appConfig.json as

Running workflows like this on a shared infrastructure often require individual users to control the complete environment eg. version of TensorFlow in which they perform the training. In the above scenario the same TensorFlow package needs to be installed on all hosts. There are a couple of solution to deal with the demand to support control of the execution environment. One solution could be to distribute the complete environment as part of the distribution of the workload, which could here for example be achieved by using python virtual env as the environment. A often preferred solution to this would be the use of a container environment like Docker to run the script inside this environment. To use Docker here simply replace the daemon_cmd provided here with a proper docker run execution.

TensorFlow executes reportedly much faster when run on GPU cores. YARN offers labeling to support a scenario where for example the workers will be executed on hosts with GPUs, while the parameter servers get scheduled on none GPU host. Labeling the hosts with GPU as ‘gpu’ and adding “yarn.label.expression”:”gpu” to the resource description of the workers in resoources.json would ensure that the workers are scheduled on GPU hosts, while the other processes would be executed wherever available resources are there.

Further Reading

Distributing TensorFlow

While at it’s core TensorFlow is a distributed computation framework besides the official HowTo there is little detailed documentation around the way TensorFlow deals with distributed learning. This post is an attempt to learn by example about TensorFlow’s distribution capabilities. Therefor the existing MNIST tutorial is taken and adapted into a distributed execution graph that can be executed on one or multiple nodes.

The framework offers two basic ways for distributed training of a model. In the simplest form the same data and computation graph is executed on multiple nodes in parallel on batches of the replicated data. This is known as Between-Graph Replication. Each worker updates the parameters of the same model, which means that each of the worker nodes are sharing a model. Updates to the shared model get averaged before being applied, this is at least the case for the synchronous training of a distributed model. In case of an asynchronous training the workers update the shared model parameters independently of each other. While the asynchronous training is known to be faster, the synchronous training proofs to provide more accuracy.
Continue reading “Distributing TensorFlow”

YARN Secure Container

In a restricted setup YARN executes task of computation frameworks like Spark in a secured Linux or Window Container. The task are being executed in the local context of the user submitting the application and are not being executed in the local context of the yarn or some other system user. With this come certain constraints for the system setup.

How is YARN actually able to impersonate the calling user on the local OS level? This posts aims to give some background information to help answer such questions about secure containers. Only Linux systems are considered here, no Windows.

Continue reading “YARN Secure Container”

TensorFlow: Further Reading

Some collection of papers and work around deep distributed learning to deepen once understanding in that topic:

Large Scale Distributed Deep Networks (link) (December, 2012)
Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Andrew Y. Ng

This paper published, among other contributes, by Jeffrey Dean together with Andrew NG probably marks the cornerstone to TensorFlow as it is today.

Efficient Estimation of Word Representations in Vector Space (link) (Januar 2013)
Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean

The fundamental work around projects like Word2Vec is presented in this paper, where vector representation of words for similarity trained by a neural net is being described.

Sequence to Sequence Learning with Neural Networks (link) (September 2014)
Ilya Sutskever, Oriol Vinyals, Quoc V. Le

The work around sequence to sequence learning is actually quite old. Which seems like a fairly abstract problem to solve has recently proved to significantly improve for example speech to text recognition among other disciplines.

Show and Tell: A Neural Image Caption Generator (link) (November 2014)
Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan

Another area were the above described concept of sequence to sequence learning is described is the exploration of images. In this case the input sequence is a bitmap of an image which is transferred to a text sequence describing the image. This marks a fundamental breakthrough in computer AI.

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems (link) (November 2015)
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Rafal Jozefowicz, Yangqing Jia, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Mike Schuster, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng

The TensorFlow Whitepaper [PDF]

Webinar: TensorFlow: A Framework for Scalable Machine Learning (link) (October 19, 2016)
Martin Wicke, Software Engineer at Google
Rajat Monga, Engineering Director at Google

Martin and Rajat, both software engineers for Google working on TensorFlow, walk through the architecture and design of TensorFlow throughout this webinar.


HDFS Storage Tier – Archiving to Cloud w/ S3

By default HDFS does not distinguish between different storage types hence making it difficult to optimize installations with heterogeneous storage devices. Since Hadoop 2.3 and the integration of HDFS-2832 HDFS supports placing block replicas on persistent tiers with different durability and performance requirements. Continue reading “HDFS Storage Tier – Archiving to Cloud w/ S3”

San Diego Brewery Field Trip

Recently I had the pleasure of traveling to San Diego the self-proclaimed American capital of Craft Beer. Of course I had to do a little research of my own while visiting this amazing city. Here is the short abstract of a tasteful field trip to some of the exceptional beer places in San Diego. Continue reading “San Diego Brewery Field Trip”

Installing HDP Search with Ambari

Ambari Management Packs are a new convenient way to integrate various services to the Ambari stack. As an example in this post we are using the Solr service mpack to install HDP on top of a newly installed cluster.

The HDP search mpack is available on the Hortonworks public repository for download. A mpack essentially is tar balls containing a mpack.json file specification and related binaries. Continue reading “Installing HDP Search with Ambari”

2016 in Numbers

Over two years ago in March 2014 I joined the Iron Blogger community in Munich, which is one of the largest, still active Iron Blogger communities worldwide. You can read more about my motivation behind it here in one of the 97 blog posts published to date: Iron Blogger: In for a Perfect Game.

The real fact is that I write blogs solely for myself. It’s my own technical reference I turn to. Additionally writing is a good way to improve once skills and technical capabilities, as Richard Guindon puts it in his famous quote:

“Writing is nature’s way of letting you know how sloppy your thinking is.”

What could be better suited to improve something than by leaning into the pain, how the great Aaron Swartz, who died way too early, once described it? And it is quite a bit of leaning into the pain publishing a blog post every week. Not only for me, but also for those close to me. But I am going to dedicate a separate blog post to a diligent retrospection in the near future. This post should all be about NUMBERS. Continue reading “2016 in Numbers”

Simple Spark Streaming & Kafka Example in a Zeppelin Notebook

Apache Zeppelin is a web-based, multi-purpose notebook for data discovery, prototyping, reporting, and visualization. With it’s Spark interpreter Zeppelin can also be used for rapid prototyping of streaming applications in addition to streaming-based reports.

In this post we will walk through a simple example of creating a Spark Streaming application based on Apache Kafka. Continue reading “Simple Spark Streaming & Kafka Example in a Zeppelin Notebook”

Kerberos Debug Notes

Some notes for Kerberos debugging in a secure HDP setup:

  1. Setting Debug Logs
    To enable debug logs in Java for Kerberos sun.security.krb5.debug needs to be set to true. Doing this for Hadoop can be done in the hadoop-env.sh file by adding it to the HADOOP_OPTS environment variable:

    Additionally the HADOOP_JAAS_DEBUG variable can be set also:

    Receiving traces in bash/shell can be enabled by setting the following environment variable:
  2. Testing auth_to_local Settings
    Setting the auth_to_local rules correclty can be quite crucial. This is especially true for KDS trust environments. The rules can be easily tested with the HadoopKerberosName call of Hadoop security. You can run it as: