While at it’s core TensorFlow is a distributed computation framework besides the official HowTo there is little detailed documentation around the way TensorFlow deals with distributed learning. This post is an attempt to learn by example about TensorFlow’s distribution capabilities. Therefor the existing MNIST tutorial is taken and adapted into a distributed execution graph that can be executed on one or multiple nodes.
The framework offers two basic ways for distributed training of a model. In the simplest form the same data and computation graph is executed on multiple nodes in parallel on batches of the replicated data. This is known as Between-Graph Replication. Each worker updates the parameters of the same model, which means that each of the worker nodes are sharing a model. Updates to the shared model get averaged before being applied, this is at least the case for the synchronous training of a distributed model. In case of an asynchronous training the workers update the shared model parameters independently of each other. While the asynchronous training is known to be faster, the synchronous training proofs to provide more accuracy.
Continue reading “Distributing TensorFlow”
Recently I had the pleasure of traveling to San Diego the self-proclaimed American capital of Craft Beer. Of course I had to do a little research of my own while visiting this amazing city. Here is the short abstract of a tasteful field trip to some of the exceptional beer places in San Diego. Continue reading “San Diego Brewery Field Trip”
Over two years ago in March 2014 I joined the Iron Blogger community in Munich, which is one of the largest, still active Iron Blogger communities worldwide. You can read more about my motivation behind it here in one of the 97 blog posts published to date: Iron Blogger: In for a Perfect Game.
The real fact is that I write blogs solely for myself. It’s my own technical reference I turn to. Additionally writing is a good way to improve once skills and technical capabilities, as Richard Guindon puts it in his famous quote:
“Writing is nature’s way of letting you know how sloppy your thinking is.”
What could be better suited to improve something than by leaning into the pain, how the great Aaron Swartz, who died way too early, once described it? And it is quite a bit of leaning into the pain publishing a blog post every week. Not only for me, but also for those close to me. But I am going to dedicate a separate blog post to a diligent retrospection in the near future. This post should all be about NUMBERS. Continue reading “2016 in Numbers”
Apache Zeppelin is a web-based, multi-purpose notebook for data discovery, prototyping, reporting, and visualization. With it’s Spark interpreter Zeppelin can also be used for rapid prototyping of streaming applications in addition to streaming-based reports.
In this post we will walk through a simple example of creating a Spark Streaming application based on Apache Kafka. Continue reading “Simple Spark Streaming & Kafka Example in a Zeppelin Notebook”
In any HDP cluster with a HA setup with quorum there are two NameNodes configured with one working as the active and the other as the standby instance. As the standby node does not accept any write requests, for a client try to write to HDFS it is fairly important to know which one of the two NameNodes it the active one at any given time. The discovery process for that is configured through the hdfs-site.xml.
For any custom implementation it’s becomes relevant to set and understand the correct parameters if a current hdfs-site.xml configuration of the cluster is not given. This post gives a sample Java implementation of a HA HDFS client. Continue reading “Sample HDFS HA Client”
Livy.io is a proxy service for Apache Spark that allows to reuse an existing remote SparkContext among different users. By sharing the same context Livy provides an extended multi-tenant experience with users being capable of sharing RDDs and YARN cluster resources effectively.
In summary Livy uses a RPC architecture to extend the created SparkContext with a RPC service. Through this extension the existing context can be controlled and shared remotely by other users. On top of this Livy introduces authorization together with enhanced session management.
Analytic applications like Zeppelin can use Livy to offer multi-tenant spark access in a controlled manner.
This post discusses setting up Livy with a secured HDP cluster.
Continue reading “Connecting Livy to a Secured Kerberized HDP Cluster”
Hadoop supports multiple file formats as input for MapReduce workflows, including programs executed with Apache Spark. Defining custom InputFormats is a common practice among Hadoop Data Engineers and will be discussed here based on publicly available data set.
The approach demonstrated in this post does not provide means for a general MATLAB™ InputFormat for Hadoop. This would require significant effort in finding a general purpose mapping of MATLAB™’s file format and type system to the ones of HDFS. Continue reading “Custom MATLAB InputFormat for Apache Spark”
MATLAB™ is a widely used professional tool for numerical processing used across multiple divers disciplines like Physics, Chemistry, and Mathematics. You can encounter multiple public data sets which are published in MATLAB™ format. This article gives a brief example of such data set and reading it from R. Continue reading “Reading Matlab files with R”
Apache Ambari rapidly improves support for secure installations and managing security in Hadoop. Already now it is fairly convenient to create kerberized clusters in a snap with automated procedures or the Ambari wizard.
With the latest release of Ambari kerberos setups get baked into blueprint installations making separate methods like API calls unnecessary. In this post I would like to briefly discuss the new option in Ambari to use pure Blueprint installs for secure cluster setups. Additionally explaining some of the prerequisites for a sandbox demo like install. Continue reading “Kerberos Ambari Blueprint Installs”