Distributing TensorFlow

While at its core TensorFlow is a distributed computation framework, there is little detailed documentation, besides the official HowTo, on the way TensorFlow deals with distributed learning. This post is an attempt to learn by example about TensorFlow’s distribution capabilities. To that end, the existing MNIST tutorial is adapted into a distributed execution graph that can be executed on one or multiple nodes.

The framework offers two basic ways to distribute the training of a model. In the simplest form, the same computation graph is executed on multiple nodes in parallel, each on batches of the replicated data. This is known as Between-Graph Replication. Each worker updates the parameters of the same model, which means that all worker nodes share one model. In synchronous training, the updates to the shared model are averaged before being applied. In asynchronous training, the workers update the shared model parameters independently of each other. While asynchronous training is known to be faster, synchronous training tends to provide more accuracy.
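The difference between the two update modes can be illustrated without TensorFlow at all. The following is a minimal sketch, assuming a toy one-parameter model with the loss 0.5 * (w - target)**2 and two workers that each hold one "batch" (all names here are made up for illustration). In the synchronous case the gradients are averaged and applied once; in the asynchronous case every worker applies its own gradient, computed from the same parameter snapshot, without waiting for the others.

```python
def gradient(w, target):
    """Gradient of the toy loss 0.5 * (w - target)**2 with respect to w."""
    return w - target


def synchronous_step(w, targets, lr):
    """All workers read the same w; their gradients are averaged, then applied once."""
    grads = [gradient(w, t) for t in targets]
    return w - lr * sum(grads) / len(grads)


def asynchronous_step(w, targets, lr):
    """All workers read the same snapshot of w; each applies its own update unaveraged."""
    grads = [gradient(w, t) for t in targets]  # computed from the same snapshot
    for g in grads:
        w = w - lr * g  # applied independently, in arrival order
    return w


worker_targets = [1.0, 3.0]  # one toy "batch" per worker
w_sync = synchronous_step(0.0, worker_targets, lr=0.5)
w_async = asynchronous_step(0.0, worker_targets, lr=0.5)
print(w_sync, w_async)  # 1.0 2.0
```

With these numbers the asynchronous variant moves the parameter twice as far in one round, which hints at why it progresses faster but can overshoot.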

But there is also another way in which TensorFlow is able to distribute its computation. With In-Graph Replication there is only one client, which contains the model parameters and assigns the compute-intensive calculations of the model to specific worker tasks, essentially working like a resource manager. Of the two, Between-Graph Replication is the more common distribution model one finds on the internet.

Kinds of Processes

Let’s quickly touch on the different roles a process in the TensorFlow framework can take on. First, each TensorFlow graph has at least one client. The client essentially executes the graph computation by connecting to a local or remote Session. For distribution, the client connects to a Master service, which is responsible for distributing the processing among worker nodes. Finally, there are the workers, which do the actual computation. Here it helps to understand that TensorFlow is a general-purpose computation framework, so this computation can be almost anything defined as a step inside the computation graph of the client.

When multiple clients run simultaneously (e.g. with Between-Graph Replication), each client would also run initialization steps such as parameter initialization. That is not only a waste of resources but would, in an asynchronous execution, lead to unexpected results. To prevent this, one of the workers is given a special role: it performs the initialization steps once for all clients. This is the role of the chief worker.
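How the chief is picked is up to the training script. A common convention, assumed here and not taken from this post’s code, is to treat the worker task with index 0 as the chief. A minimal sketch of that rule (the job name "worker" is likewise an assumption):

```python
def is_chief(job_name, task_index):
    """Assumed convention: the worker with task index 0 acts as the chief."""
    return job_name == "worker" and task_index == 0


# A small example cluster: one parameter server and three workers.
tasks = [("ps", 0), ("worker", 0), ("worker", 1), ("worker", 2)]
chiefs = [task for task in tasks if is_chief(*task)]
print(chiefs)  # [('worker', 0)]
```

Because the rule depends only on values every process already knows, no extra coordination is needed to agree on who the chief is.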

Distributed MNIST

Here we take the simple MNIST example from the TensorFlow tutorial and adapt it to run in a distributed way. For testing and demonstration purposes the code can be executed on a single machine.

Tasks or processes that belong to an execution graph in TensorFlow are considered a cluster. Each task in a cluster can take one of the previously defined roles. In this example the cluster has a set of Parameter Servers (ps) and Workers (workers), each given as a comma-separated list of hostname and port pairs.
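As a rough sketch of how such comma-separated lists can be turned into a cluster description, the snippet below parses two flags into a dictionary that maps job names to task addresses. The flag names `--ps_hosts` and `--worker_hosts`, the job name `worker` and the ports are assumptions for illustration. A dictionary of this shape is what `tf.train.ClusterSpec` accepts, but the sketch deliberately stays free of TensorFlow so it runs anywhere.

```python
import argparse


def split_hosts(hosts):
    """Turn 'host:port,host:port' into a clean list of addresses."""
    return [h.strip() for h in hosts.split(",") if h.strip()]


def build_cluster(ps_hosts, worker_hosts):
    """Map each job name to the list of its task addresses."""
    return {"ps": split_hosts(ps_hosts), "worker": split_hosts(worker_hosts)}


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Flag names are assumed for this sketch.
    parser.add_argument("--ps_hosts", default="localhost:2222")
    parser.add_argument("--worker_hosts", default="localhost:2223,localhost:2224")
    return parser.parse_args(argv)


args = parse_args(["--ps_hosts=localhost:2222",
                   "--worker_hosts=localhost:2223,localhost:2224"])
cluster = build_cluster(args.ps_hosts, args.worker_hosts)
print(cluster)
```

The position of an address in its list is the task index, so `localhost:2224` would be worker task 1 in this layout.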

To run this on a single machine, the script is executed once for each participating process. To make the tasks distinguishable, each one is assigned an index.

When executing the script we also define, via the --job_name parameter, which role the process takes. Parameter Servers (ps) simply join the Session, while workers execute different aspects of the graph calculation depending on the kind of distribution.
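The branching on the role can be sketched as follows. In a TensorFlow 1.x script the `server` would be a `tf.train.Server`, whose `join()` blocks and whose `target` is what a worker’s session connects to. Here a small stand-in class is used so the sketch runs without a cluster; `FakeServer`, `run_task` and `train_fn` are hypothetical names.

```python
class FakeServer:
    """Stand-in for a TensorFlow server so the sketch runs without a cluster."""

    target = "grpc://localhost:2223"

    def __init__(self):
        self.joined = False

    def join(self):
        # A real server would block here and serve requests until shut down.
        self.joined = True


def run_task(job_name, server, train_fn):
    """Dispatch on the role given via --job_name."""
    if job_name == "ps":
        server.join()  # parameter servers only host variables
        return None
    if job_name == "worker":
        return train_fn(server.target)  # workers build and run the graph
    raise ValueError("unknown job name: %r" % job_name)


ps_server = FakeServer()
run_task("ps", ps_server, train_fn=None)
result = run_task("worker", FakeServer(), lambda target: "trained via " + target)
print(ps_server.joined, result)  # True trained via grpc://localhost:2223
```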

The parameter server processes share and coordinate the accumulated updates to the parameters of the model. A worker process executes a specific task as part of the graph execution. We already discussed that with Between-Graph Replication each worker runs the same training on a batch of input data. It is probably the most common distributed training mode one can find on the internet and is also the one demonstrated here. Each worker has its own client and execution graph, sharing the model parameters while executing the computation on a batch of input data.
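To make the split between shared parameters and per-worker computation more tangible, here is a small simulation of device placement. In TensorFlow 1.x this job is usually done by `tf.train.replica_device_setter`, which by default places variables on the ps tasks in round-robin order and everything else on the local worker. The functions below only mimic that idea in plain Python; the variable names `W` and `b` are assumed for a softmax model.

```python
from itertools import cycle


def place_variables(variable_names, num_ps_tasks):
    """Assign each variable to a parameter server task in round-robin order."""
    ps_devices = cycle("/job:ps/task:%d" % i for i in range(num_ps_tasks))
    return {name: next(ps_devices) for name in variable_names}


def worker_device(task_index):
    """Device on which a worker runs its own copy of the computation."""
    return "/job:worker/task:%d" % task_index


placement = place_variables(["W", "b"], num_ps_tasks=2)
print(placement)         # {'W': '/job:ps/task:0', 'b': '/job:ps/task:1'}
print(worker_device(1))  # /job:worker/task:1
```

Every worker computes the same placement for the variables, which is how separately built graphs end up referring to one shared set of parameters.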

The supervisor takes care of session initialization, restoring from a checkpoint, and closing the session when training is done or an error occurs. It is also responsible for initializing shared state, such as the model parameters, on behalf of all workers.
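The coordination this implies can be sketched with plain threads: the chief initializes the parameters (or restores them from a checkpoint) exactly once, and all other workers wait until the model is ready. This is only an illustration of the protocol under those assumptions, not the code of `tf.train.Supervisor`; `SharedModel` and `prepare_session` are hypothetical names.

```python
import threading


class SharedModel:
    """Toy stand-in for the parameters held by the parameter servers."""

    def __init__(self):
        self.params = None
        self.ready = threading.Event()


def prepare_session(model, is_chief, checkpoint=None):
    """Chief initializes or restores once; everyone else waits for it."""
    if is_chief:
        model.params = dict(checkpoint) if checkpoint else {"W": 0.0, "b": 0.0}
        model.ready.set()
    else:
        model.ready.wait(timeout=5.0)
    return model.params


model = SharedModel()
results = {}


def worker(task_index):
    results[task_index] = prepare_session(model, is_chief=(task_index == 0))


# Start the non-chief workers first to show that they really wait.
threads = [threading.Thread(target=worker, args=(i,)) for i in (1, 2, 0)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results[1] is results[0])
```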

To run this on one or multiple machines, one command per task has to be executed on the respective machine, or several times on the same machine, with the IP addresses adapted to the actual hosts.
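As a sketch of what those invocations might look like, the helper below builds one command line per task from a cluster description. The script name `distributed_mnist.py` and the flags `--ps_hosts`, `--worker_hosts` and `--task_index` are assumptions for illustration; only `--job_name` is named in the text above. The addresses use localhost and would need to be replaced with the real IP addresses for a multi-machine setup.

```python
import shlex


def build_commands(script, cluster):
    """Return one shell command per task in the cluster."""
    ps_hosts = ",".join(cluster["ps"])
    worker_hosts = ",".join(cluster["worker"])
    commands = []
    for job_name in ("ps", "worker"):
        for task_index in range(len(cluster[job_name])):
            argv = [
                "python", script,
                "--ps_hosts=" + ps_hosts,
                "--worker_hosts=" + worker_hosts,
                "--job_name=" + job_name,
                "--task_index=%d" % task_index,
            ]
            commands.append(" ".join(shlex.quote(a) for a in argv))
    return commands


cluster = {"ps": ["localhost:2222"],
           "worker": ["localhost:2223", "localhost:2224"]}
commands = build_commands("distributed_mnist.py", cluster)
for command in commands:
    print(command)
```

Every process receives the full host lists, and only the job name and task index differ between invocations.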

The complete code of a softmax MNIST training in a distributed TensorFlow graph can be found below:
