By: Jerry Lam, Machine Learning Engineer

Some of the most successful machine learning models we have built at Paytm Labs rely on deep learning techniques using the TensorFlow library. Google released distributed TensorFlow in 2016 to train deep learning models that are too big for a single machine to handle. While this is great for data scientists, taking full advantage of it requires additional infrastructure. At Paytm Labs, we use Spark on Amazon Elastic MapReduce (EMR) extensively to generate training datasets for our deep learning models. It is therefore important for us to train TensorFlow models directly on the data produced by Spark, without copying the data back and forth between TensorFlow and Spark.

To bring distributed TensorFlow's capabilities to all machine learning practitioners at Paytm Labs, we have built a tool to automate the provisioning of distributed TensorFlow in EMR. To avoid reinventing the wheel, we researched the open source options and found that TensorFlow on Spark (TFoS) by Yahoo is the closest to what we need. TFoS requires the least migration effort for TensorFlow programs to use datasets produced by Spark. It supports all TensorFlow functionality, and it does not have the driver bottleneck issue that many frameworks have. Since AWS already automates the deployment of Spark in EMR, we only need to provision distributed TensorFlow on top of Spark. To do this, we use the bootstrap functionality of EMR to install and set up the libraries required by distributed TensorFlow. To make the task more complex, we also want to use the GPU devices in EMR to further speed up model training. At the time of writing, we know of no existing bootstrap scripts for TensorFlow on EMR. In this post, we walk through this bootstrap process.

  1. Install kernel-devel package

First and foremost, it is necessary to install the kernel-devel package on each EMR node (including the master node), because the NVIDIA driver needs it in order to build successfully. The release version of the Linux distribution must be provided to yum so that it fetches the correct kernel-devel package from the AWS package repository.
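
As a minimal sketch of this step, the helper below builds the yum command. The function name `install_kernel_devel`, the `DRY_RUN` switch, and the choice to pin the package to the running kernel with `uname -r` are assumptions made for illustration, not the exact command from our bootstrap script. The release version is passed in through yum's `--releasever` option.

```shell
#!/bin/sh
# Hypothetical helper (not from the original bootstrap script).
# $1: release version of the Linux distribution on the node.
# With DRY_RUN=1 (the default) the command is only printed, so it can be
# reviewed before it is run with root privileges on an EMR node.
install_kernel_devel() {
  # Assumption: the headers should match the running kernel, so the
  # package name is suffixed with the output of `uname -r`.
  pkg="kernel-devel-$(uname -r)"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "sudo yum --releasever=$1 install -y $pkg"
  else
    sudo yum --releasever="$1" install -y "$pkg"
  fi
}
```

Calling `install_kernel_devel <release>` prints the command; setting `DRY_RUN=0` first makes it run the installation.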

  2. Install NVIDIA driver

After the kernel-devel package is installed, install the NVIDIA driver. The G2 instances (available in EMR) use NVIDIA GRID K520 GPUs, so make sure to download the driver that corresponds to this device. AWS publishes more information about the NVIDIA device in G2 instances. After the download completes, install the driver. This will take a couple of minutes.

  3. Install CUDA toolkit

The required version of the CUDA toolkit depends on which NVIDIA driver is installed. In our case, we install CUDA version 8.0 on all nodes. After the toolkit is installed, add the CUDA executables to PATH and the CUDA libraries to LD_LIBRARY_PATH, and then run ldconfig to ensure that the new libraries are cached in the system.
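
The environment setup described above could look roughly like the sketch below. The install prefix `/usr/local/cuda-8.0`, the function names, and the file name `/etc/ld.so.conf.d/cuda.conf` are assumptions for illustration; adjust them to wherever the toolkit was actually installed.

```shell
#!/bin/sh
# Hypothetical helpers (not from the original bootstrap script).

# Prepend the CUDA executables and libraries to the search paths of the
# current shell. $1: CUDA install prefix (assumed default shown below).
setup_cuda_env() {
  cuda_home="${1:-/usr/local/cuda-8.0}"
  export PATH="${cuda_home}/bin:${PATH}"
  # Only append the old value when it is non-empty, to avoid a trailing ':'.
  export LD_LIBRARY_PATH="${cuda_home}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
}

# Register the CUDA library directory with the dynamic linker and rebuild
# its cache. ldconfig reads /etc/ld.so.conf.d, not LD_LIBRARY_PATH, so the
# directory is written to a config file first (file name is an assumption).
register_cuda_libs() {
  cuda_home="${1:-/usr/local/cuda-8.0}"
  echo "${cuda_home}/lib64" | sudo tee /etc/ld.so.conf.d/cuda.conf > /dev/null
  sudo ldconfig
}
```

Exports made inside a bootstrap script last only for that script's process, so in practice the same two `export` lines may also need to go into a profile file that later shells read.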

  4. Install cuDNN

The required version of cuDNN depends on which CUDA toolkit is installed. In our case, we install cuDNN version 5.1. After the download, follow NVIDIA's instructions to install the cuDNN library, then run ldconfig to ensure that cuDNN is cached and ready to be used.

  5. Install TensorFlow on Spark

Download the TFoS source code from git and then zip the tensorflowonspark directory. The zip package is needed at execution time, as described in the TFoS repository. For more information about TFoS installation, please refer to the TFoS repository.
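
A minimal sketch of the download-and-zip step follows. The function names and the output file name used in the usage note are assumptions; the repository URL is the public Yahoo TensorFlowOnSpark project on GitHub.

```shell
#!/bin/sh
# Hypothetical helpers (not from the original bootstrap script).

# Clone the TFoS sources into the directory given as $1.
fetch_tfos() {
  git clone https://github.com/yahoo/TensorFlowOnSpark.git "$1"
}

# Zip the tensorflowonspark package directory.
# $1: path to the TFoS checkout, $2: absolute path of the zip to create.
package_tfos() {
  # Run in a subshell so the caller's working directory is unchanged, and
  # zip from inside the checkout so the archive root is "tensorflowonspark".
  ( cd "$1" && zip -r -q "$2" tensorflowonspark )
}
```

For example, `fetch_tfos /home/hadoop/TensorFlowOnSpark` followed by `package_tfos /home/hadoop/TensorFlowOnSpark /home/hadoop/tfspark.zip` would produce a zip that can later be shipped with the Spark job.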

  6. Install TensorFlow and other data science libraries

The easiest way to install TensorFlow and other data science libraries is to use pip. After the installation is completed, test the TensorFlow installation by importing TensorFlow in the Python shell and running a short command that reports the devices it uses.

The output should show that TensorFlow recognizes the NVIDIA device and uses it as the default device for the computation specified in the TensorFlow application.
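
One way to perform such a check is sketched below, assuming a TensorFlow 1.x installation (the generation that pairs with CUDA 8.0 and cuDNN 5.1). The helper name `write_tf_gpu_check` and the script path in the usage note are hypothetical; `tf.Session`, `tf.ConfigProto` and its `log_device_placement` flag are TensorFlow 1.x APIs.

```shell
#!/bin/sh
# Hypothetical helper: write a small TensorFlow 1.x device check to the
# file given as $1. The check is then run with the node's Python.
write_tf_gpu_check() {
  cat > "$1" <<'EOF'
import tensorflow as tf

# log_device_placement=True makes the session log which device
# (CPU or GPU) each operation is assigned to.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
a = tf.constant([1.0, 2.0], name="a")
b = tf.constant([3.0, 4.0], name="b")
print(sess.run(a + b))
EOF
}
```

Running `write_tf_gpu_check /tmp/tf_gpu_check.py` and then `python /tmp/tf_gpu_check.py` prints the placement log; on a correctly provisioned node the operations should be assigned to a GPU device.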

After the above six steps are completed, we can run the MNIST example provided in TFoS. Below is the sample bootstrap script for your reference. The script is written for EMR release 5.5.0, but it should not be difficult to adapt it to other EMR releases by following the steps above. You need to upload the script to S3 and then reference it in the bootstrap section of the EMR launch dashboard.
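
The upload-and-launch step can also be done from the AWS CLI instead of the dashboard. The sketch below only prints the two commands so they can be reviewed first. It is not the bootstrap script itself, and the function name, bucket, instance type, and instance count are placeholder assumptions.

```shell
#!/bin/sh
# Hypothetical helper: print the AWS CLI commands that upload a bootstrap
# script to S3 and launch an EMR cluster that references it.
# $1: local path of the bootstrap script, $2: S3 bucket name.
emr_launch_commands() {
  script_name=$(basename "$1")
  s3_path="s3://$2/$script_name"
  echo "aws s3 cp $1 $s3_path"
  # Instance type and count are assumptions; any GPU instance type that
  # EMR offers can be substituted.
  echo "aws emr create-cluster --release-label emr-5.5.0 --applications Name=Spark --instance-type g2.2xlarge --instance-count 3 --use-default-roles --bootstrap-actions Path=$s3_path"
}
```

For example, `emr_launch_commands ./tf_bootstrap.sh my-bucket` prints one `aws s3 cp` command and one `aws emr create-cluster` command whose `--bootstrap-actions` path points at the uploaded script.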


Since TFoS also supports TensorBoard, you can access TensorBoard via the provided URL after running the TensorFlowOnSpark application. A TensorBoard screenshot of one of our LSTM models is shown below:

I hope this tutorial helps you build deep learning models on EMR!