Deep Learning at Scale Using Cray Distributed Training

This article was written by the following Cray and NERSC contributors: Steve Farrell, Machine Learning Engineer; Thorsten Kurth, Application Performance Specialist; Jacob Balma, Performance Engineer; Peter Mendygral, Performance and Software Engineer; Nick Hill, Software Engineer.

Deep neural networks (DNNs) are revolutionizing science across many domains, including high energy physics, cosmology, biology, and climate. As the field of deep learning advances, DNN architectures grow more sophisticated and capable of solving complex scientific tasks such as classification, regression, and simulation. Training and evaluating such models requires increasingly large datasets and computing resources.

Through the NERSC Big Data Center collaboration, NERSC and Cray have joined forces to enable scientific deep learning applications to effectively utilize the power of supercomputers.

In this blog post, we will discuss deep learning at scale, the Cray Distributed Training Framework (Cray PE ML Plugin for distributed data-parallel training of DNNs) and how the plugin can be used across a range of science domains with a few working examples.

Data Parallel Training
In general, there are three types of parallelism used in scaling the training of deep networks, referred to as data-parallel, model-parallel and layer-pipelined training (Figure 1).


Figure 1: Data Parallel Training (Image from Tal Ben-Nun and Torsten Hoefler, 2018, Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis)

Data-parallelism is one of the most popular methods today for speeding up DNN training because of its simplicity. In data-parallel training, the dataset is split across worker processes, each of which trains on its own subset of the global dataset in local mini-batches. Gradients are accumulated across workers to compute the weight updates, and those updates can be applied either synchronously or asynchronously. Some of the possible synchronization schemes are shown below.


Figure 2: Distributed Training Schemes (Image from Tal Ben-Nun and Torsten Hoefler, 2018 — Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis)

The most common synchronization scheme for data-parallel distributed training is synchronous stochastic gradient descent (S-SGD). It replicates the serial SGD algorithm and guarantees model consistency across workers. When scaling up the number of workers, one typically either holds the global mini-batch size fixed (“strong scaling”) or holds the per-worker local mini-batch size fixed (“weak scaling”). The former suffers from computation starvation and communication bottlenecks at large scales, whereas the latter usually scales better in terms of throughput but introduces additional convergence challenges. (Convergence at scale is a complex topic deserving of its own blog post.)
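To make the two regimes concrete, here is a small back-of-the-envelope sketch (the worker count and batch sizes are purely illustrative):

# Illustrative arithmetic for the two scaling regimes (numbers are made up).
num_workers = 64

# Weak scaling: the per-worker (local) mini-batch is fixed, so the global
# mini-batch grows with the number of workers.
local_batch = 32
global_batch_weak = local_batch * num_workers            # 2048 examples per step

# Strong scaling: the global mini-batch is fixed, so each worker's share
# shrinks as more workers are added.
fixed_global_batch = 1024
local_batch_strong = fixed_global_batch // num_workers   # 16 examples per worker

print(global_batch_weak, local_batch_strong)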

The Cray PE ML Plugin
The Cray PE ML Plugin is a scalable solution for distributed data-parallel training that plugs easily into popular frameworks like TensorFlow, Keras, and PyTorch. It stands out from similar tools, such as TensorFlow's built-in gRPC-based distributed training, by using highly optimized communication based on the Message Passing Interface (MPI) to aggregate gradients across workers.

In Figure 2, notice the two types of synchronous implementations: one using a parameter server, and one using a “decentralized” MPI communication pattern known as all-reduce. An all-reduce is a summation or average across all parallel processes of some arbitrary data. In the case of the Cray PE ML Plugin, the all-reduce is performed on the buffers holding gradients. So, despite the fact that each worker process will see different sets of examples in their local mini-batches, the end result is an aggregate of all processes’ independent gradients.
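As a conceptual illustration of the operation (using mpi4py and NumPy rather than the plugin itself), an all-reduce over a gradient buffer looks like this:

# Conceptual sketch of a gradient all-reduce with mpi4py, not the Cray plugin:
# every rank contributes its local gradients and receives the global average.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
local_grads = np.random.rand(10)       # stand-in for one worker's gradient buffer

global_grads = np.empty_like(local_grads)
comm.Allreduce(local_grads, global_grads, op=MPI.SUM)  # sum over all ranks
global_grads /= comm.Get_size()                        # average, matching serial SGD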

After the all-reduce completes, each worker uses the aggregated gradients to update its local weights. This represents one global step, or iteration, of SGD. As long as each worker's local model weights are initialized to matching (random) values at the start of training, the set of distributed models is guaranteed to remain synchronized throughout training.

A key feature of the plugin is its support for delayed synchronous gradients. During an optimization step, the plugin can use the gradients from a previous optimization step to compute the current update to the model weights. The benefits are twofold: it allows communication to overlap with computation for enhanced scalability, effectively removing the communication bottleneck at large scale, and it can provide a form of regularization in training.

For more details on the plugin’s features and strengths, see the papers CosmoFlow: Using Deep Learning to Learn the Universe at Scale and High Performance Scalable Deep Learning with the Cray Programming Environments Deep Learning Plugin.

Using the Plugin
High-performance computing (HPC) facilities like NERSC and supercomputers such as Cori (a Cray® XC40™ system) offer big advantages for DNN applications. The enormous compute power, coupled with high-performance networking and I/O, allows for efficient, scalable training.

We used the Cray PE ML Plugin on Cori with two example models written in TensorFlow. The first is the HEP-CNN benchmark, a high-energy physics application for classifying ATLAS detector images. The HEP-CNN model (Figure 3) is a basic convolutional neural network with four convolutional layers, pooling layers, two fully-connected layers, and a binary classification output. This example uses TensorFlow's MonitoredTrainingSession to set up the training session with insertable hooks. The second example uses the standard ResNet architecture from the official TensorFlow models repository and uses the latest Estimator API and functionality available in TensorFlow. The code examples we will be discussing are available on GitHub.


Figure 3: The HEP-CNN Model Architecture

Enabling the Plugin
To enable scaling with the Cray PE ML Plugin, we only need to make a few changes to a serial (or gRPC-distributed) training script, starting from the serial script in the example repository and ending up with its distributed counterpart.

At the highest level, the modifications needed will:

1) Initialize MPI and the Cray PE ML Plugin
2) Synchronize model weights and biases at the start of training with a broadcast hook
3) Aggregate gradients in each training step with an all-reduce

For this post, we just highlight the key snippets that utilize the plugin. Refer to the examples and references for full details.

Initialization
At the start of our training script, we import the Cray PE ML Plugin as follows:
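(A minimal sketch; the module name ml_comm is our assumption, inferred from the mc alias used in the rest of this post.)

# Module name assumed to match the `mc` alias used in the snippets below;
# check your Cray PE installation for the exact name.
import ml_comm as mc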

The next essential snippet is the function call that sets up MPI, which should come after all imports but before any other code:
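(A minimal sketch; the function name is our assumption for the MPI setup routine described above.)

# Set up MPI for the plugin. Must run after all imports, before any other code.
mc.init_mpi()  # name assumed; see the plugin documentation for the exact call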

Once MPI is initialized we can query the number of workers and the current worker number (i.e., the rank) with calls to mc.get_nranks() and mc.get_rank(), respectively. This lets us implement rank-specific logic such as printing and saving checkpoints or splitting up the dataset evenly across workers.
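For example, rank-aware logic along these lines (the all_input_files list and the chief-only printing are hypothetical):

nranks = mc.get_nranks()   # total number of workers
rank = mc.get_rank()       # this worker's rank, 0 .. nranks-1

# Only rank 0 (the "chief") prints and writes checkpoints; every rank takes an
# even, disjoint slice of the input files. `all_input_files` is hypothetical.
is_chief = (rank == 0)
local_files = all_input_files[rank::nranks]
if is_chief:
    print("Training with {} workers".format(nranks))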

The rest of the plugin initialization actually happens in the next section, via the HEP-CNN benchmark's InitConfigBcastTensors hook.

Configure and Broadcast
For the configuration of the plugin as well as model weight synchronization, we use a TensorFlow SessionRunHook, which lets us plug in code that executes at the start and end of a session and before/after each call to session.run().

In our run script, before training, we construct the InitConfigBcastTensors hook as follows:
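(A sketch of that construction; the variable names passed to the constructor are illustrative.)

# Build the hook before creating the MonitoredTrainingSession. The constructor
# takes the total number of optimization steps and the fraction of those steps
# that should remain fully synchronous.
hooks = [InitConfigBcastTensors(total_steps, sync_fraction)]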

The arguments to the hook constructor specify the total number of optimization steps and the fraction of steps that should use fully synchronous (prompt) gradients.

The implementation of this hook can be found in the cray_mlcomm_helper_functions module. In its begin method, the hook initializes and configures the plugin and then sets up the broadcast operation with the following code:
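(A hedged sketch of that begin method, assuming import tensorflow as tf in TensorFlow 1.x and the mc alias from earlier; mc.broadcast and its arguments are assumptions drawn from the plugin's published examples.)

def begin(self):
    # Plugin initialization and configuration (buffer sizes, number of fully
    # synchronous steps, etc.) happen here first; the exact calls are
    # version-dependent, so consult the plugin documentation.

    # Build the broadcast op: rank 0's initial weight values are assigned to
    # the corresponding variables on every other worker.
    new_vars = mc.broadcast(tf.trainable_variables(), 0)
    self.bcast = tf.group(*[tf.assign(v, new_vars[i])
                            for i, v in enumerate(tf.trainable_variables())])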

The broadcast operation is then executed in the hook’s after_create_session method, ensuring that our models across workers are synchronized immediately after our training session is created.

Gradient Aggregation and Application
Now, we arrive at the most critical step: defining the gradient aggregation across workers. When preparing the optimization step operation in TensorFlow, we need to intercept the gradient computation from the optimizer and use the mc.gradients function before giving the result back to the optimizer. We accomplish this with just four lines of code:
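(A sketch of those four lines, assuming a standard tf.train optimizer; the second, team-id argument to mc.gradients is our assumption.)

# Compute local gradients, all-reduce them across workers via the plugin, then
# hand the aggregated gradients back to the optimizer for the weight update.
grads_and_vars = optimizer.compute_gradients(loss)
grads = mc.gradients([gv[0] for gv in grads_and_vars], 0)  # team id 0 (assumed)
agg_grads_and_vars = [(g, v) for (_, v), g in zip(grads_and_vars, grads)]
train_op = optimizer.apply_gradients(agg_grads_and_vars,
                                     global_step=tf.train.get_global_step())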

The result of inserting this into the TensorFlow computation graph is that at every training iteration, each worker aggregates gradients from all the other workers and uses the result to update its model weights. Since every worker computes the same update, the models stay in sync across all training steps.

Finalization and Summary
The only piece left to highlight is the function call which finalizes MPI and cleans up the plugin’s communication buffers. This part goes in the script after training has completed:
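(A minimal sketch; the function name is our assumption for the finalization call described above.)

# After training completes: free the plugin's communication buffers and shut
# down MPI cleanly. Name assumed; see the plugin documentation.
mc.finalize()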

Note that the rest of the model and training code is the same as it would be normally. With only these few modifications to the setup portions of our training script, we can enable scalable data-parallel distributed training of our deep neural networks. The plugin's optimized features should allow anyone to see near-linear speedups in training sample throughput, making it possible to effectively leverage large-scale HPC resources for deep learning.

Scaling Performance on the Cori Supercomputer
We’ve shown how straightforward it is to use the plugin in your TensorFlow training code on Cori, but how well does it scale? Here are three examples: the HEP-CNN high-energy physics application, a standard image-classification model (ResNet), and CosmoFlow.

Figure 4 shows the training throughput of HEP-CNN scaled up to 1,000 Intel® Xeon Phi™ “Knights Landing” (KNL) nodes using the Cray PE ML Plugin as well as alternative tools: Horovod with MPI and the Intel Machine Learning Scaling Library (MLSL). All three solutions show good scaling performance up to hundreds of nodes, but the Cray plugin scales best up to 1,000 nodes. This demonstrates the benefit of the plugin's delayed-gradients feature, which allows communication and computation to overlap and reduces the bottleneck from network communication.


Figure 4: Weak-scaling training throughput of HEP-CNN on Cori KNL

Figure 5 compares training of ResNet50 on the ImageNet dataset on Cori with the Cray PE ML Plugin and Horovod at local batch sizes of 32 and 128. In TFRecord format, the ImageNet dataset requires approximately 145 GB of storage space. This demonstrates that the plugin can be used easily with modestly sized datasets across a range of batch sizes without impacting the scalability of the distributed training algorithm, whereas Horovod can suffer from decreased scaling efficiency at small local batch sizes and large scale.


Figure 5: Weak-scaling training throughput of ResNet50 on Cori KNL

Figure 6 shows CosmoFlow scaling on Cori and highlights the use of the Cray PE ML Plugin in combination with Cray's Burst Buffer technology to allow efficient scaling out to 8,192 nodes. This capability is especially relevant for scientific applications like CosmoFlow and HEP-CNN, where datasets can be significantly larger than ImageNet. In the case of CosmoFlow, there was a total of 1.4 TB of TFRecords to be read in as training data.


Figure 6: Scaling of fully synchronous training on Cori based on epoch timing. The right panel is a zoomed-in version of the left.

Run the HEP-CNN Example Yourself
To check out the example code to view or run yourself, use the following git command to clone the repository:
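(The command has the usual form below; the repository URL is a placeholder, so substitute the HEP-CNN example repository linked from this post.)

git clone <hep-cnn-example-repository-url>
cd <repository-directory>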

Cray PE ML Plugin Documentation
For information on the Cray PE ML Plugin's Python API, you can start a Python session and type the following:
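(A minimal sketch; the module name ml_comm is our assumption, matching the mc alias used above.)

import ml_comm as mc
help(mc)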

In summary, we’ve shown how the Cray PE ML Plugin can be used through its Python API to add distributed training capabilities to a serial TensorFlow training script. We’ve learned how each call described in the example above maps to a corresponding phase of the distributed training algorithm. This strategy of splitting up the dataset across workers, broadcasting initial weights and biases, and reducing gradients with each global training step can be applied generally to other training scripts and DNNs to achieve similar performance improvements.

Stay tuned for future NERSC BDC blog postings where we’ll dive deeper into other scientific use cases of applied DL and show how optimizer choice and state-of-the-art learning rate scheduling can be used to maintain convergence properties at even the most extreme scales.

We also provide links to references in the section below.

For references in this blog, see:
[1] Tal Ben-Nun and Torsten Hoefler, 2018 — Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv:1802.09941 [cs.LG]

[2] Peter Mendygral, Nick Hill, Krishna Kandalla, Diana Moise, Jacob Balma and Marcel Schongens, 2018 — High Performance Scalable Deep Learning with the Cray Programming Environments Deep Learning Plugin, Cray Users Group,
www.cray.com/sites/default/files/resources/CUG-DeepLearning.pdf

[3] Thorsten Kurth, Mikhail Smorkalov, Peter Mendygral, Srinivas Sridharan, Amrita Mathuriya, 2018 — TensorFlow at Scale – MPI, RDMA and All That, Cray Users Group, https://cug.org/proceedings/cug2018_proceedings/includes/files/pap146s2-file1.pdf

[4] Amrita Mathuriya, Deborah Bard, Peter Mendygral, Lawrence Meadows, James Arnemann, Lei Shao, Siyu He, Tuomas Karn, Diana Moise, Simon J. Pennycook, Kristyn Maschoff, Jason Sewall, Nalini Kumar, Shirley Ho, Mike Ringenburg, Prabhat, Victor Lee, 2018 — CosmoFlow: Using Deep Learning to Learn the Universe at Scale, https://arxiv.org/pdf/1808.04728.pdf
