Scalable Deep Learning with BigDL on the Urika-XC Software Suite

Deep-learning-based AI approaches have resulted in remarkable advancements in diverse fields such as computer vision, speech recognition, natural language processing and bioinformatics. As part of our continuing efforts to offer customers a single system that can host both simulations and advanced data analytics, we recently launched the Cray® Urika®-XC analytics software suite, bringing graph analytics, deep learning and robust big data analytics tools to the company’s flagship line of Cray® XC™ supercomputers.

An important piece of the Urika-XC suite is the BigDL deep learning framework. BigDL is a distributed deep learning library for Apache Spark™ developed inside Intel and contributed back to the open source community. BigDL is implemented as a library on top of Spark, so that users can write their deep learning applications as standard Spark programs. As a result, it can be seamlessly integrated with other libraries on top of Spark (e.g., Spark SQL and Dataframes, Spark ML pipelines, Spark Streaming, Structured Streaming, etc.), and can directly run on top of existing Spark or Hadoop® clusters.

Urika-XC Software – An Introduction

Data analytics is a significant and growing segment of high-performance workloads throughout government, academia and industry. The Urika-XC suite is a powerful set of tools that accelerates this convergence of big data analytics and HPC by allowing graceful deployment of analytics and scalable deep learning workloads alongside traditional simulation-based workloads on Cray XC series supercomputers.

The Urika-XC suite boasts a carefully curated software stack designed to support the needs of a high-performance data scientist with productivity tools to simplify and accelerate job execution. Apache Spark serves as a core data processing application, with Scala, Java, Python and R top level APIs. A large dataset can be effortlessly distributed among hundreds of nodes, bringing dataset preparation wait times down from days to hours. Paired with Spark is the Intel BigDL deep learning package built on Spark, allowing for a seamless transition from dataset curation to model training to inference. Other packages include the Anaconda Python distribution, TensorFlow, Cray Graph Engine and Dask Distributed.

Tutorial: End to End Workflow with BigDL on the Urika-XC Suite

In the rest of this blog post, we will showcase the capabilities of BigDL on the Urika-XC suite by demonstrating an end-to-end workflow featuring data preparation, model training and inference, all within Spark. We will also investigate the scalability of BigDL on Cray XC supercomputers by running an example training model at various node counts.

Using BigDL in the Urika-XC Suite

In a slurm environment, first load the analytics module and allocate a cluster of size n+2. This defines a cluster of size n with 1 interactive node and 1 spark-shell node.

Urika-XC software also supports running BigDL through Jupyter notebooks, and monitoring BigDL training with TensorBoard.

The Precipitation Nowcasting Workflow

With these tools at your disposal, data analytics workloads can employ machine learning and deep learning technologies as an integrated workflow. We have developed such a workflow intended to capture all stages of deep learning via Apache Spark and Intel BigDL, all visualized with Jupyter Notebooks and analyzed with TensorBoard.

The intent of this workflow is to predict short-term precipitation with a deep learning approach. This requires rapid processing of over a terabyte of raw radar scan data with Apache Spark to convert it into NumPy ndarrays represented as an RDD ready for ingestion into BigDL. To illustrate these tools, we’ve provided screenshots of this workflow including data processing, model training, inference and model analysis with TensorBoard.

Data Processing

Our first screenshot shows the Nowcasting data preparation in Spark running in a Jupyter notebook on the Urika-XC suite:

Model Training

Once the data is prepared, we can begin using BigDL to train a convolutional LSTM neural network to predict short-term precipitation using current radar images. Using Jupyter, we can also plot the training loss over time as the model training progresses:

Inference

Once our neural network is trained, we can use it to predict precipitation patterns for the next hour:

Model Analysis with TensorBoard

The Urika-XC software suite also ships with TensorBoard installed. TensorBoard allows you to visualize additional detail about your network with both TensorFlow and BigDL:

Scalability with BigDL

We also investigated the scalability of BigDL on the Urika-XC suite. The following figure shows the results of training ImageNet Inception V1 on dual-socket Intel Broadwell 2.1 GHz (CCU 36 and DDR4 2400) nodes, starting at 16 nodes and scaling up to 256 nodes. We set the learning rate to 0.10 and Spark’s executer memory to 120 GB.

Our first trial (shown in red) saw reasonable scaling up to 64 nodes, but performance plateaued at 96 nodes and began to decline soon after. We worked with Intel to identify scaling bottlenecks relating to gradient compression and synchronization in the Block Manager. Intel’s BigDL team fixed these bottlenecks in their 0.3 release (which is packaged with Urika-XC 1.1 software), resulting in the greatly improved scaling curve shown above in green.

Wrapping Up

The integrated AI and analytics stack in the Urika-XC suite, combined with the scalable Spark-based deep learning capabilities of BigDL, allow data scientists to construct complex workflows that harness the power of Cray’s supercomputers. More details about Urika-XC software can be found on the Cray website. BigDL is available on github.

Radhika Rangarajan from Intel and Omid Khanmohamadi and Mike Ringenburg from Cray also contributed to this blog post.

Speak Your Mind

Your email address will not be published. Required fields are marked *