Today, Cray, NERSC (the National Energy Research Scientific Computing Center at Lawrence Berkeley National Laboratory) and Intel announced the results of a three-way collaboration around CosmoFlow, a deep learning 3D convolutional neural network (CNN) that can predict cosmological parameters with unprecedented accuracy using the Intel-powered Cray® XC™ series “Cori” supercomputer at NERSC.
Supercomputers are unique in their ability to be instruments of discovery for problems on the smallest and largest of scales — from subatomic scale to the cosmos. Cosmologists who study the origin, evolution and eventual fate of the universe use a combination of empirical observations and theoretical computer simulations to define and refine a model of our understanding of the universe. At the core of the model are cosmological parameters describing the global dynamics of the universe, many of which can only be estimated through simulation models — and now deep learning.
CosmoFlow is an evolutionary step, refining earlier work by a team at Carnegie Mellon University who showed that it was “possible to estimate … cosmological parameters directly from the distribution of matter” using 3D convolutional neural networks. CosmoFlow is built as a 3D convolutional neural network trained on large 3D datasets generated by N-body simulations. That combination is usually avoided: typical projects working with 3D data convert it to 2D because of the compute and I/O demands of 3D.
What makes CosmoFlow unique is the unprecedented level of accuracy achieved when compute resources at the scale of the NERSC Cori system are brought to bear. Three parameters require an enormous amount of data and compute power to estimate: the density parameter describing the proportion of matter in the universe (Ωm), the amplitude of matter density fluctuations on scales of 8 h⁻¹ Mpc (σ8), and the power law index of the density perturbation spectrum after inflation (Ns). The research team showed that the CosmoFlow CNN could estimate the values of Ωm and σ8 to the same level of accuracy as existing experiments, and Ns significantly better than previous uses of deep learning for estimation (5x less error than previous measurements).
This is all well and good as far as the cosmology is concerned, but equally exciting were the breakthroughs in the use of large-scale supercomputing that enabled these scientific results. I’d like to share three reasons why the CosmoFlow results are a big deal to us here at Cray:
- Extreme sustained performance. The team was able to perform fully synchronous data-parallel training on 8,192 nodes of Cori, “achieving 3.5 Pflop/s sustained performance.”
- TensorFlow at scale with a science application. TensorFlow is the most popular of the widely available machine learning frameworks. Without making any modifications to TensorFlow, the team believes CosmoFlow “is the first large-scale science application of the TensorFlow framework at supercomputer scale with fully synchronous training.”
- Something that couldn’t be done otherwise. Without the power of a supercomputer, this work simply could not be performed. Most deep learning exploration today is performed on small, single-node systems. In the authors’ estimate you’d need “more than 60 days of execution time on a single node to converge to a model at an acceptable loss.” The CosmoFlow run using 8,192 nodes “took roughly 9 minutes total with 8 minutes of training time.”
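To put the quoted figures in perspective, here is a quick back-of-the-envelope calculation in Python. It uses only the numbers cited above; the variable names are mine, and the single-node time is the authors' estimate, not a measured run, so treat the wall-clock ratio as a rough indication and not as a parallel-efficiency measurement.

```python
# Figures quoted in the bullets above.
NODES = 8_192
SUSTAINED_FLOPS = 3.5e15      # 3.5 Pflop/s sustained
SINGLE_NODE_DAYS = 60         # "more than 60 days", so this is a lower bound
TRAINING_MINUTES = 8          # training portion of the roughly 9-minute run

# Average sustained rate contributed by each node.
per_node_gflops = SUSTAINED_FLOPS / NODES / 1e9

# Ratio of the estimated single-node time to the 8,192-node training time.
single_node_minutes = SINGLE_NODE_DAYS * 24 * 60
wallclock_ratio = single_node_minutes / TRAINING_MINUTES

print(f"{per_node_gflops:.0f} Gflop/s sustained per node")
print(f"about {wallclock_ratio:,.0f}x shorter wall-clock time")
```

Averaged over the machine, that is a little over 400 Gflop/s sustained per node, and a training time at least four orders of magnitude shorter than the single-node estimate.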
From a Cray perspective, I also want to highlight two technologies that underpinned the achievement of these results: the Cray PE Machine Learning Plugin and the DataWarp I/O accelerator.
- The Cray PE Machine Learning Plugin, a part of the Cray® Urika®-XC Analytics and AI suite, dramatically improves the scalability and performance of TensorFlow distributed training. The Cray PE ML Plugin replaces the TensorFlow socket-based gRPC communications and the associated parameter server/worker architecture with an MPI-optimized communication mechanism, and implements speed-up algorithms for synchronous stochastic gradient descent training. Using the Cray PE ML Plugin, the team was able to use 8,192 nodes to do fully synchronized data-parallel training, where previous efforts on Cori had encountered significant scaling issues.
- Running distributed model training isn’t just a compute problem; it’s also a data I/O problem, as each compute node has to read data in parallel. The Cori system at NERSC has both a native Lustre storage system and a “burst buffer” file system made up of Cray DataWarp I/O accelerator nodes. Using DataWarp, the CosmoFlow I/O performance was 16% better than native Lustre at 128 nodes, and greater than 30% better at 1,024 nodes.
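For readers less familiar with the term, the idea behind fully synchronous data-parallel training can be sketched in a few lines of plain Python. This is a toy illustration only, not the Cray PE ML Plugin or TensorFlow API: the workers are simulated in a single process, `allreduce_mean` is a hypothetical stand-in for an MPI allreduce, and the one-parameter model and all function names are assumptions made for the example.

```python
import random

def local_gradient(w, shard):
    """Mean-squared-error gradient for the toy model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    """Stand-in for an MPI allreduce: every worker gets the same average."""
    avg = sum(values) / len(values)
    return [avg] * len(values)

def train(data, num_workers, steps=100, lr=0.5):
    # Each simulated worker owns an equal-sized slice of the training data.
    shards = [data[i::num_workers] for i in range(num_workers)]
    w = 0.0  # every worker holds an identical copy of the model
    for _ in range(steps):
        # 1. Each worker computes a gradient on its own shard.
        grads = [local_gradient(w, shard) for shard in shards]
        # 2. Gradients are averaged across all workers (the synchronous step).
        avg_grad = allreduce_mean(grads)[0]
        # 3. Every worker applies the same update, so the copies stay in sync.
        w -= lr * avg_grad
    return w

random.seed(0)
# Synthetic data generated from y = 3x, so the ideal weight is 3.0.
data = [(x, 3.0 * x) for x in (random.uniform(-1, 1) for _ in range(64))]

w_parallel = train(data, num_workers=8)
w_single = train(data, num_workers=1)
print(w_parallel, w_single)
```

Because the shards are equal in size, the averaged gradient matches the gradient a single worker would compute over the whole dataset, so both runs reach essentially the same weight. At scale, the averaging step is where communication cost concentrates, which is why the quality of the collective communication layer matters so much.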
I’d be remiss if I didn’t also mention the work performed by Intel to tune and optimize the Intel MKL-DNN libraries for the Intel® Xeon Phi™ processor nodes used on the Cori system. Much has been written in the industry about the use of GPUs for deep learning neural network training, but the fact is that CPUs — both multi-core and many-core — are more than adequate to perform CNN training, especially when paired with a scalable network infrastructure like the Cray-developed Aries™ network on the Cori XC series system.
This work was performed under the auspices of the NERSC Big Data Center at Lawrence Berkeley National Laboratory (LBNL), which Cray joined in 2017. The center focuses on projects around scientific applications using big datasets (~100 TB) and large-scale compute (~100,000 processing cores). On the Cray side, the team included Dr. Peter Mendygral, Dr. Michael Ringenburg, Dr. Diana Moise and Dr. Kristyn Maschhoff. Great work, team!
Read more from NERSC: NERSC, Cray, Intel Harness the Power of Deep Learning to Better Understand the Universe.