Deep Learning Invades HPC
While many algorithms are commonly referred to as “machine learning” (ML) or “artificial intelligence” (AI), deep learning with neural networks (NNs) has dominated the attention of the ML industry in recent years. Though numerous alternatives exist – including support vector machines, Bayesian classifiers, genetic algorithms, clustering techniques, and even decision trees – NNs have seen the most rapid gains in real-world effectiveness.
Continued improvements in computing hardware help propel the ongoing expansion in the use of NNs by many industries. In fact, the demand for larger and more-powerful neural networks motivates many to leverage the unique scaling advantages provided by high-performance computing (HPC), including Cray’s high-end clusters and supercomputers.
Specifically, scaling to small node counts is no longer sufficient for today’s larger NN workloads, let alone the workloads of the future. The machine learning industry is entering a historic inflection point, where a transition from tens and hundreds of nodes to much larger systems with thousands to tens of thousands of nodes will soon be required.
High-Level Neural Network Primer
NNs are particularly important for the task of “classification,” where an input item is fed into the NN, and the NN outputs the class(es) to which it predicts the input item belongs. As a concrete example, consider the task of mapping images of hand-written characters, such as those in the figure below, to their numerical ASCII values. This task follows the generic classification pattern, with the specific input-to-output mapping: written_char_image → NN → char_ASCII_value.
A very simple fully-connected NN with a 96x14x26 topology (96 input neurons, one hidden layer with 14 neurons, and 26 output neurons) can classify these kinds of hand-written letters with over 95% accuracy quite easily, with few training items and little training time. With a bit more care and a small number of additional neurons, an NN can complete this task with an accuracy closer to 98% or better.
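The forward pass of such a network is little more than two matrix-vector products. The sketch below, in NumPy, shows the 96x14x26 topology described above with random placeholder weights; a real network would learn its weights via backpropagation on labeled character images, and the sigmoid activation here is one common but assumed choice.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(96, 14))   # input -> hidden weights (untrained placeholders)
b1 = np.zeros(14)
W2 = rng.normal(scale=0.1, size=(14, 26))   # hidden -> output weights (untrained placeholders)
b2 = np.zeros(26)

def forward(pixels):
    """Map a flattened 96-pixel character image to 26 class scores."""
    hidden = sigmoid(pixels @ W1 + b1)
    return sigmoid(hidden @ W2 + b2)

x = rng.random(96)                           # stand-in for one flattened character image
scores = forward(x)
predicted_class = int(np.argmax(scores))     # index of the predicted letter
print(scores.shape, predicted_class)
```

The 26 output neurons correspond to the 26 letter classes; the predicted class is simply the output neuron with the highest score.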
NN Training Is Computationally Demanding and a Perfect Fit for HPC
While the example presented in the preceding section is quite simplistic, the general task of image classification is directly applicable to neural networks as actually used in industry, from manufacturing lines to self-driving cars. These real-world NNs often need to deal with high-resolution, multi-color images. Even more challenging, these images are often derived from video, where each frame needs to be processed.
Such large and growing datasets are pushing NN training toward resources which are part of the traditional HPC space. So, why are neural networks finding such a natural fit in the world of high-performance computing? There are three primary drivers for this trend:
- First, NN training utilizes lots of linear algebra, requiring massive amounts of floating-point operations (FLOPs). These needs are well met by the HPC space, with a long tradition of placing emphasis on floating-point performance.
- Second, training NNs on larger hardware systems hits scaling issues due to interconnect communication delays between nodes. The HPC space and Cray have traditionally placed a strong emphasis on very low-latency, high-bandwidth interconnects, which consistently outperform networks used in other markets.
- Third, as training datasets continue to grow rapidly in resolution as well as number of training items, a “big data” scenario emerges, requiring systems with high-performance parallel I/O capabilities. The HPC industry and Cray have a long tradition of successfully providing high-performance and high-capacity scalable parallel I/O.
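The first driver can be made concrete with a back-of-envelope FLOP count. The sketch below estimates the floating-point operations in one forward pass of a fully-connected layer; the high-resolution layer sizes are illustrative assumptions, not taken from any specific production model.

```python
def dense_layer_flops(n_in, n_out):
    # Each output neuron performs n_in multiplies plus n_in adds
    # (one multiply-accumulate per incoming weight).
    return 2 * n_in * n_out

# The tiny 96x14x26 example network from earlier:
tiny = dense_layer_flops(96, 14) + dense_layer_flops(14, 26)

# A hypothetical single layer fed a full-HD color image (1920x1080x3 inputs):
big = dense_layer_flops(1920 * 1080 * 3, 4096)

print(tiny)   # a few thousand FLOPs
print(big)    # tens of billions of FLOPs, for one layer of one image
```

Multiplying the per-image cost by millions of training images and many training epochs shows why floating-point throughput, the traditional strength of HPC systems, matters so much.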
Where Is the Current Trajectory Likely to Send Us in the Future?
Unsupervised learning is sure to see increased representation in the future. Pretraining with unsupervised learning can make use of unlabeled data items, which lack matching classification information. Recall that supervised training requires examples paired with the correct classification information, allowing training to correct the NN’s internal weights whenever it mispredicts. With unsupervised (pre)training, an NN can learn that a set of inputs resembles what it is likely to see in the future, but it has no way of knowing to which class(es) those inputs belong. While this lack of correct class information is a limitation, the primary advantage of unsupervised learning is that unlabeled data is often far more plentiful than labeled data. This moves pretraining into the “big data” regime.
Consider pretraining an NN with a restricted Boltzmann machine (RBM) and a huge unlabeled dataset, then doing the final training with a smaller set of labeled data. The pretraining step with the RBM will extract features that are statistically well-represented in the unlabeled data, and final training will make use of these high-quality features, increasing final accuracy.
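The RBM pretraining step is typically driven by contrastive divergence. The sketch below shows one CD-1 update for a binary RBM on a single unlabeled item; the hidden-layer size, learning rate, and omission of bias terms are simplifying assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 96, 32, 0.1
W = rng.normal(scale=0.01, size=(n_visible, n_hidden))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0):
    """One contrastive-divergence (CD-1) step on a binary visible vector."""
    # Positive phase: hidden probabilities given the data.
    h0 = sigmoid(v0 @ W)
    # Negative phase: sample hiddens, reconstruct visibles, re-infer hiddens.
    h_sample = (rng.random(n_hidden) < h0).astype(float)
    v1 = sigmoid(W @ h_sample)
    h1 = sigmoid(v1 @ W)
    # Update pushes W toward the data statistics and away from the
    # model's own reconstruction statistics.
    return lr * (np.outer(v0, h0) - np.outer(v1, h1))

v = (rng.random(n_visible) < 0.5).astype(float)  # stand-in unlabeled item
W += cd1_update(v)
print(W.shape)
```

Note that no labels appear anywhere in the update: the RBM learns purely from the statistics of the unlabeled inputs, which is exactly what lets pretraining exploit the much larger pool of unlabeled data before supervised fine-tuning begins.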
The trend of scaling to larger node counts is inevitable. This transition represents a change in focus from optimization for a small number of fat nodes to a large number of thinner nodes. Part of this transition is related to the balance between compute node performance and interconnect communication performance. The best-trained NNs of the future will be those able to scale to supercomputer-class machines.