In January, a team of Cray developers and researchers published a paper, “Recombination of Artificial Neural Networks,” on arXiv.org, highlighting the hyperparameter optimization (HPO) capability Cray announced in November. We cover their findings in this blog post.
Using a variety of high-performance computing systems and neural network models, the Cray team demonstrated that the hyperparameter optimization capabilities introduced in the Cray® Urika®-CS and Urika®-XC AI and analytics software suites improve both the time-to-accuracy and the final accuracy of machine learning models trained on Cray systems.
The table below, excerpted from the paper, highlights the improvements achieved using Cray’s HPO capability across a range of neural network types.
The What and Why of Hyperparameter Optimization
Hyperparameter optimization is the process of tuning the hyperparameters of a model. It essentially means searching through an enormous universe of possible combinations of hyperparameters (HP) for the set that optimizes the desired figure(s) of merit. Because the search space of HPs is often larger than the number of atoms in the observable universe, manual and even brute-force approaches are frequently intractable.
The reason for hyperparameter optimization is simple: without automated HPO, data scientists can fall into the trap of using “default” settings in the worst case, or settings based on limited intuition, experience, and/or empirical evidence in the best case. The problem is particularly acute when machine learning practitioners build new models, where a poor choice of hyperparameters may result in a model that does not converge at all.
Over time, a few standard approaches for working through the range of possibilities have appeared, including grid search, random search, Bayesian optimization, and evolutionary optimization. At Cray, we’ve implemented the standard grid and random search approaches as a baseline. We also offer our advanced HPO approach based on a genetic algorithm (GA). For intuition, it might help to think of genetic algorithms as “iterative, parallel, stochastic grid search with pruning” (to quote the paper).
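To make the two baseline approaches concrete, here is a minimal Python sketch of grid search and random search. It is an illustration only, not Cray's implementation: the objective `toy_score`, the hyperparameter names `lr` and `batch_size`, and the value ranges are all hypothetical stand-ins for a real training run.

```python
import itertools
import random

def toy_score(lr, batch_size):
    """Hypothetical figure of merit (higher is better), peaking at lr=0.01, batch_size=64."""
    return -((lr - 0.01) ** 2) * 1e4 - ((batch_size - 64) / 64) ** 2

def grid_search(grid):
    """Evaluate every combination in the grid and return (best_score, best_hps)."""
    names = list(grid)
    best = None
    for values in itertools.product(*(grid[n] for n in names)):
        hps = dict(zip(names, values))
        score = toy_score(**hps)
        if best is None or score > best[0]:
            best = (score, hps)
    return best

def random_search(space, trials, seed=0):
    """Sample `trials` random points and return (best_score, best_hps)."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        hps = {
            # Sample the learning rate log-uniformly, a common (assumed) choice.
            "lr": 10 ** rng.uniform(*space["log10_lr"]),
            "batch_size": rng.choice(space["batch_size"]),
        }
        score = toy_score(**hps)
        if best is None or score > best[0]:
            best = (score, hps)
    return best

grid = {"lr": [0.001, 0.01, 0.1], "batch_size": [32, 64, 128]}
space = {"log10_lr": (-3.0, -1.0), "batch_size": [32, 64, 128]}
grid_best = grid_search(grid)                 # 9 evaluations, fixed points
random_best = random_search(space, trials=9)  # 9 evaluations, sampled points
```

Both searches spend the same budget of nine evaluations here, and neither uses the results of earlier evaluations to decide where to look next. That missing feedback is what the smarter approaches add.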
To learn more, I asked Aaron Vose, the lead developer of our genetic algorithm approach, to answer some questions about hyperparameter optimization, genetic algorithms, our implementation of Population Based Training (PBT), and the results reported in the arXiv paper.
Aaron is one of many developers and researchers working at Cray on problems related to AI and HPC, where he draws on a multidisciplinary background in biology and computation. He worked as a research assistant and research associate in the ecology and evolutionary biology department at the University of Tennessee in Knoxville, and later as a software engineer at the Cray Center of Excellence located at the Oak Ridge National Laboratory, before joining our AI R&D team.
Can you describe what a genetic algorithm is?
Genetic algorithms are bio-inspired techniques commonly applied to search and optimization problems. They typically utilize some combination of nature-inspired operators, e.g., mutation, crossover, and selection. A genetic algorithm evolves a population of candidate solutions over time, measured in generations, to improve some figure of merit (FoM).
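As a rough illustration of that loop, the Python sketch below evolves a small population using selection, crossover, and mutation. It is a generic textbook-style example, not the algorithm from the paper: the figure of merit `fom`, the truncation selection, the uniform two-parent crossover, and every numeric setting are assumptions chosen to keep the example short.

```python
import random

def fom(genome):
    """Hypothetical figure of merit: higher is better, maximum 0 when every gene is 0.5."""
    return -sum((g - 0.5) ** 2 for g in genome)

def crossover(a, b, rng):
    """Uniform crossover: each gene is inherited from one of the two parents."""
    return [rng.choice(pair) for pair in zip(a, b)]

def mutate(genome, rng, rate=0.2, scale=0.1):
    """Perturb each gene with probability `rate`, keeping values inside [0, 1]."""
    return [
        min(1.0, max(0.0, g + rng.gauss(0, scale))) if rng.random() < rate else g
        for g in genome
    ]

def evolve(pop_size=20, genes=4, generations=30, seed=0):
    """Return the best genome and the best FoM recorded at each generation."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(genes)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fom, reverse=True)
        history.append(fom(population[0]))
        # Selection: keep the better half of the population unchanged.
        survivors = population[: pop_size // 2]
        # Reproduction: refill the population with mutated offspring of survivors.
        children = [
            mutate(crossover(*rng.sample(survivors, 2), rng), rng)
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    population.sort(key=fom, reverse=True)
    history.append(fom(population[0]))
    return population[0], history

best_genome, best_per_generation = evolve()
```

Because the survivors are carried over unchanged, the best FoM in this sketch never gets worse from one generation to the next.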
Why are genetic algorithms a good fit for HPO of deep neural networks?
While a number of search algorithms can be applied to HPO in the deep learning space, each comes with advantages and disadvantages. Simple approaches (e.g., random search and grid search) are easy to use and understand but often spend a lot of computation time searching in undesirable locations of the hyperparameter space. Thus, “smarter” search algorithms — algorithms that can prune undesirable regions of the HP search space — are strongly desired.
There are many qualities and features of HPO algorithms which are often important, including but not limited to the ability to:
• Prune undesirable regions of search space
• Support categorical hyperparameters in addition to integer and continuous hyperparameters
• Parallelize the HPO algorithm and take advantage of HPC scales
• Create a hyperparameter schedule across stochastic gradient descent (SGD) epochs, providing optimized hyperparameters for many points in the training process
• Support optimization of different FoMs such as final accuracy, training time to accuracy, inference time with accuracy, etc.
Genetic algorithms can support all these features and more. They are incredibly generic, and can thus support any hyperparameter one can reasonably define. Similarly, as long as you can eventually map the FoM you want to optimize to a single real-valued number, you can optimize that FoM.
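Mapping a FoM to a single real-valued number might look like the Python sketch below. The `TrainingResult` record, the 0.90 accuracy target, and the scoring rules are hypothetical; they only show how two different goals (final accuracy and training time to accuracy) can each be reduced to one number where higher is better.

```python
from dataclasses import dataclass

@dataclass
class TrainingResult:
    accuracy: float        # final validation accuracy in [0, 1]
    train_seconds: float   # wall-clock training time in seconds

def final_accuracy_fom(result):
    """FoM for 'best final accuracy': the accuracy itself."""
    return result.accuracy

def time_to_accuracy_fom(result, target=0.90):
    """FoM for 'fastest run that reaches the target accuracy'.

    Runs that miss the target score below zero (closer misses score higher);
    runs that reach it score above zero, with faster runs scoring higher.
    """
    if result.accuracy < target:
        return result.accuracy - 1.0
    return 1.0 / result.train_seconds

fast = TrainingResult(accuracy=0.91, train_seconds=100.0)
slow = TrainingResult(accuracy=0.95, train_seconds=400.0)
missed = TrainingResult(accuracy=0.50, train_seconds=10.0)
ranked = sorted([slow, missed, fast], key=time_to_accuracy_fom, reverse=True)
```

Under the time-to-accuracy FoM the fast run ranks first, while under the final-accuracy FoM the slow run does. The search algorithm itself does not need to change, only the function that produces the number.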
Are genetic algorithms limited to a particular type of neural network model or system?
No. We’ve shown that the exact same genetic algorithm, with essentially the same settings, can successfully optimize a diverse set of neural network architectures, including convolutional neural networks, recurrent neural networks, dense neural networks, and capsule networks.
Not only that, but the HPO capability is portable and can scale to hundreds of CPU- and GPU-based compute nodes on HPC clusters and supercomputers. In our paper, we demonstrate the use of our genetic algorithm to optimize hyperparameters, including neural network topology, on the CANDLE (CANcer Distributed Learning Environment) benchmark, the RPV neural network trained on Pythia+Delphes ATLAS data, and LeNet trained on MNIST. We also demonstrate the use of our genetic algorithm with PBT to find an optimized training schedule for ResNet-20 on CIFAR-10, Capsule Net on CIFAR-10, and a neural machine translation (NMT) network trained to translate from Vietnamese to English using a sequence-to-sequence (Seq2Seq) recurrent neural network (RNN) architecture.
Are genetic algorithms unique to Cray?
No. Researchers have been writing about genetic algorithms for neural network hyperparameter optimization for a while. What we have done is develop a genetic algorithm that supports advanced features which are frequently missing from competing approaches.
What are some of the unique features in the Cray genetic algorithm?
Our genetic algorithm provides advanced features such as the formation of children from three parents during sexual reproduction (i.e., crossover). One parent provides parameters and two provide hyperparameters. This allows optimization of both parameters and hyperparameters at the same time, and it provides an increased speed of adaptation and a greater ability to shed deleterious genes from the population. This all translates to a genetic algorithm that requires fewer computational resources while providing improved HPO results.
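The three-parent idea can be sketched in a few lines of Python. Only the division of roles (one parent supplies parameters, two supply hyperparameters) comes from the description above. The dictionary layout, the function name `three_parent_child`, and the rule of picking each hyperparameter at random from one of the two hyperparameter parents are assumptions made for illustration, and may differ from what the paper does.

```python
import copy
import random

def three_parent_child(weight_parent, hp_parent_a, hp_parent_b, rng):
    """Build a child from three parents.

    The child inherits trained parameters (weights) from `weight_parent` and
    a per-hyperparameter mix of values from `hp_parent_a` and `hp_parent_b`.
    """
    child_hps = {
        name: rng.choice(
            (hp_parent_a["hyperparameters"][name], hp_parent_b["hyperparameters"][name])
        )
        for name in hp_parent_a["hyperparameters"]
    }
    return {
        # Deep copy so later training of the child does not alter the parent.
        "parameters": copy.deepcopy(weight_parent["parameters"]),
        "hyperparameters": child_hps,
    }

# Hypothetical individuals: tiny weight lists and two hyperparameters each.
p1 = {"parameters": [0.1, -0.3], "hyperparameters": {"lr": 0.01, "dropout": 0.5}}
p2 = {"parameters": [0.7, 0.2], "hyperparameters": {"lr": 0.10, "dropout": 0.1}}
p3 = {"parameters": [-0.4, 0.9], "hyperparameters": {"lr": 0.05, "dropout": 0.3}}
child = three_parent_child(p1, p2, p3, random.Random(0))
```

In this sketch the child starts from the weights of `p1` rather than from scratch, so training already done is reused while the hyperparameters are recombined.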
Machine learning practitioners should be aware of PBT (Population Based Training), a hyperparameter optimization technique that came out of DeepMind around the end of November 2017. While their initial paper doesn’t seem to call it out as such, PBT appears to belong to a class of formal genetic algorithms. One of the most interesting aspects of PBT approaches is that they can: 1) optimize hyperparameters across SGD epochs to give a training schedule over time, and 2) optimize both hyperparameters and parameters (remember that at a minimum, viability selection is applied to a combined set of parameters and hyperparameters). Cray’s genetic algorithm supports running in PBT mode, adding advanced genetic algorithm features beyond what is available in vanilla PBT, including support for sexual reproduction utilizing crossover of three parents; the original PBT approaches provide only asexual reproduction with no recombination at all.
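For readers who have not seen PBT before, here is a toy Python sketch of the asexual, vanilla style of PBT: members train in parallel, the worst performers periodically copy parameters and hyperparameters from the best (exploit), and the copied hyperparameters are perturbed (explore). It is not Cray's PBT mode and has no crossover. The one-parameter "model", the 20% truncation cut, the 0.8/1.2 perturbation factors, and the learning-rate cap are all assumptions chosen for the toy problem.

```python
import random

def train_step(member):
    """One gradient-descent step on the toy loss theta**2."""
    member["theta"] -= member["lr"] * 2.0 * member["theta"]

def score(member):
    """Higher is better: negative toy loss."""
    return -(member["theta"] ** 2)

def pbt(pop_size=10, rounds=20, steps_per_round=5, seed=0):
    """Run toy PBT and return the best member, including its lr schedule."""
    rng = random.Random(seed)
    population = [
        {"theta": 1.0, "lr": 10 ** rng.uniform(-4, -1), "schedule": []}
        for _ in range(pop_size)
    ]
    cut = max(1, pop_size // 5)  # size of the top and bottom groups
    for _ in range(rounds):
        for member in population:
            for _ in range(steps_per_round):
                train_step(member)
            member["schedule"].append(member["lr"])  # lr used in this round
        population.sort(key=score, reverse=True)
        for loser in population[-cut:]:
            winner = rng.choice(population[:cut])
            # Exploit: inherit the winner's parameters and training history.
            loser["theta"] = winner["theta"]
            loser["schedule"] = list(winner["schedule"])
            # Explore: perturb the inherited hyperparameter. The 0.5 cap keeps
            # this toy problem stable and is not part of PBT itself.
            loser["lr"] = min(0.5, winner["lr"] * rng.choice((0.8, 1.2)))
    return max(population, key=score)

best = pbt()
```

The `schedule` list on the returned member records the learning rate used in each round along its lineage, which is the kind of hyperparameter schedule over training time that point 1) above refers to.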
The team included Aaron Vose, Jacob Balma, Alex Heye, Alessandro Rigazzi, Charles Siegel, Diana Moise, Benjamin Robbins, and Rangan Sukumar.