This week at the SC18 supercomputing event, one of the exciting areas we’re highlighting is how Cray continues to expand its artificial intelligence portfolio to support researchers, data scientists and IT teams as machine and deep learning become core to their everyday missions. As we’ve spoken to organizations starting out with AI, developing models, doing real research, or implementing production systems, we’ve noticed a few themes:
• Machine learning is as much an art as it is a science, and a time-consuming endeavor at that. Automation tools that help the data scientist get to the optimal solution faster are much appreciated.
• No single “right” tool set exists. Some teams prefer TensorFlow, some PyTorch, some Apache Spark™ and some use Python- or R-based tools. People use what they want to use.
• Keeping it simple is the killer app. Making it complex is the app killer.
To address some of these common themes, we’re adding capabilities to our Cray® Urika® suites (Urika-XC and Urika-CS), expanding the environments that are available for our Cray® CS™ and XC™ series supercomputers, and joining the NGC Ready program.
New Capabilities for the Data Science Artist
In machine learning, many decisions and options go into defining a model to solve a problem. The data scientist’s task is to find the most appropriate model and hyperparameters (settings) for the problem at hand.
Today, Cray is adding support in our Urika AI & Analytics suites for distributed hyperparameter optimization (HPO). We are adding two commonly used search strategies and two advanced capabilities to our distributed training framework. Coupled with our Cray PE ML plugin or Horovod for distributed model training, this gives us what we believe is the most advanced set of capabilities for machine learning and deep learning model development at scale.
To be clear, hyperparameter optimization isn’t a novel concept. Data scientists always try to determine the best settings for the hyperparameters that define the model design and training process. Commonly, they start from default values or intuition and arrive at adequate settings through cumbersome trial and error. What often limits experimentation is the time required to train a model given the resources available. By adding HPO to our distributed training framework, we are putting the power of Cray supercomputing into the hands of the data scientist.
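The trial-and-error loop described above can be automated. As a minimal sketch, the Python below shows two widely used search strategies, grid search and random search, run against a toy objective. It is a generic illustration and not the Urika API: `validation_loss`, `grid_search`, `random_search` and the value ranges are all hypothetical names and numbers chosen for the example, and in practice the objective would be a full training run.

```python
import itertools
import random


def validation_loss(learning_rate, batch_size):
    """Hypothetical stand-in for training a model and measuring validation loss."""
    return (learning_rate - 0.01) ** 2 * 1e4 + abs(batch_size - 64) / 64


def grid_search(grid):
    """Evaluate every combination of the listed values and keep the best."""
    names = list(grid)
    trials = []
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        trials.append((validation_loss(**params), params))
    return min(trials, key=lambda trial: trial[0])


def random_search(num_trials, seed=0):
    """Sample settings at random from ranges instead of a fixed grid."""
    rng = random.Random(seed)
    trials = []
    for _ in range(num_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform sample
            "batch_size": rng.choice([16, 32, 64, 128, 256]),
        }
        trials.append((validation_loss(**params), params))
    return min(trials, key=lambda trial: trial[0])


# Illustrative search space; the values are assumptions for this sketch.
GRID = {"learning_rate": [0.001, 0.01, 0.1], "batch_size": [32, 64, 128]}
best_grid = grid_search(GRID)
best_random = random_search(num_trials=9)
print("grid:", best_grid)
print("random:", best_random)
```

Each trial in both strategies is independent of the others, which is why this kind of search is a natural fit for running many trials in parallel on a large system.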
We added two commonly used HPO search strategies (random and grid) and, Cray being the boundary pusher it is, we also included a Cray-developed HPO search strategy based on a genetic algorithm, along with tooling for a training technique known as population based training (PBT; see the Google DeepMind blog post on this approach). We are excited to provide a PBT technique powered by our genetic search algorithm because we believe that these strategies, used individually or together, make the Urika suite the most complete integrated set of HPO strategies available today. For example, a data scientist looking for a smaller model, say for an embedded end-point deployment, can use our genetic strategy with the PBT approach to search a model space and find a model with accuracy similar to that of a larger model suited for server deployments.
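To give a feel for how population based training works in general, here is a minimal Python sketch of its exploit-and-explore loop on a toy problem. It is not Cray's genetic search strategy or the Urika tooling. The objective, the function names (`loss`, `train`, `pbt`), the population size and the mutation factors are all assumptions made for illustration.

```python
import random

LR_MIN, LR_MAX = 1e-4, 0.9  # bounds that keep the toy gradient steps stable


def loss(w):
    """Toy objective standing in for a validation metric; minimum at w = 3."""
    return (w - 3.0) ** 2


def train(member, steps):
    """Run a few gradient-descent steps with the member's own learning rate."""
    for _ in range(steps):
        member["w"] -= member["lr"] * 2.0 * (member["w"] - 3.0)


def pbt(population_size=8, rounds=20, steps_per_round=5, seed=0):
    rng = random.Random(seed)
    # Every member starts from the same weight with a different learning rate.
    population = [
        {"w": 0.0, "lr": 10 ** rng.uniform(-3, -0.5)}
        for _ in range(population_size)
    ]
    for _ in range(rounds):
        for member in population:
            train(member, steps_per_round)
        population.sort(key=lambda m: loss(m["w"]))  # best members first
        half = population_size // 2
        for loser in population[half:]:
            winner = rng.choice(population[:half])
            # Exploit: copy the weights of a better-performing member.
            loser["w"] = winner["w"]
            # Explore: mutate the copied learning rate, then clamp it.
            mutated = winner["lr"] * rng.choice([0.8, 1.2])
            loser["lr"] = min(max(mutated, LR_MIN), LR_MAX)
    return min(population, key=lambda m: loss(m["w"]))


best = pbt()
print("best learning rate:", best["lr"], "loss:", loss(best["w"]))
```

Unlike grid or random search, the members here share progress during training: weaker members restart from a stronger member's weights with slightly altered hyperparameters, so the hyperparameters can change over the course of a single run.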
The Right Tools for the Data Scientist
When we first introduced the Urika suites, we deliberately chose TensorFlow, Apache Spark and the Anaconda Python tool set because of their popularity. We are adding support for three more popular open-source frameworks: Keras, PyTorch and Horovod.
Keras and PyTorch are obvious choices, as they regularly appear at or near the top of most lists of AI frameworks (take a look at this Dataquest article on the top 20 Python AI and machine learning open-source projects). Horovod may seem a less obvious choice, but we selected it because it works very well on Cray CS-Storm GPU-accelerated systems.
Keeping It Simple
The state of the art for AI and HPC moves very quickly. Systems such as our Cray CS series are ready to tackle the biggest compute jobs, but there is complexity and overhead associated with setting up, managing, and maintaining software stacks. It’s one reason why a container-based approach to software deployment has become so popular and also why we created our Urika-CS AI & Analytics suite.
Our friends at NVIDIA recognize the same problem and are taking a similar approach for software deployment of GPU-accelerated applications.
For Cray customers looking to use a GPU-accelerated system, we are partnering with NVIDIA on their new NGC Ready program. Our customers will be able to run GPU-accelerated software from the NVIDIA GPU Cloud (NGC) container registry with confidence on our CS-Storm 500NX system.
The NGC container registry is a cloud-based catalog of GPU-accelerated software featuring ready-to-run containers for AI and HPC that are tuned, tested and optimized across the stack to take full advantage of NVIDIA GPUs. These containers simplify and accelerate your projects, helping you get the most from your hardware investment.
The NGC container registry provides performance-engineered containers for many popular AI and HPC software packages including NVIDIA TensorRT™, RAPIDS™, NAMD, GROMACS, ParaView, NVIDIA IndeX, NVIDIA Holodeck and many more.
If you are in Dallas this week attending the SC18 conference, please come by our Cray booth #2413. We’d be happy to show off our HPO capability, as well as answer any questions you might have about our Urika suites and NGC Ready participation.