Like you, we’re working out how to approach artificial intelligence (AI). We all know it’s here, it’s growing, and it’s not a passing trend. We could debate all day over how fast and how far AI will spread. But let’s set that discussion aside and focus on what we’re actually seeing when we’re out talking to customers and prospects — and then what we’re doing with that information.
What we’re seeing isn’t entirely what we anticipated when we announced our Cray Accel AI fast start solutions last November.
At that point, we knew we had customers whose AI plans ranged from “just starting out” with machine learning (ML) and deep learning (DL) to those running production-level AI applications with concerns about future growth. Our customers were telling us that they needed choices from small to large.
In response, we launched our Cray Accel AI offerings. These are configurations of our Cray® CS-Storm™ supercomputer coupled with software specifically meant for AI, ML and DL. They’re packed with AI-enabling features — eight NVIDIA® Tesla® V100 GPU accelerators powered by NVIDIA® Volta™ GPU architecture and a deep learning environment from Bright Computing that includes TensorFlow™, MXNet, Caffe2, Chainer, Microsoft Cognitive Toolkit and more.
What makes these solutions so well suited to AI/DL today is that they’re architected so you can add to them incrementally. You can start small with a pilot system for model development and testing and build on it all the way through to a complete production solution for data preparation, model development, training, validation and inference.
But here’s what we learned…
Companies know they need to bring AI into their business processes, but they don’t yet know to what extent. They’re being told AI will change the way they do 30 to 50 percent of their work in the coming years (maybe two, maybe five, maybe ten). That lack of clarity about what AI will mean for them makes the question difficult to approach: How much do I invest? When? What will it look like?
While some users are well on their way with AI and are setting up production-level equipment, they’re not as prevalent as the popular press would have you believe. The majority wants and needs to start small.
It turns out our definition of small wasn’t small enough.
Part of the challenge with AI — deep learning in particular — is matching the system to the models being used for a particular use case. On the surface, a system that features eight GPUs is a processing beast able to handle any use case thrown at it. In reality, developing a deep learning model is a bit of an art, with the artist — the data scientist — mixing and matching data and models in an effort to achieve accurate results in a timely fashion. And sometimes the balance between I/O and compute power isn’t quite right: there’s too little I/O bandwidth to keep that much compute fed.
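To make that I/O-versus-compute imbalance concrete, here’s a rough back-of-envelope sketch. All of the numbers below are hypothetical illustrations, not Cray or NVIDIA specifications: it simply asks how many GPUs a given storage bandwidth can actually keep busy for a given training workload.

```python
# Back-of-envelope check: can the storage feed the GPUs?
# All figures are illustrative assumptions, not measured specs.

def gpus_supported(storage_gbps, sample_mb, samples_per_sec_per_gpu):
    """Estimate how many GPUs a storage link can keep busy.

    storage_gbps            -- sustained storage bandwidth, GB/s
    sample_mb               -- size of one training sample, MB
    samples_per_sec_per_gpu -- samples each GPU consumes per second
    """
    # Bandwidth one GPU demands, in GB/s
    demand_gbps_per_gpu = sample_mb * samples_per_sec_per_gpu / 1000.0
    return storage_gbps / demand_gbps_per_gpu

# Hypothetical example: 0.5 MB samples, 2,000 samples/sec per GPU,
# fed by a 4 GB/s storage link.
print(gpus_supported(4.0, 0.5, 2000))  # -> 4.0
```

Under these made-up numbers, that storage link saturates at four GPUs — on an eight-GPU node, half the compute would sit idle waiting for data. That’s exactly the kind of mismatch a data scientist discovers only once the model and dataset are in hand.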
So what did we do?
We added a new 4-GPU version of the CS-Storm 500NX system. With support for two CPUs and four GPUs, this system is well suited for applications ranging from deep learning neural network training and inference to HPC applications like reservoir simulation and cryo-electron microscopy. Adding a smaller form factor gives companies the flexibility to choose the exact node configuration they need.
Adding a smaller configuration option isn’t groundbreaking. We know that. Some deep learning use cases require a sledgehammer (a cluster using high-density GPU nodes with eight or ten GPUs per node) while others require a club hammer (a cluster using low-density GPU nodes with four GPUs per node). The key here is matching tools with tasks.
So if you’re going to do DL — and you will — understand this: You need a system that grows as your use grows, and a system that matches the unique characteristics of your application design.