Today Cray is announcing our next-generation “Shasta™” supercomputer, featuring our latest generation of scalable interconnect, code-named “Slingshot.” As a long-time network architect, I’m pretty excited.
Slingshot is our 8th major generation of scalable HPC network, and there have been some great milestones along the way. We started back in 1992 with the Cray T3D, Cray's first massively parallel system. Implemented in BiCMOS, its network latency was just 12 ns per hop. That was followed in 1996 by the pioneering Cray T3E system, which had the first-ever implementation of adaptive routing in an HPC network (by a long shot!). The SeaStar network in 2004 ushered in Cray's XT line of MPPs. In 2005, Cray pioneered the design of high-radix switches: our YARC switch for the Cray X2 implemented 64 ports using a unique tiled architecture, enabling the creation of very low-diameter networks. Most recently, the Aries™ network shipping in our current XC™ line of supercomputers was the first to implement the dragonfly topology, providing significant latency and cost reductions. Others have slowly picked up on these innovations over the years, but Slingshot is breaking new ground again.
Slingshot has a crop of new features aimed at data-centric HPC and AI workloads. It starts with extremely high bandwidth: 12.8 Tb/s per direction per switch, from 64 ports running at 200 Gb/s each. All those ports allow us to build very large networks with very low diameter. Slingshot implements the dragonfly topology, which Cray invented back in 2008. We can create networks with over a quarter million endpoints, with a diameter of just three network hops, and only one of those hops needs to use an (expensive) optical cable. This reduces latency, but it also reduces cost: the Shasta network requires half (or, for larger systems, one quarter) the number of optical cables that a fat-tree needs to provide the same level of global bandwidth.
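The quarter-million-endpoint figure falls out of simple dragonfly arithmetic. Here's a back-of-the-envelope sketch; the specific port split (16 endpoint, 31 local, 16 global) is an illustrative balanced-dragonfly assumption, not the actual Slingshot configuration:

```python
# Back-of-the-envelope sizing for a balanced dragonfly network built
# from 64-port switches. The port split below is illustrative, not
# the actual Slingshot configuration.

p = 16            # endpoint (injection) ports per switch
h = 16            # global (optical) ports per switch
a = 32            # switches per group, so a-1 = 31 local ports

ports_used = p + h + (a - 1)
assert ports_used <= 64       # fits within a 64-port switch

groups = a * h + 1            # one global link to every other group
endpoints = groups * a * p

print(ports_used, groups, endpoints)   # 63 513 262656
```

With a split like this, any two endpoints are at most three switch-to-switch hops apart: a local hop in the source group, one global (optical) hop, and a local hop in the destination group.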
That provides great peak price-performance, but it’s how Slingshot uses that bandwidth that really sets it apart. As students of interconnects understand, the latency that matters isn’t fall-through latency on an idle network (don’t get me started on the ping pong benchmark!); it’s latency under load, at scale, which primarily comes from queueing latency. So the key to low latency is avoiding congestion and queueing in the network.
One way Slingshot avoids congestion is with adaptive routing. We've been refining this for more than twenty years now, and Slingshot's very low network diameter allows extremely responsive adaptive routing; each switch has a good view of the overall state of the network, so it can make fast, well-informed decisions about optimal paths to take to avoid temporary congestion. The network can optionally preserve packet ordering while adapting routes, or provide even higher performance by allowing each packet to adapt separately. The adaptive routing in Slingshot lets us sustain well north of 90% utilization, even at large scale, for well-behaved workloads.
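The flavor of such a per-packet decision can be sketched with the textbook UGAL-style heuristic; to be clear, this is a generic illustration of adaptive routing on a dragonfly, not Cray's actual (proprietary) algorithm:

```python
import random

def choose_route(minimal_queue, nonminimal_queues, hop_penalty=2):
    """UGAL-style adaptive routing decision (textbook sketch, not
    Slingshot's actual algorithm). Compare the queue depth on the
    minimal path against a randomly sampled non-minimal candidate,
    weighted by the extra hops the detour would take."""
    candidate = random.choice(nonminimal_queues)
    # Take the detour only if the minimal path looks congested enough
    # to outweigh the non-minimal path's extra hop count.
    if minimal_queue > hop_penalty * candidate:
        return "non-minimal"
    return "minimal"

# Lightly loaded minimal path: stay on the shortest route.
print(choose_route(minimal_queue=1, nonminimal_queues=[2, 3, 4]))
# Hot minimal link: route around the temporary congestion.
print(choose_route(minimal_queue=50, nonminimal_queues=[3]))
```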
But there are some cases where adaptive routing can actually hurt performance. When a set of nodes sends too much traffic to an endpoint, exceeding its egress bandwidth, traffic can back up into the network over a branching tree of links targeting that endpoint (known as tree saturation). In this case, adaptive routing can make matters worse, by causing blocked traffic to spread to other links in a futile attempt to route around the congestion. This is a classic network problem, and all current HPC networks (including current and past Cray networks) are vulnerable to it. Datacenter Ethernet networks attack the problem by dropping packets when local congestion rises above some threshold, and then using some form of congestion control to try to stop the transmission of excess bandwidth into the network. But dropping packets isn’t acceptable in HPC networks (it creates very high software overheads to detect and retransmit packets, and causes terrible spikes in latency). And congestion control mechanisms (like ECN and QCN) tend to be fragile, hard to tune, slow to converge, and generally unsuitable for dynamic HPC workloads.
This is where Slingshot’s most important feature comes in. It implements an innovative congestion control mechanism that’s extremely responsive, requires no tuning, and is stable across a wide range of dynamic HPC workloads. What this means is that any source, or set of sources, that tries to send more data into the network than can be delivered, is very quickly told to back off to avoid wasting buffer space in the network. But more importantly, the backpressure affects only the offending data sources, not all the rest of the “innocent” traffic that might be sharing some links with the congestion-causing traffic, but is otherwise destined to an endpoint that can accept it. As the congestion clears, the participating sources are enabled to ramp up their transmission, at just the right bandwidth to keep the bottleneck link(s) fully utilized, without causing bubbles in the bandwidth. And all this happens automagically in hardware with zero software setup, using some incredibly sophisticated machinery that tracks everything going on in the network on a per-packet basis. It’s pretty cool.
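As a toy illustration of that behavior (the real mechanism lives in hardware and its details are proprietary), here is a model in which sources are scaled back only when their own destination's egress link is oversubscribed, leaving innocent flows untouched:

```python
# Toy model of endpoint-targeted backpressure: my illustration of the
# idea described above, not Slingshot's actual hardware mechanism.

def throttle(flows, egress_bw):
    """flows: list of (src, dst, offered_bw). Returns the allowed
    bandwidth per flow, scaling back only flows aimed at an
    oversubscribed endpoint (proportional share of its egress link)."""
    demand = {}
    for _, dst, bw in flows:
        demand[dst] = demand.get(dst, 0) + bw
    allowed = []
    for src, dst, bw in flows:
        if demand[dst] > egress_bw:        # endpoint oversubscribed
            allowed.append(bw * egress_bw / demand[dst])
        else:                              # "innocent" traffic untouched
            allowed.append(bw)
    return allowed

# Four sources incast onto node D (4x oversubscribed); an innocent
# flow to node E shares links with them but is never slowed.
flows = [("A", "D", 100), ("B", "D", 100), ("C", "D", 100),
         ("F", "D", 100), ("G", "E", 100)]
print(throttle(flows, egress_bw=100))  # [25.0, 25.0, 25.0, 25.0, 100]
```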
The result is that Slingshot provides highly effective performance isolation between workloads. No longer does the poorly written application cause congestion that interferes with other workloads on the system. And this means that latency variation is dramatically reduced in the network. Slingshot’s focus on bringing down tail latency (the latency that the slowest 1%, 0.1% or even 0.01% of packets experience) is key to making latency-sensitive and synchronization-heavy applications perform well. It can have a pretty dramatic impact on the performance of these applications.
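A little arithmetic shows why the tail dominates for synchronization-heavy codes: a barrier across N ranks completes only when the slowest rank arrives, so even rare slow packets land on the critical path almost every time. The 1% figure below is assumed for illustration, not measured data:

```python
# A barrier across N ranks finishes only when the slowest rank
# arrives, so rare slow packets are almost always on the critical
# path. Illustrative arithmetic with an assumed 1% tail probability.

def prob_barrier_hits_tail(p_slow, n_ranks):
    """Probability that at least one of n_ranks sees a slow packet,
    assuming each does so independently with probability p_slow."""
    return 1 - (1 - p_slow) ** n_ranks

for n in (10, 100, 1000, 10000):
    print(f"{n:6d} ranks: P(barrier delayed) = "
          f"{prob_barrier_hits_tail(0.01, n):.4f}")
```

At a thousand ranks, the barrier is delayed by the tail with near certainty, which is why shaving the 99.9th percentile matters far more than shaving the median.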
In fact, we believe that looking at tail latency in loaded networks, and at interference between applications, is important enough that network benchmarks really ought to measure those aspects explicitly. Cray is working with some customers on the creation of an enhanced set of network benchmarks that will provide a much better indication of how a network will perform under challenging real-world conditions. It wouldn’t make sense to estimate your workday commute time by driving the route at 4 a.m., and it certainly doesn’t make sense to evaluate network performance by exchanging packets on an otherwise idle network.
In addition to its adaptive routing and congestion control mechanisms, Slingshot also provides traffic classes that offer a great deal of flexibility in bandwidth shaping, priority, and routing policy. These can be used to create a set of overlaid virtual networks with widely varying attributes, allowing customization for control traffic, IO traffic, high- and low-priority compute traffic, and so on. This makes the network ideally suited for next-gen, data-centric systems running a dynamic mix of simulation, analytics, AI, and data-management workflows.
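One standard way to realize this kind of per-class bandwidth shaping is deficit round robin; the sketch below is the generic textbook scheduler, not a description of Slingshot's hardware:

```python
from collections import deque

def drr_schedule(queues, quanta, rounds):
    """Deficit-round-robin scheduler (textbook algorithm, shown only
    to illustrate per-class bandwidth shaping).
    queues: {cls: deque of packet sizes}; quanta: {cls: bytes/round}.
    Returns the (cls, size) pairs in transmission order."""
    deficit = {cls: 0 for cls in queues}
    sent = []
    for _ in range(rounds):
        for cls, q in queues.items():
            deficit[cls] += quanta[cls]          # earn this round's credit
            while q and q[0] <= deficit[cls]:    # spend it on packets
                pkt = q.popleft()
                deficit[cls] -= pkt
                sent.append((cls, pkt))
            if not q:
                deficit[cls] = 0                 # no banking while idle
    return sent

# High-priority control traffic gets 4x the bandwidth of bulk IO.
queues = {"control": deque([100] * 8), "io": deque([100] * 8)}
order = drr_schedule(queues, {"control": 400, "io": 100}, rounds=2)
print(order)   # 8 control packets and 2 io packets: a 4:1 share
```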
The Cray Shasta system also supports a wide variety of node types, with anywhere from one to 16 nodes per compute blade. Unlike our previous designs, Slingshot can support a variable number of injection ports per node, as well as optional tapering of global bandwidth, to flexibly match the cost and performance of the network to the target workloads.
Another increasingly important consideration for future data-centric systems is interoperability with storage and datacenters. To enhance interoperability, we made Slingshot Ethernet-compatible. Slingshot switches can connect directly to third-party Ethernet-based storage devices, and to datacenter Ethernet networks. Applications running on Shasta compute nodes can directly exchange IP/Ethernet traffic with the outside world, making it easier and more efficient to ingest data from external sources. But standard Ethernet has high overheads and is poorly suited for HPC workloads. For that reason, we developed an optimized HPC Ethernet protocol that has smaller headers, support for smaller packet sizes, credit-based flow control, reliable hardware delivery, and a full suite of HPC synchronization primitives. Slingshot uses the optimized protocol for internal communication, but can intermix standard Ethernet traffic on all ports at packet-level granularity. This best-of-both-worlds approach allows the Shasta system to comfortably straddle the supercomputing and datacenter worlds.
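To see why smaller headers and smaller packet sizes matter, compare the wire efficiency of standard Ethernet at HPC message sizes against a reduced-header format; the 16-byte overhead below is a hypothetical figure for illustration, not the actual Slingshot header size:

```python
# Wire efficiency of standard Ethernet vs. a reduced-header format.
# Standard per-frame overhead: preamble+SFD (8) + MAC header (14) +
# FCS (4) + inter-packet gap (12) = 38 bytes, and every frame occupies
# at least 84 bytes on the wire (64-byte minimum frame + preamble +
# gap). The 16-byte reduced overhead is a hypothetical illustration.

def efficiency(payload, overhead, min_wire=0):
    """Fraction of bytes on the wire that carry payload."""
    wire = max(payload + overhead, min_wire)
    return payload / wire

for payload in (8, 32, 64, 256):
    std = efficiency(payload, overhead=38, min_wire=84)
    hpc = efficiency(payload, overhead=16)
    print(f"{payload:4d}B payload: standard {std:.1%}, reduced {hpc:.1%}")
```

Small messages are the common case in tightly coupled HPC codes, and at an 8-byte payload standard Ethernet delivers under 10% wire efficiency, which is where a leaner header format pays off.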
The Shasta supercomputer represents a major advancement in the flexibility and capability of Cray supercomputers, and will be the basis for our converged architecture for simulation, analytics, AI and data management over the next decade or more. It needed a suitable interconnect as its backbone, and Slingshot will do the trick nicely!