It’s hard to truly understand the capabilities of modern-day supercomputers, their constituent technologies and deeply evolved software stacks. Once in a while, a technologist (or a marketing person) will refer to the “Cray inside my cellphone” or compare IBM’s 1980s DASD storage (pronounced dasdee by old-timers) with today’s solid-state storage devices (likely found inside a cellphone as well), but those comparisons are typically made to marvel at today’s cellphones rather than to marvel at today’s supercomputers. (In fact, today’s supercomputers now have aggregate processing and storage capabilities that would easily translate to millions of those cellphone chips and SD cards.)
Impressive as that “millions of cellphone processing units” number may be, it raises a lot of questions. One in particular is not trivial: What holds those processing units together and allows them to work as one? (Even Seymour Cray had his doubts it could work, and joked, “If you were plowing a field, which would you rather use: two strong oxen or 1,024 chickens?”) What holds them together “electrically” is, of course, a high-performance interconnect, and the “glue” that makes them work together is dedicated communication software and libraries. But how does one describe those supercomputing capabilities in terms that nonspecialists will appreciate?
Let’s try to comprehend the internet for a perspective. According to Internet Live Stats, a lot is happening on the internet at any given time and in the aggregate. Indeed, every second, more than 7,000 tweets go out; more than 2.5 million emails are sent (mostly unread or junk, I’m assuming); more than 56,000 Google searches are processed; and more than 37,000 GB of data get moved. The latter number translates to about 370 4K movies downloaded per second, every second, or 32 million such movies downloaded per day. All that traffic takes place on a backbone built and maintained by Tier 1 carriers: many hulking, multi-shelf, industrial-refrigerator-sized routers that can each route data at close to 1,000 Tbps (terabits per second) or about 125 TB/s (terabytes per second).
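For readers who like to check the arithmetic, here is a minimal Python sketch of the movie conversion. The figure of 100 GB per 4K movie is an assumption; the article never states it, but it is the size implied by equating 37,000 GB per second with about 370 movies per second.

```python
# Back-of-the-envelope check of the internet traffic figures quoted above.
# GB_PER_4K_MOVIE is an assumption inferred from the article's own numbers.
GB_PER_4K_MOVIE = 100
SECONDS_PER_DAY = 24 * 60 * 60

internet_gb_per_second = 37_000  # from the article (Internet Live Stats)

movies_per_second = internet_gb_per_second / GB_PER_4K_MOVIE
movies_per_day = movies_per_second * SECONDS_PER_DAY

print(f"4K movies per second: {movies_per_second:,.0f}")
print(f"4K movies per day:    {movies_per_day:,.0f}")
```

The daily total comes out just under 32 million, which matches the rounded figure in the text.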
How do the interconnects of today’s supercomputers stack up against those metrics and that traffic? A data point came out of “Trinity,” the Cray supercomputer of the Los Alamos National Laboratory’s Advanced Simulation and Computing (ASC) program. First, Trinity (pictured) does not have those Tier 1 routers. The refrigerator-size cabinets you see in the picture mostly contain number-crunching processing equipment (in Trinity’s case, as of today, more than 6,000 Intel® Xeon Phi™ (aka KNL) processors). However, Trinity does have routers. Those are Cray-developed and relatively small Aries™ routers that, together with network interfaces and wiring, constitute the “backbone” for Trinity. In fact, Trinity has more than 1,500 of those Aries routers, one for every four servers. Each of those routers has 48 ports and provides about 500 GB/s of switching capacity. (That is five 4K movies per second for you movie buffs.) Of course, Trinity’s backbone, unlike the internet’s backbone, does not span the globe, which allows it to run quite a bit faster and to respond a lot quicker. Remarkably, though, Aries, just like the internet routers, does have adaptive and resilient packet routing with guaranteed delivery and other internet-like goodness.
OK. That’s the theory. How does it work in practice? The Los Alamos/Cray team recently ran some communication performance tests: close to 8,900 of those processors (8,894, to be exact) in 48 of those cabinets shown started micro-tweeting to each other in some of the most challenging communication patterns. (Actually, to make it even more challenging, each one of those KNL processors started 225 virtual broadcasting threads to all other processors’ virtual threads at the same time.) This was followed by a little bit of MapReduce™-style work and finished by syncing up.
What did those threads broadcast to each other? Think of those as micro-tweets: two to eight characters’ worth of data. In supercomputer communication lingo, there were at any given time (225 * 8,894 =) 2,001,150 “MPI ranks” taking part in MPI_Allreduce (8-byte float), MPI_Barrier, and MPI_Bcast (4-byte int) operations.
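For nonspecialists, it may help to see what those three collective operations leave behind on every rank. The Python sketch below is a toy, single-process model and not the benchmark that was run on Trinity: the function names, the eight-rank demo and the sample values are all hypothetical, and a real MPI library performs these steps across the network rather than over a Python list.

```python
# Toy model of the three MPI collectives named above. Each list index stands
# in for one MPI rank. This only illustrates the result each rank sees; it is
# not how an MPI library implements the operations.

THREADS_PER_PROCESSOR = 225   # from the article
PROCESSORS = 8_894            # from the article
total_ranks = THREADS_PER_PROCESSOR * PROCESSORS


def allreduce_sum(per_rank_values):
    """Every rank contributes one value; every rank receives the sum."""
    total = sum(per_rank_values)
    return [total] * len(per_rank_values)


def bcast(per_rank_values, root=0):
    """Every rank receives a copy of the root rank's value."""
    return [per_rank_values[root]] * len(per_rank_values)


def barrier(arrived_flags):
    """The barrier completes only once every rank has arrived."""
    return all(arrived_flags)


demo_ranks = 8  # hypothetical small count so the output stays readable
contributions = [float(rank) for rank in range(demo_ranks)]  # 8-byte floats
after_allreduce = allreduce_sum(contributions)
after_bcast = bcast([42] + [0] * (demo_ranks - 1), root=0)   # 4-byte int payload
all_arrived = barrier([True] * demo_ranks)

print(f"MPI ranks in the Trinity test: {total_ranks:,}")
print("after allreduce:", after_allreduce)
print("after bcast:    ", after_bcast)
print("barrier passed: ", all_arrived)
```

The point of the model is that each operation involves every rank at once, which is why these patterns become so demanding at two million ranks.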
(All of this was done via application software from a high-level program. By the way, this is very likely a record. It would be interesting to see numbers for the Chinese supercomputer “Sunway TaihuLight,” a more massive system than Trinity and the current No. 1 system on the TOP500 list, with 93 PF on the LINPACK benchmark.)
In about 5.6 ms, 2 million micro-tweets were sent out, read and processed, and receipt was acknowledged. Per second, that would translate to about 360 million micro-tweets. Compare that with the 7,000 tweets per second on the internet: Each tweet can be up to 140 characters, so at two characters per micro-tweet that makes for (generously) 500,000 micro-tweets per second across all of the web, which is about a factor of 700 short of Trinity’s performance. Of course, that’s not a fair comparison for many reasons (one of them being the span of the interconnect), but it does illustrate the kind of capabilities that a tightly coupled, special-purpose interconnect can bring to customers that require the highest modeling and simulation capabilities. For a much more in-depth description of the advantages of the Cray XC series Aries interconnect executing at scale, read our new white paper.
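The comparison above can be reproduced in a few lines of Python. This is only a sketch of the arithmetic. It assumes that the “generous” internet figure counts every tweet at the full 140 characters and splits it into the smallest, two-character micro-tweets; that is a reading of the numbers, not something the test report states.

```python
# Rough reproduction of the micro-tweet comparison.
# CHARS_PER_MICRO_TWEET = 2 is an assumption (the smallest size mentioned).
micro_tweets = 225 * 8_894           # one per MPI rank
elapsed_seconds = 5.6e-3             # about 5.6 ms, from the article

trinity_per_second = micro_tweets / elapsed_seconds

TWEETS_PER_SECOND = 7_000            # from the article
CHARS_PER_TWEET = 140
CHARS_PER_MICRO_TWEET = 2

internet_per_second = TWEETS_PER_SECOND * CHARS_PER_TWEET / CHARS_PER_MICRO_TWEET
ratio = trinity_per_second / internet_per_second

print(f"Trinity:  {trinity_per_second / 1e6:,.0f} million micro-tweets per second")
print(f"Internet: {internet_per_second:,.0f} micro-tweets per second")
print(f"Ratio:    about {ratio:,.0f}x")
```

With unrounded inputs the ratio lands a little above 700, consistent with the rounded figures quoted in the text.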