We felt that it was necessary to write a book reintroducing programming techniques that application developers should use when targeting current and future generations of supercomputers. While the techniques have been around for a long time, many of today’s developers are not aware of them. Let us explain.
The supercomputer has been a shifting target for application programmers ever since its inception in the form of Seymour Cray’s CDC 6600 in the mid-1960s, forcing developers to adapt to new approaches along with the ever-changing hardware and software systems. This need to adapt is especially conspicuous in the field of high-performance computing (HPC), where developers typically optimize for the target node architecture to squeeze out every last ounce of available performance. That pattern continued through the CDC 7600 and the vector-rich systems: the Cray-1, Cray X-MP, Cray Y-MP, and C90.
Then, the “attack of the killer micros” in the 1990s threw everything into a tizzy. With increasing node counts, application developers had to consider PVM, and then MPI, in their quest to parallelize their applications across the multitude of nodes built from commodity off-the-shelf (COTS) chips without any vector instructions. As can be seen in Figure 1, the COTS chips got faster as their clock cycle time decreased between 1995 and 2010. Because each new generation of chips ran existing code faster, application developers no longer had to be concerned about the node architecture.
With the advent of AVX256, AVX512, SVE, and GPUs in the last five years, vectors have begun to come back. More recently, many-core systems like Intel’s Knights Landing (KNL), as well as attached accelerators such as Nvidia’s line of GPUs, required a reexamination of the application to get better performance from the new, more powerful nodes. Since the compiler could not always vectorize and/or parallelize the application automatically, the application developer had to do something.
That something ranged from writing important kernels in specialized programming models like CUDA for the Nvidia GPUs, to using compiler directives to help the compiler translate the input user-level code into low-level vectorized code for the processors. Then there was the issue of using all those cores on the node, or of generating thousands of threads for the GPU. Because running MPI across all the cores on and off the nodes worked pretty well, the developers for the initial multicore systems didn’t really need to get into parallelizing with shared memory threads. However, threading on Nvidia accelerators was absolutely required. OpenACC and then OpenMP 4.5 were developed to provide a performance-portable solution for threading and vectorization for the GPU.
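To make the directive approach concrete, here is a minimal sketch in C using OpenMP directives. It is an illustration under stated assumptions, not code from the book: the function names axpy_host and axpy_target are hypothetical, and the loop is a simple "a times x plus y" update chosen only because it is easy to read. The first version asks for threads across the cores and vector instructions within each thread; the second uses the OpenMP 4.5 target construct to offload the same loop to an attached device, if one is available.

```c
#include <assert.h>

/* Hypothetical example: y[i] = a * x[i] + y[i] on the host.
 * "parallel for" spreads iterations across threads; "simd" asks the
 * compiler to vectorize each thread's share of the loop. The restrict
 * qualifiers promise the compiler that x and y do not overlap, which
 * removes a common obstacle to automatic vectorization. */
void axpy_host(int n, double a, const double *restrict x, double *restrict y)
{
    assert(n >= 0);
    #pragma omp parallel for simd
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Hypothetical example: the same loop offloaded with OpenMP 4.5.
 * The map clauses copy x to the device and copy y there and back.
 * If no device is present, the loop typically runs on the host. */
void axpy_target(int n, double a, const double *restrict x, double *restrict y)
{
    assert(n >= 0);
    #pragma omp target teams distribute parallel for simd \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

If the compiler is run without OpenMP enabled, the directives are ignored and both functions reduce to the same plain serial loop, which is part of the appeal of directives over a separate kernel language.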
Another revelation in new architectures is that memory hierarchies are becoming more complex, which matters especially if an application has a large memory footprint. Both Knights Landing and nodes with attached GPU accelerators have two levels of memory, which introduces new challenges that programmers must address. Figure 2 shows the differences between the memory hierarchies of KNL and the hosted GPU. If an application fits into the high-speed memory, then it will enjoy excellent memory performance. However, if the application requires more memory, the data set must be managed so that it flows between the two levels in an efficient and timely way.
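One common way to manage a data set that does not fit in the fast memory is to work on it one chunk at a time. The sketch below is an assumption-laden illustration in C, not the book’s method: scale_in_chunks is a hypothetical function, and the staging buffer is allocated with plain malloc as a stand-in for whatever high-bandwidth memory allocator the target system provides. The idea it shows is that each chunk is copied into the fast buffer once, reused for several passes, and then copied back.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical example: multiply every element of data by factor,
 * `passes` times, while staging one chunk at a time in a buffer sized
 * to fit the fast memory. Returns 0 on success, -1 if the staging
 * buffer cannot be allocated. */
int scale_in_chunks(double *data, size_t n, size_t chunk,
                    double factor, int passes)
{
    assert(chunk > 0);

    /* Assumption: malloc stands in for a fast-memory allocator. */
    double *fast = malloc(chunk * sizeof *fast);
    if (fast == NULL)
        return -1;

    for (size_t start = 0; start < n; start += chunk) {
        size_t len = (n - start < chunk) ? n - start : chunk;

        /* Move the chunk into the fast buffer once ... */
        memcpy(fast, data + start, len * sizeof *fast);

        /* ... reuse it for every pass while it is resident ... */
        for (int p = 0; p < passes; ++p)
            for (size_t i = 0; i < len; ++i)
                fast[i] *= factor;

        /* ... and write the result back to the large memory. */
        memcpy(data + start, fast, len * sizeof *fast);
    }

    free(fast);
    return 0;
}
```

The more work done on each chunk while it is resident, the better the copy cost is amortized; with only one pass over the data, the staging copies may cost more than they save.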
The last four to five years have been a culture shock to those developers who insisted on productivity over performance. They are now faced with the serious challenge of effectively utilizing these new powerful nodes. If they continue with all-MPI codes and do not perform the necessary conversion to multi-level parallel code (through vectorization and threading), their performance will be a function of just the number of cores on the node and the clock rate. While the number of cores on the node is going up slowly, the clock rate of those cores is going down, so without software optimization, new hardware investments yield poor returns.
Our book Programming for Hybrid Multi/Manycore MPP Systems discusses the architecture of the new nodes and the programming techniques application developers can employ to glean more performance. It will be extremely useful for those developers who are now faced with the significant challenge of getting increased performance from the vector capability of the node, as well as improved scaling across the increased number of cores or threading for the GPU. The book also looks at the memory hierarchy of the KNL and discusses various approaches for managing the data. Finally, the book looks to the future, which is more of the same in many ways: more cores with wider vectors as well as more complex memory hierarchies.
If you’ll be at SC17 in Denver, join us at Cray booth #625 for a meet-the-author session at 2 p.m. on Tuesday, November 14. You can purchase a book at the CRC Press booth #811 for a 30% discount during the show.