To describe the current trajectory of software optimization trends in HPC, we need to understand our recent past. Which recent trends dictate our current state? We also need to know where we are now. Which HPC software optimizations are currently commonplace? Finally, to what extent does the trajectory created by past and current HPC software optimization activities predict the future?
Where did we come from?
We’ll start our software optimization narrative by talking about two hardware trends: Moore’s law (the number of transistors in an integrated circuit doubles approximately every two years) and Dennard scaling (as transistors get smaller, their power density stays constant). The smaller transistors delivered by Moore’s law, coupled with the commensurate decrease in per-transistor power consumption described by Dennard scaling, resulted in “free” performance increases for existing software. That is, software developers didn’t need to invest much in software optimization because their existing codes realized reasonable performance gains from new hardware. As long as Moore’s law and Dennard scaling continued to hold, application developers had limited impetus to optimize their codes at the node level.
While these “free” performance increases limited the need for investment in HPC software at the node level, scaling out to larger numbers of nodes required code changes. A significant portion of HPC codes capable of scaling to large core counts were parallelized with the message-passing paradigm, most commonly through MPI. However, after well-performing MPI ports of applications were created, “free” performance gains began to reappear. By running one MPI process per core, existing MPI codes achieved acceptable performance during the beginning of the multi-core era. Existing software continued to gain “free” performance from faster cores as well as from scaling out to additional cores. During this era, developers found limited motivation to optimize codes at the node level.
Where are we now?
Current HPC software optimization activities are frequently related to finding and exploiting ever-increasing amounts of parallelism, as required by current and imminent hardware designs. As the number of cores per CPU socket and systems’ total core counts continue to climb, MPI-only codes stop scaling well due to communication overhead. To continue to improve application performance, codes need to be modernized to reduce MPI communication.
The rise of OpenMP
A common solution to the communication problem is to run a small number of MPI ranks per node and fill the rest of the CPU cores with threads. While there are a number of threading techniques, OpenMP has found incredible popularity in the HPC community. OpenMP makes exploiting existing loop-level parallelism relatively easy through the addition of compiler directives to existing loops. Increased performance can be achieved at a low cost to the developer as long as the code exposes enough of the underlying algorithmic parallelism.
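As a minimal sketch of what "adding compiler directives to existing loops" can look like, the C fragment below annotates two ordinary loops. The function names `axpy` and `dot` are illustrative and not taken from any particular code. Built without OpenMP support, the directives are ignored and the loops simply run serially.

```c
/* Scale-and-add: y[i] = a * x[i] + y[i].
   Each iteration touches only its own element, so a single
   directive is enough to split the loop across threads. */
void axpy(long n, double a, const double *x, double *y)
{
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

/* Dot product: every iteration adds into the same variable.
   The reduction clause gives each thread a private partial sum
   and combines the partial sums when the loop finishes. */
double dot(long n, const double *x, const double *y)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

With GCC or Clang, OpenMP is typically enabled with the `-fopenmp` flag. The loop bodies are unchanged from their serial form, which is the low developer cost the paragraph above refers to.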
The rise of OpenMP does, however, represent some amount of forced code optimization at the node level. While there are times when OpenMP directives can be easily added to existing loops to gain performance, code and loop restructuring is often required. As long as CPU core counts remain fairly low, finding enough parallelism for OpenMP can frequently be achieved without too much development effort, but we’re starting to see that codes may need to be modernized as new generations of hardware arrive.
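One small, hypothetical illustration of the kind of loop restructuring this can involve: the first function below carries a running index from one iteration to the next, so its iterations cannot safely run in parallel as written. The second computes the same index directly from the loop counter, which removes the dependency. Both function names are invented for this sketch.

```c
/* Serial form: idx is updated in every iteration and read in the
   next one, so the iterations depend on each other. */
void fill_strided_serial(long n, long stride, double *out)
{
    long idx = 0;
    for (long i = 0; i < n; i++) {
        out[idx] = (double)i;
        idx += stride;
    }
}

/* Restructured form: the index is derived from i alone, so every
   iteration is independent and the directive is safe to apply. */
void fill_strided_parallel(long n, long stride, double *out)
{
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        out[i * stride] = (double)i;
    }
}
```

Real codes often need larger changes than this, such as splitting loops or reorganizing data structures, but the pattern is the same: expose independent iterations before adding the directive.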
The rise of OpenACC
A more recent solution to the communication problem is to use a single MPI rank per node and perform all computation on an accelerator. As Moore’s law slows and Dennard scaling begins to break down, simply increasing the core count of traditional multi-core CPUs is no longer sufficient to achieve additional performance per watt for many codes. This is evidenced by the rise of accelerator and GPU systems in the TOP500 list. In much the same way that OpenMP found success largely due to its ease of use for threading on CPUs, OpenACC has succeeded because it can simplify the process of porting codes to GPUs.
However, “simplify” here is relative. The best way to put this in context is to compare OpenACC with CUDA. It’s generally fair to say that porting a code to CUDA requires a nearly complete rewrite of your code, or at least of its computationally intensive parts. OpenACC can sometimes avoid this rewriting due to its directive-based nature. But don’t be fooled by the hype: OpenACC is not a magic bullet, and it frequently requires code restructuring to achieve good performance. This is especially true with respect to the management of data motion between the host and accelerator devices, where the devices have a small but high-bandwidth local memory. The rise of OpenACC is a strong indication that codes need to be modified again to keep up with the hardware of the future.
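To make the data-motion point concrete, here is a minimal sketch in C with OpenACC directives. It assumes a simple three-point smoothing kernel, and the function name `smooth` is invented for illustration. The `data` region keeps both arrays resident on the accelerator across all sweeps, so `a` is copied to the device once and back once instead of on every kernel launch. A compiler without OpenACC support ignores the directives and runs the loops on the host.

```c
/* Apply `sweeps` passes of a three-point average to the interior
   of a[0..n-1], using tmp[0..n-1] as scratch space. */
void smooth(long n, int sweeps, double *a, double *tmp)
{
    /* copy:   a is sent to the device on entry and returned on exit.
       create: tmp is allocated on the device and never transferred. */
    #pragma acc data copy(a[0:n]) create(tmp[0:n])
    {
        for (int s = 0; s < sweeps; s++) {
            /* present: the arrays are already on the device, so
               launching this kernel moves no data. */
            #pragma acc parallel loop present(a[0:n], tmp[0:n])
            for (long i = 1; i < n - 1; i++) {
                tmp[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;
            }

            #pragma acc parallel loop present(a[0:n], tmp[0:n])
            for (long i = 1; i < n - 1; i++) {
                a[i] = tmp[i];
            }
        }
    }
}
```

Without the enclosing `data` region, a compiler would typically have to move the arrays between host and device around each parallel loop, and that transfer cost can easily outweigh the gain from running the loop on the accelerator.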
The current trajectory
We see a reasonably clear trajectory formed from our recent past and the current state of HPC software optimization: the age of “free” performance increases has come to a definitive end. Codes already require increasingly significant amounts of modernization on current state-of-the-art hardware, and the next generation will only increase that burden. While some HPC codes will continue to run efficiently with only minor changes, an ever-increasing number will require restructuring and modernization to keep up. Future hardware brings not only more threads and wider vectors, but also adds complexity to the memory hierarchy with the introduction of high-speed memories such as High Bandwidth Memory (HBM) and Hybrid Memory Cube (HMC).
To find out how to modernize your code for current hardware while making your optimizations future-proof, continue reading the second part of this blog series: “HPC Software Optimization Trends (Part 2): What Will the Future Be Like and How Do We Prepare for It?”