To what extent do past and current HPC software optimization activities predict the future? Current optimization trends and next-gen hardware designs point to more threads, wider vectors, and an increasingly complex memory hierarchy. Is HPC application software ready for the future foretold by this trajectory? What will it be like and how can you be prepared for it?
Where did we come from and where are we now?
For background about how we arrived at the current state of HPC software optimization, read part one of this blog series: “HPC Software Optimization Trends (Part 1): Where Did We Come From and Where Are We Now?” The key point is to realize that the recent past of HPC software optimization has been marked by relatively long stretches of time where developers have been able to ignore much of the node-level architecture. As long as your MPI application scaled out relatively well, new processors provided “free” performance improvements through both increased core performance and the addition of multiple cores.
Current HPC software optimization activities frequently relate to finding and exploiting ever-increasing amounts of parallelism. The all-MPI codes of the past need to be modernized to reduce the number of ranks and the communication overhead. This is often done by running a single MPI rank per node, with the majority of the computation performed by OpenMP threads or offloaded to an accelerator via OpenACC.
While perhaps the rise in popularity of classical OpenMP was merely a hint, OpenACC and OpenMP-4.0 indicate that the old days of getting “free” performance increases on new hardware are coming to an end.
What will the future be like?
Future processors with larger core counts shouldn’t come as a surprise. The first general-purpose dual-core CPUs arrived in the early 2000s; today, 18 cores per CPU is the new normal. The industry-standard, general-purpose CPUs of the future will have more cores and support more threads per core. HPC software developers should plan accordingly for this high-thread-count future.
Accelerator architectures with wide vectors are nothing new. However, the trend of CPU vector widths is more revealing, as these cores were not designed from their inception for large vectors. Figure 1 shows the accelerating trend of double-precision vector widths over the past decade.
Figure 1: Ten years of double-precision vector width increases for three common CPU architectures
I expect this trend to continue into the future, despite the limitations and challenges of vectorization. Therefore, I posit “Vose’s law”: The vector width of general-purpose CPUs will double roughly every five years. This trend of more threads and longer vectors will continue well into the future, as it undoubtedly helps certain classes of codes. However, the proportion of codes that benefit from this trend will decrease.
More complex memory hierarchies
Faster cores with support for more threads and larger vectors are great for some classes of codes, but memory bandwidth has not kept up with historical increases in CPU speed. It is increasingly common for application performance to be bandwidth-limited. Future hardware addresses this issue with an upgraded and more complex memory hierarchy. It will not be uncommon to see systems with a hierarchy starting with L1 through L4 caches, followed by both high-bandwidth memory (HBM) and DRAM, and finally ending with some form of non-volatile memory such as Intel’s recently announced 3D XPoint. All of this sits at the top of the hierarchy, above any SSD burst buffers, traditional spinning disks, and tape. Some classes of codes will have a make-or-break relationship with this complex memory hierarchy.
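To see why access patterns matter so much once a code is bandwidth-limited, consider these two traversals of the same matrix. This is an illustrative sketch, not from the original post: both functions do identical arithmetic, but they stress the memory hierarchy very differently.

```c
#include <stddef.h>

#define N 64

/* Row-major traversal: consecutive accesses hit consecutive addresses,
 * so every cache line fetched from DRAM (or HBM) is fully used. */
double sum_rows(const double m[N][N]) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Column-major traversal of the same data: each access jumps N*8 bytes,
 * touching a new cache line almost every time. Same flops, same answer,
 * but far more pressure on every level of the hierarchy for large N. */
double sum_cols(const double m[N][N]) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```

On a bandwidth-limited machine, the strided version can run many times slower even though a flop counter sees no difference.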
How can you be prepared for the future?
While future hardware may require many threads and large vectors to reach peak theoretical FLOPs, this doesn’t mean adding more threads or extending vectorization is the correct path forward for your code. Keep the complex memory hierarchy in mind as well. How do you know where to start your software optimization efforts? The key is to focus on your application’s memory behavior.
Begin by considering your application’s computational intensity — the number of operations performed per array access. If it’s high enough, new hardware may provide easy performance gains. However, many codes have neither a large computational intensity nor a large amount of data re-use. These applications are not cache-friendly and will instead be limited by available HBM or DRAM bandwidth. For these codes, the key is to optimize for the memory hierarchy.
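The contrast is easy to see in code. Below is an illustrative sketch (the function names are mine, not from the post): a DAXPY-style update with low computational intensity, and a naive matrix multiply whose intensity grows with problem size.

```c
#include <stddef.h>

/* Low computational intensity: 2 flops per roughly 3 memory accesses,
 * with no data re-use. Performance is bound by memory bandwidth. */
void daxpy(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* High computational intensity: about 2*n^3 flops over 3*n^2 elements.
 * Re-use grows with n, so the caches can keep the cores fed. */
void matmul(size_t n, const double *a, const double *b, double *c) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++)
                s += a[i*n + k] * b[k*n + j];
            c[i*n + j] = s;
        }
}
```

Codes dominated by kernels like `matmul` stand to gain from more threads and wider vectors; codes dominated by kernels like `daxpy` will live or die by the memory hierarchy.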
When working with a vector-amenable code, don’t limit yourself to the existing implementation. Expose as much of the underlying algorithmic parallelism as possible, changing data structures and loop nests as required. Avoid leaning on explicit threading or vectorization via OpenACC or OpenMP-4.0 directives, and don’t manually block or unroll loops. Instead, future-proof your application by writing code in which a good compiler can clearly see the available parallelism. A quality compiler like Cray’s can vectorize the code for you, and can block and unroll loops in whatever way suits the target hardware. Don’t fight the compiler; stay out of its way and let it do what it does best: optimizing your code for the target hardware.
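Writing compiler-friendly code can be as simple as the following sketch (again illustrative, with a made-up function name): a plain loop whose pointers carry the C99 `restrict` qualifier, promising the compiler that the arrays don’t alias so it is free to vectorize, block, and unroll however the target hardware demands.

```c
#include <stddef.h>

/* A plain, unobfuscated loop. The restrict qualifiers assert that x
 * and y never overlap, removing the aliasing hazard that would
 * otherwise force the compiler to generate conservative scalar code. */
void scale_add(size_t n, double a,
               const double * restrict x, double * restrict y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

The hand-unrolled or hand-blocked version of this loop would run no faster today and would actively obstruct the compiler on tomorrow’s wider-vector hardware.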
While some HPC codes can achieve gains from the continuation of current thread and vector trends, an ever-increasing number will instead need to focus on memory bandwidth and the increasingly complex memory hierarchy of future hardware designs. Is your code ready for the future? Start optimizing it today to avoid being caught off guard tomorrow.