July brought news of the launch of the new Intel® Xeon® Scalable Processors, previously known by the codename Skylake, and with it new features and capabilities that can bring performance enhancements to both legacy codes and new applications. This may leave you wondering what the best approaches are to getting the most out of these new workhorses. In this article, we’ll let you know about a couple of the key insights we’ve gained from running the benchmarks we’ve collected and optimized over the last four decades on the latest in the line of processors that are ubiquitous in high-performance computing (HPC).
HPC codes require balance
Parallel programming is about the balance between computation and data movement. Many of the benchmark codes we use at Cray to optimize systems are limited by memory bandwidth. This balance is even more important as processor core counts have increased to address the end of Dennard scaling and the subsequent impact on clock frequency. In HPC codes, cores and processor resources are enlisted to solve the same problem in parallel. This is very different from hyperscale data centers and use cases where resources are virtualized so different problems can share a single processor. Features designed for hyperscale customers can negatively impact HPC code performance.
The new Intel Xeon Scalable processors bring new tech to HPC
Four of the innovations in the new Xeon Scalable processor family are key to getting higher HPC performance. First, Intel has given each socket 50% more memory channels, bringing the total to six DDR4 channels, while also raising the transfer rate to 2,666 MT/s in the top two tiers of the offering. Second, the wider vectors supported by AVX-512 enable higher efficiency and more results per clock period. Third, the top two tiers have two fused multiply-add (FMA) units, again doubling the number of results per clock that can be achieved. Finally, the top two tiers also have three high-bandwidth, coherent links to connect multiple processors together without additional hardware.
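To make the numbers above concrete, here is a minimal back-of-the-envelope sketch of the theoretical per-socket peaks they imply. The channel count, transfer rate, FMA count, and core count come from this article. The 8-byte channel width is the standard DDR4 data width. The 2.0 GHz clock is a purely hypothetical value chosen for illustration, not a published specification.

```python
"""Rough theoretical peaks for one socket; a sketch, not a specification."""

DDR4_CHANNELS = 6             # from the article: six DDR4 channels per socket
TRANSFERS_PER_SEC = 2666e6    # from the article: 2,666 million transfers per second
BYTES_PER_TRANSFER = 8        # standard 64-bit DDR4 data channel
VECTOR_BITS = 512             # AVX-512 vector width
FMA_UNITS = 2                 # from the article: top two tiers
FLOPS_PER_FMA = 2             # one multiply plus one add
CORES = 28                    # from the article: up to 28 cores per socket
ASSUMED_CLOCK_HZ = 2.0e9      # hypothetical clock, for illustration only


def peak_bandwidth_gb_s(channels=DDR4_CHANNELS,
                        transfers_per_sec=TRANSFERS_PER_SEC,
                        bytes_per_transfer=BYTES_PER_TRANSFER):
    """Theoretical peak memory bandwidth of one socket in GB/s."""
    return channels * transfers_per_sec * bytes_per_transfer / 1e9


def dp_flops_per_cycle(vector_bits=VECTOR_BITS, fma_units=FMA_UNITS, cores=1):
    """Double-precision floating-point operations per clock cycle."""
    lanes = vector_bits // 64  # 64-bit doubles per vector register
    return lanes * fma_units * FLOPS_PER_FMA * cores


def bytes_per_flop(clock_hz=ASSUMED_CLOCK_HZ):
    """Machine balance: bytes of memory traffic available per peak flop."""
    peak_flops = dp_flops_per_cycle(cores=CORES) * clock_hz
    return peak_bandwidth_gb_s() * 1e9 / peak_flops


if __name__ == "__main__":
    print(f"peak bandwidth per socket: {peak_bandwidth_gb_s():.1f} GB/s")
    print(f"DP flops per cycle per core: {dp_flops_per_cycle()}")
    print(f"DP flops per cycle per socket: {dp_flops_per_cycle(cores=CORES)}")
    print(f"balance at assumed clock: {bytes_per_flop():.3f} bytes/flop")
```

Under these assumptions the socket can deliver far less than one byte of memory traffic per peak floating-point operation, which is why bandwidth-limited codes rarely approach peak compute.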
Pay attention to these 3 areas of data movement to get the most out of the Scalable processor
HPC codes have developed over several decades, and they’ve been optimized for the x86 architecture and cache hierarchy. Moving to the latest Xeon is relatively straightforward, as the overall architecture is the same. There are some changes in how memory is managed, however, and these can lead to unexpected behavior. It is important to test and measure the single-thread memory bandwidth while optimizing code to run on the Scalable processor family.
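As a rough illustration of what measuring single-thread memory bandwidth can look like, the sketch below times a large buffer copy on one thread. It is not the benchmark used at Cray, and the function name, buffer size, and repeat count are all illustrative assumptions. A compiled STREAM-style kernel would give more trustworthy numbers, but the structure is the same: use buffers much larger than the caches, repeat, and keep the best time.

```python
import time


def copy_bandwidth_gb_s(size_bytes=128 * 1024 * 1024, repeats=5):
    """Estimate single-thread copy bandwidth in GB/s (best of `repeats`).

    The buffers should be much larger than the last-level cache so the
    copy is served from main memory rather than from cache.
    """
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # same-length slice assignment: a bulk in-memory copy
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    best = max(best, 1e-12)  # guard against a zero reading from a coarse timer
    # Count one read plus one write per byte; any extra write-allocate
    # traffic generated by the hardware is not counted here.
    return 2 * size_bytes / best / 1e9


if __name__ == "__main__":
    print(f"single-thread copy bandwidth: {copy_bandwidth_gb_s():.1f} GB/s")
```

Running this before and after a change in compiler flags, memory placement, or processor generation gives a quick sanity check on whether one thread is seeing the bandwidth you expect.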
The new Scalable processor family includes AVX-512, which debuted in the Intel® Xeon Phi™ line, doubling the vector length from 256 bits to 512 bits. Generally, this can lead to performance increases, but one should not assume that the wider vectors will always provide additional performance. There may be cases where targeting 256-bit vectors outperforms the 512-bit instructions.
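One way to reason about when narrower vectors might win is a simple timing model, sketched below. It assumes, purely for illustration, that the core runs at a lower clock while executing 512-bit instructions (2.0 GHz versus 2.4 GHz here), and that run time is the larger of compute time and memory time. The clock values, the 10 GB/s bandwidth, and the workload sizes are all hypothetical, not measurements.

```python
def runtime_s(vector_flops, scalar_cycles, bytes_moved,
              vector_bits, clock_hz, bandwidth_bytes_s, fma_units=2):
    """Model run time as max(compute time, memory time).

    Compute time covers non-vectorized work (in cycles) plus vectorized
    double-precision flops executed at the peak FMA rate.
    """
    flops_per_cycle = (vector_bits // 64) * fma_units * 2
    compute = (scalar_cycles + vector_flops / flops_per_cycle) / clock_hz
    memory = bytes_moved / bandwidth_bytes_s
    return max(compute, memory)


# Hypothetical operating points: the clocks are assumptions, not specifications.
AVX2 = dict(vector_bits=256, clock_hz=2.4e9)
AVX512 = dict(vector_bits=512, clock_hz=2.0e9)
BANDWIDTH = 10e9  # assumed single-thread bandwidth in bytes per second

# Three illustrative workloads.
CASES = {
    "mostly scalar": dict(vector_flops=1e9, scalar_cycles=1e9, bytes_moved=1e9),
    "compute heavy": dict(vector_flops=1e11, scalar_cycles=1e8, bytes_moved=1e9),
    "memory bound": dict(vector_flops=1e9, scalar_cycles=0, bytes_moved=1e11),
}

if __name__ == "__main__":
    for name, work in CASES.items():
        t256 = runtime_s(bandwidth_bytes_s=BANDWIDTH, **work, **AVX2)
        t512 = runtime_s(bandwidth_bytes_s=BANDWIDTH, **work, **AVX512)
        print(f"{name:14s} 256-bit: {t256:.3f} s   512-bit: {t512:.3f} s")
```

In this model the 512-bit path wins only when vectorized compute dominates. The mostly scalar case favors the higher clock, and the memory-bound case shows no difference at all, which is why measuring both targets on your own code is worthwhile.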
The new processors have up to 28 cores per socket which, in the common 2-socket HPC node architecture, can lead to a node with 56 cores and 12 memory channels spread across two NUMA domains. We’ve found that many of the codes that are common to our customers can benefit from the balance provided by high-bandwidth links between the processors, leading us to add three links between processors on the Cray® XC50™ performance computer. As developers work to scale their applications to make use of the immense computing capability provided by the new Scalable processor family, they should pay attention to balancing the required compute with all the data movement on the node, including between the processors.
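Before tuning data movement between processors, it helps to know how cores map to NUMA domains on the node you are running on. The sketch below reads that mapping from the Linux sysfs tree at /sys/devices/system/node. The helper names are illustrative, and on systems without that directory the function simply returns an empty mapping.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a kernel CPU list such as '0-3,8,10-11' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_topology(sysfs_root="/sys/devices/system/node"):
    """Return {numa_node_id: [cpu ids]} as reported by Linux sysfs."""
    topology = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*")):
        match = re.search(r"node(\d+)$", path)
        if match is None:
            continue
        try:
            with open(os.path.join(path, "cpulist")) as handle:
                topology[int(match.group(1))] = parse_cpulist(handle.read())
        except OSError:
            continue  # node directory without a readable cpulist
    return topology


if __name__ == "__main__":
    for node_id, cpus in sorted(numa_topology().items()):
        print(f"NUMA node {node_id}: {len(cpus)} logical CPUs")
```

With this mapping in hand, you can pin threads or ranks so that each one works mostly on memory attached to its own socket, and reserve the links between processors for the traffic that genuinely has to cross them.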
First principles of parallel programming are still the key
To the practiced application developer this may not seem like anything new, and that’s true. With the latest offerings, it’s even more important to focus on the first principles of parallel programming and on maintaining the balance between compute and data movement. Keep in mind that the details of how you get the best performance from your code may be a little different from what was required on the last generation.
Learn more about how Cray and Intel partner on leading high-performance computing technologies.