An important consideration when moving to the next generation of multi/many-core systems is striving to create a refactored application that can run well on available systems.
There are numerous similarities between the multi/many-core systems of today. Today’s systems have very powerful nodes, and the application must exploit a significant amount of parallelism on the node, which is a mixture of MIMD (multiple instruction, multiple data) and SIMD (single instruction, multiple data).
The principal difference is how the application utilizes the MIMD parallelism on the node. Multicore Xeons and many-core Intel Phi systems can handle a significant amount of MPI on the node, whereas GPU systems cannot. There is also a difference in the size of the SIMD unit: the vector length on the multi/many-core systems is less than 10, while on the GPU it is 32. Because longer vectors on the multi/many-core systems do run better than shorter vectors, this is less of an issue. On every one of these target systems, the application needs well-vectorized code to run well. But there is a problem:
“Software is getting slower more rapidly than hardware becomes faster.”
–Niklaus Wirth, 1995
The “other p”
There is a trend in the industry that goes against creating a performant portable application: the attempt to utilize high-level abstractions intended to increase the productivity of the application developers. The movement to C++ over the past 10 to 15 years has created applications that achieve a lower and lower percentage of peak performance on today’s supercomputers. Recent extensions to both C++ and Fortran to address productivity have significantly contributed to this movement. Even the developer of C++, Bjarne Stroustrup, has indicated that C++ can lure the application developer into writing inefficient code:
“C makes it easy to shoot yourself in the foot; C++ makes it harder, but when you do it blows your whole leg off.”
“Within C++, there is a much smaller and cleaner language struggling to get out.”
The productivity argument is that the cost of talented labor is greater than the cost of the high-performance computing system being utilized, and that it is too time-consuming to improve the performance of the application. On the other hand, time spent optimizing application performance not only makes better use of expensive machines, it also reduces operational costs as energy costs continue to rise for future systems.
Several years ago, a team led by Thomas Schulthess and Oliver Fuhrer of ETH Zurich refactored the production version of COSMO, the climate modeling application used by MeteoSwiss, and found that not only did the application run significantly faster on their current GPU system, but the cost of the effort would be more than repaid by the savings in energy costs over a couple of years. The charts below show the performance increase and power consumption decrease for the refactored application.
The work on COSMO employed C++ template meta-programming for the time-consuming dynamical core, which resulted in the instantiation of CUDA kernels on the GPU and optimized assembler on the x86 systems. This work is an example of developers understanding the architecture and restructuring the application to utilize its features through high-level C++ abstractions.
The bulk of the code – the physics – was Fortran, and OpenACC was used for the port to the accelerator. This is an excellent example of how an investment of several person-years of effort can produce an optimized application that more than pays for its development cost.
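To make the template approach described above for the dynamical core more concrete, here is a minimal sketch of the general idea: a stencil written once and bound at compile time to a backend that decides how the loop is executed. This is not the MeteoSwiss code. The names SerialBackend, BlockedBackend, and laplacian_1d are hypothetical, and a real framework would add a GPU backend that instantiates device kernels instead of the simple host loops assumed here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical backend policy: plain serial traversal of the index range.
struct SerialBackend {
    template <typename Body>
    static void for_each(std::size_t begin, std::size_t end, Body body) {
        for (std::size_t i = begin; i < end; ++i) body(i);
    }
};

// Hypothetical backend policy: walks the range in fixed-size blocks, as a
// stand-in for an architecture-specific traversal order.
struct BlockedBackend {
    template <typename Body>
    static void for_each(std::size_t begin, std::size_t end, Body body) {
        constexpr std::size_t block = 64;  // assumed block size
        for (std::size_t b = begin; b < end; b += block) {
            const std::size_t stop = (b + block < end) ? b + block : end;
            for (std::size_t i = b; i < stop; ++i) body(i);
        }
    }
};

// The stencil is written once; the backend is chosen at compile time, so the
// compiler can inline the loop body into whichever traversal is selected.
// Boundary points (first and last) are left untouched.
template <typename Backend>
void laplacian_1d(const std::vector<double>& in, std::vector<double>& out) {
    assert(in.size() == out.size() && in.size() >= 2);
    Backend::for_each(1, in.size() - 1, [&](std::size_t i) {
        out[i] = in[i - 1] - 2.0 * in[i] + in[i + 1];
    });
}
```

Calling laplacian_1d<SerialBackend>(in, out) and laplacian_1d<BlockedBackend>(in, out) runs the same stencil with two different traversals. The point of the design is that the numerical kernel stays in one place while the architecture-specific part is isolated in the backend type.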
Data motion is extremely expensive today and will be more expensive in the future, both in energy and time, and many of the high-level abstractions in the language can easily introduce excessive data motion in an application. Retrofitting a large C++ framework with a high-level abstraction requires application developers either to move their data into a form acceptable to the abstraction or to refactor their applications so that the abstraction manages their data structures. The first approach introduces too much data motion, and the second approach often requires a significant rewrite. Once such a rewrite has been performed, the application is dependent upon those interfaces making efficient use of the underlying architecture. Additionally, most complex multiphysics applications are a combination of computations that flow from one set of operations to another, and breaking that flow up into calls to the low-level abstractions could result in ineffective cache utilization and increased data motion.
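The cost of breaking a computation into separate calls can be illustrated with a small sketch. The functions below are hypothetical and far simpler than a real framework: the two-pass version writes an intermediate array to memory and reads it back, while the fused version keeps the intermediate value in a register and sweeps the data once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Step 1 as a separate call: allocates and fills a full intermediate array.
std::vector<double> scale(const std::vector<double>& x, double a) {
    std::vector<double> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = a * x[i];
    return out;
}

// Step 2 as a separate call: reads the intermediate array back from memory.
std::vector<double> shift(const std::vector<double>& x, double b) {
    std::vector<double> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = x[i] + b;
    return out;
}

// Two sweeps over memory plus one temporary array.
std::vector<double> scale_shift_two_pass(const std::vector<double>& x,
                                         double a, double b) {
    return shift(scale(x, a), b);
}

// One sweep: the scaled value never leaves the register file.
std::vector<double> scale_shift_fused(const std::vector<double>& x,
                                      double a, double b) {
    std::vector<double> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = a * x[i] + b;
    return out;
}
```

Both versions compute the same quantity, although a compiler that contracts the fused expression into a fused multiply-add may round the last bit differently. For arrays larger than cache, the two-pass version moves roughly twice as much data through the memory hierarchy.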
Much of the blame for this productivity movement has to be placed on the Language Standards Committee, which introduces semantics that make the application developer more productive without thinking about compilation issues. Considering the recent additions to both Fortran and C++, it seems that the Committee really does not care about how difficult it might be to optimize the language extensions and that its principal goal is to make programmers more productive. When users see an interesting new feature in the language, they assume that the feature will be efficient; after all, why would the Committee put the feature in the language if it wouldn’t run efficiently on the target systems?
At some point this trend toward productivity at the expense of performance has to stop. Most, if not all, of the large applications that have taken the productivity lane have implemented MPI and messaging outside of the abstraction, and they have obtained an increase in performance from the parallelism across nodes. Increased parallelism must now be obtained on the node in the form of threading and/or vectorization, with special attention paid to minimizing the movement of data within the memory hierarchy of the node. At some point, application developers have to put in extra work to ensure that data motion on the node is minimized and that threading and vectorization are being well utilized.
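As a minimal sketch of what combined threading and vectorization on the node can look like, the hypothetical routine below uses OpenMP directives: threads take whole rows of a row-major array, and the inner loop over contiguous elements is marked for SIMD execution. The function name axpy_2d and the data layout are assumptions for illustration. Without OpenMP enabled at compile time, the directives are ignored and the code runs serially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes y = a*x + y on a rows x cols array stored in row-major order.
void axpy_2d(std::size_t rows, std::size_t cols, double a,
             const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == rows * cols && y.size() == rows * cols);

    // MIMD level: each thread is handed a set of complete rows.
    #pragma omp parallel for
    for (std::size_t r = 0; r < rows; ++r) {
        const double* xr = x.data() + r * cols;
        double* yr = y.data() + r * cols;

        // SIMD level: unit-stride access over one row, so the data is
        // streamed through the vector unit exactly once.
        #pragma omp simd
        for (std::size_t c = 0; c < cols; ++c) {
            yr[c] += a * xr[c];
        }
    }
}
```

Putting the thread-level loop on the outside and the unit-stride loop on the inside gives each thread long contiguous vectors to work on, in line with the observation that longer vectors run better than shorter ones.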