The Other P is Destroying the Ability to Achieve Performance Portability

An important consideration when moving to the next generation of multi/many-core systems is refactoring the application so that it runs well on all of the available systems.

Today's multi/many-core systems have much in common: the nodes are very powerful, and the application must exploit a significant amount of parallelism on the node, which is a mixture of MIMD (multiple instruction, multiple data) and SIMD (single instruction, multiple data) parallelism.
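
That mixture shows up in even the simplest kernel. Below is a minimal sketch (not from any particular application): the loop body is the kind of code that must vectorize (SIMD) on every one of these systems, while copies of the loop run concurrently across cores and MPI ranks (MIMD). The `omp simd` pragma is a portable hint; a compiler without OpenMP support simply ignores it and the loop runs as written.

```cpp
#include <cstddef>
#include <vector>

// axpy: y[i] += a * x[i]. The unit-stride, dependence-free loop body is
// exactly what the SIMD units on Xeon, Xeon Phi, and GPU hardware need;
// the MIMD parallelism comes from running many such loops across
// cores/ranks. The pragma is only a hint and is ignored without OpenMP.
void axpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    #pragma omp simd
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] += a * x[i];
}
```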

The principal difference is how the application utilizes the MIMD parallelism on the node. Multicore Xeon and many-core Intel Xeon Phi systems can handle a significant amount of MPI on the node, whereas GPU systems cannot. There is also a difference in the width of the SIMD unit: the vector units on multi/many-core systems operate on fewer than 10 elements, while a GPU warp is 32. Since longer vectors also run better than shorter vectors on the multi/many-core systems, this difference is less of an issue. All systems must have well-vectorized code to run well. But there is a problem:

“Software is getting slower more rapidly than hardware becomes faster.”
–Niklaus Wirth, 1995

The “other p”
There is a trend in the industry that works against creating a performant, portable application: the attempt to utilize high-level abstractions intended to increase the productivity of application developers. The movement to C++ over the past 10 to 15 years has produced applications that achieve a lower and lower percentage of peak performance on today's supercomputers. Recent extensions to both C++ and Fortran aimed at productivity have contributed significantly to this movement. Even Bjarne Stroustrup, the creator of C++, has indicated that C++ can lure the application developer into writing inefficient code:

“C makes it easy to shoot yourself in the foot; C++ makes it harder, but when you do it blows your whole leg off.”
“Within C++, there is a much smaller and cleaner language struggling to get out.”
–Bjarne Stroustrup

The productivity argument is that the cost of talented labor is greater than the cost of the high-performance computing system being utilized, and that it is too time-consuming to improve the performance of the application. On the other hand, time spent optimizing application performance not only makes better use of expensive machines, it also reduces operational costs as energy costs continue to rise for future systems.

Several years ago, a team led by Thomas Schulthess and Oliver Fuhrer of ETH Zurich refactored the production version of COSMOS, the climate modeling application used by MeteoSwiss, and found that not only did the application run significantly faster on their GPU system, but the cost of the effort would be more than repaid by the savings in energy costs over a couple of years. The charts below show the performance increase and power consumption decrease for the refactored application.

The work on COSMOS employed C++ template meta-programming for the time-consuming dynamical core, which resulted in the instantiation of CUDA kernels on the GPU and optimized assembly on the x86 systems. This work is an example of developers understanding the architecture and restructuring the application to exploit its features through high-level C++ abstractions.
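
A much-simplified sketch of that idea follows. These are hypothetical names, not the actual COSMOS code: a backend tag chosen at compile time lets one stencil definition instantiate different implementations (in the real work, CUDA kernels on the GPU and tuned x86 code), while the numerics stay identical. Here both backends are plain CPU code for illustration.

```cpp
#include <cstddef>
#include <vector>

// Backend tags selected at compile time; a real framework would carry
// CUDA and cache-blocked x86 specializations behind the same interface.
struct NaiveBackend {};
struct X86Backend {};

template <typename Backend>
struct Laplacian1D;

// Reference implementation.
template <>
struct Laplacian1D<NaiveBackend> {
    static void apply(const std::vector<double>& in, std::vector<double>& out) {
        for (std::size_t i = 1; i + 1 < in.size(); ++i)
            out[i] = in[i - 1] - 2.0 * in[i] + in[i + 1];
    }
};

// "Optimized" backend: a real one would add blocking and vectorization;
// the point is that the numerics must match the reference exactly.
template <>
struct Laplacian1D<X86Backend> {
    static void apply(const std::vector<double>& in, std::vector<double>& out) {
        const std::size_t n = in.size();
        for (std::size_t i = 1; i + 1 < n; ++i)
            out[i] = (in[i - 1] + in[i + 1]) - 2.0 * in[i];
    }
};
```

The application code writes `Laplacian1D<Backend>::apply(...)` once, and the backend choice is a single compile-time switch rather than a rewrite.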

The bulk of the code – the physics – was Fortran, and OpenACC was used for the port to the accelerator. This is an excellent example of how an investment of several person-years of effort can result in an optimized application that more than pays for the development cost.
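
To illustrate the directive-based style (in C++ here rather than the original Fortran, and with hypothetical names such as `saturate` and `qmax` that are not from the actual code base): a single OpenACC pragma marks the loop for offload and describes the data movement, and a compiler without OpenACC support ignores the pragma, leaving correct host code.

```cpp
#include <cstddef>
#include <vector>

// Clamp a moisture-like field to a maximum value. With an OpenACC
// compiler the pragma offloads the loop and copies p[0:n] to and from
// the accelerator; otherwise the pragma is ignored and the loop runs
// unchanged on the host.
void saturate(std::vector<double>& q, double qmax) {
    double* p = q.data();
    const std::size_t n = q.size();
    #pragma acc parallel loop copy(p[0:n])
    for (std::size_t i = 0; i < n; ++i)
        if (p[i] > qmax) p[i] = qmax;
}
```

This "same source, two targets" property is what made directives attractive for the large Fortran physics code.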

[Chart: performance increase and power consumption decrease for the refactored COSMOS application]

Data motion is extremely expensive today and will be more expensive in the future, both in energy and time, and many of the high-level abstractions in these languages can easily introduce excessive data motion in an application. Retrofitting a large C++ framework with a high-level abstraction requires application developers either to move data into a form acceptable to the abstraction or to refactor their applications so that the abstraction manages their data structures. The first approach introduces too much data motion, and the second often requires a significant rewrite. Once such a rewrite has been performed, the application depends on those interfaces making efficient use of the underlying architecture. Additionally, most complex multiphysics applications are a combination of computations that flow from one set of operations to another, and breaking that flow up into calls to low-level abstractions can result in ineffective cache utilization and increased data motion.
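
One concrete way an abstraction's data layout drives data motion – a generic illustration, not tied to any particular framework – is the array-of-structs versus struct-of-arrays choice. Updating one field of an array-of-structs drags every unused field through the cache, while a struct-of-arrays touches only the bytes it needs and gives the compiler a unit-stride, vectorizable loop.

```cpp
#include <vector>

// Array-of-structs: each element is 4 doubles, so updating x alone
// still streams y, z, and mass through the cache (stride of 4 doubles).
struct ParticleAoS { double x, y, z, mass; };

void push_x_aos(std::vector<ParticleAoS>& p, double dx) {
    for (auto& pt : p) pt.x += dx;
}

// Struct-of-arrays: the x update reads and writes only x, with unit
// stride, moving a quarter of the data and vectorizing cleanly.
struct ParticlesSoA {
    std::vector<double> x, y, z, mass;
};

void push_x_soa(ParticlesSoA& p, double dx) {
    for (auto& xi : p.x) xi += dx;
}
```

If an abstraction fixes the layout on the wrong side of this trade-off, the application either pays the extra data motion or pays for conversion copies at the interface.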

Much of the blame for this productivity movement has to be placed on the language standards committees, which introduce semantics that make the application developer more productive without thinking about compilation issues. Considering the recent additions to both Fortran and C++, it seems that the committees really do not care how difficult the language extensions might be to optimize; their principal goal is to make programmers more productive. When users see an interesting new feature in the language, they assume that the feature will be efficient; after all, why would the committee put a feature in the language if it would not run efficiently on the target systems?

At some point this trend toward productivity at the expense of performance has to stop. Most, if not all, of the large applications that have taken the productivity lane have implemented MPI and messaging outside of the abstraction, and they have obtained an increase in performance from the parallelism across nodes. Increased parallelism must now be obtained on the node in the form of threading and/or vectorization, with special attention paid to minimizing the movement of data within the memory hierarchy of the node. At some point, application developers have to put in the extra work to ensure that data motion on the node is minimized and that threading and vectorization are well utilized.
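
A small sketch of that node-level work, under the assumption of a simple streaming computation: two separate sweeps over a large array move it through memory twice, while the fused version does one read and one write per element, threads across the node's cores (the OpenMP pragma is ignored without OpenMP), and leaves a loop body the compiler can vectorize.

```cpp
#include <cstddef>
#include <vector>

// Unfused: the array is streamed through the memory hierarchy twice.
void scale_then_shift(std::vector<double>& a, double s, double c) {
    for (std::size_t i = 0; i < a.size(); ++i) a[i] *= s;
    for (std::size_t i = 0; i < a.size(); ++i) a[i] += c;
}

// Fused and threaded: one read and one write per element, the single
// sweep spread across cores, and a vectorizable loop body.
void scale_then_shift_fused(std::vector<double>& a, double s, double c) {
    #pragma omp parallel for
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] = a[i] * s + c;
}
```

The transformation is trivial here; in a real multiphysics code, keeping such operations fused across abstraction boundaries is exactly the hard part.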

Comments

  1. Will says

    A minor correction: it is the COSMO model, not COSMOS. The latter is a different model with similar name (it’s common to mistake the two).

    But naturally the COSMO-POMPA group which is responsible for this development is thrilled about the positive PR.

    –Will

  2. JOE says

    I don’t really get the goal of this article.

    It seems like it is to discourage application programmers from using C++ and modern language standards and to say that we should be trying to achieve higher percentage of peak of current large-scale systems… without giving any examples of things to change or suggestions of how to change. The only specific example of anything cited in this article is that of COSMOS, which used C++ and got positive results. So where are the bad results: are there applications that have rewritten to C++ and performed much worse, or specific features of C++11/14/17/whatever that are to blame? Are those features just being misused or are they inherently bad? I don’t doubt that inefficient code is run on supercomputers all the time, but this article does nothing to a) define “good enough” performance, b) identify real examples of bad performance, and c) suggest how to improve performance in any specific way.

    I find the claim that “The movement to C++ over the past 10 to 15 years has created applications that achieve a lower and lower percentage of peak performance on today’s supercomputers” not only extremely contentious but even damaging to the HPC community. First, I think percentage of peak FLOPS is a bad metric to judge all applications by. Second, I don’t think C++ and newer standards are to blame for poor application performance. There are many factors, like increased scale of parallelism, increasingly complicated architectures, and the existence of different performance bottlenecks in different applications and on different systems. Modern C++ and Fortran actually provide better support for shared-memory parallelism and improved mechanisms for writing generic code that can be specialized for different architectures. And, again, if you’re right that new standards are to blame, what should we do about it?

    • says

      Prior to the advent of GPU systems and KNL, application developers had a free ride just using MPI and scalar processing, benefiting from the increased number of cores on the node. Today we have a situation similar to the movement to distributed-memory programming, which required the incorporation of message passing. Now, to best utilize all the hardware threads (including Hyper-Threads) applications have to be threaded, and to harness the vector processing capability the applications must vectorize. Given the tremendous movement to C++, away from the traditional HPC languages Fortran and C, the modifications to achieve vectorization and threading are a tremendous challenge. Unless the developers really accept the challenge and refactor their codes to utilize threading and vectorization, they will remain in the gigaflop performance realm and realize little improvement on the new HPC systems. The cited reference to the COSMOS weather code is an example of how that challenge can be met with some hard work.

      • says

        Here is a reference to the COSMOS work:

        O. Fuhrer, C. Osuna, X. Lapillonne, T. Gysi, B. Cumming, M. Bianco, A. Arteaga, T. C. Schulthess, “Towards a performance portable, architecture agnostic implementation strategy for weather and climate models”, Supercomputing Frontiers and Innovations, vol. 1, no. 1 (2014), see superfri.org
