What do you do when the performance of the MPI program that distributes your work across many cores and nodes starts to level off? Like any good programmer, you enlist the help of your favorite performance profiler to understand what’s going on. You find that you are now network bound, or that the MPI memory footprint has grown so large that there is little room left for your program. Assuming you still have plenty of work to do, how can you modify your program so you can continue to scale?
One solution is to introduce another level of parallelism within the node. Save MPI for communication between nodes or between NUMA regions, and use threading within a node. When MPI scaling levels off, moving to a hybrid programming model like MPI + OpenMP reduces the number of MPI ranks needed per node, shrinks the memory footprint consumed by MPI, and eases pressure on network injection bandwidth. The process of moving from a pure MPI program to a hybrid MPI + OpenMP program can be daunting, however, as it is tedious and error-prone work.
If you want the best chance of seeing a performance improvement when you are done, which loops should you parallelize? You will want to pick loops with enough work to overcome the overhead of starting and managing multiple threads. These are probably going to be the big loops, perhaps several thousand lines of code. They often contain several inner loops and function calls that you'll have to wade through to determine which variables should be shared between threads and which should be private to a thread. You also have to track variables as their names change when they are passed through functions, and work out whether, and how, a loop is reached from multiple places in the program.
After performing this task many times at Cray, we decided to develop a tool to assist with this process.
Cray Reveal™ shows you a ranked list of important loops and relevant optimization inhibitors. It lets you browse your source code, get explanations for some of the more confusing optimization messages produced by the compiler, and, with the click of a button, analyze all of the most relevant loops in your program and review their parallelization successes and failures. Loops are flagged green if no dependencies were found, and Reveal can automatically build and insert OpenMP directives into your source for these loops. A loop is flagged red if Reveal finds issues that require user intervention.
Reveal provides insight into problems encountered when scoping variables. For example, it will point out a recurrence on a shared variable down the call chain that needs protecting with a critical section, or it will ask you to confirm that an array does not overlap other objects before its proposed scope can be accepted as valid. You only need to focus on the variables and issues that Reveal could not handle during its analysis. There may be hundreds of variables within a loop candidate, and reducing the analysis to a handful of issues can dramatically reduce the effort required to parallelize a loop.
In addition to inserting directives for all parallelizable loops (those that were flagged green), Reveal can insert OpenMP directives for loops with unresolved issues. For these loops, an illegal “unresolved” OpenMP clause is included in the directive to identify all of the variables that need your assistance. The rest of the directive is valid. This clause serves as a reminder that your parallel directive is incomplete (as it will not compile), and can be used as a “work list” for resolving issues using your favorite editor.
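Based on that description, a directive emitted for a loop with open issues might look roughly like the fragment below. The clause spelling is Reveal's own and the variable names are invented; the fragment will not compile as written, which is the point.

```c
/* Reveal has scoped i and tmp; flux and residual still need your review. */
#pragma omp parallel for private(i, tmp) \
            unresolved(flux, residual)   /* illegal clause: fix, then compile */
```

Deleting the clause once each listed variable has been given a correct scope turns the directive into ordinary, portable OpenMP.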
A common question asked by our users is whether Reveal must be used with the Cray compiler. The answer is yes. Reveal needs both CrayPAT™ and the Cray Compiler Environment (CCE) to help you parallelize loops. However, the result is portable OpenMP directives in your source code, and you can subsequently build your program with any compiler that supports OpenMP.
Reveal can reduce the time it takes to add parallelism to your program using OpenMP directives. Analysis that would have taken potentially weeks to get correct by hand can be performed in a few minutes. So, next time you are faced with a situation where you can no longer scale your program because of MPI, we hope you give Reveal a try.
To learn more about Reveal, please watch this video recording, “Adding Parallelism to HPC Applications using Reveal,” from the recent DOE Programming Workshop. Or, to learn about enhancements to the Cray performance tools, please watch “Cray Performance Tools Enhancements for Next Generation Systems,” from the 2016 Cray User Group (CUG) conference.