Much as the old adage says you cannot judge a book by its cover, you cannot estimate the performance of an application on thousands of nodes of a massively parallel computer by investigating its performance on a single node or, for that matter, on 100 nodes.
For example, an application's runtime is the sum of distinct operations, some of which scale and some of which do not. As a thought experiment, suppose one operation takes 99% of the time when running 32 MPI tasks on one node and scales perfectly as the number of MPI tasks increases, while another operation takes 1% of the time on one node but does not scale at all. On 10,000 nodes, or 320,000 MPI tasks, the first operation's cost shrinks to 0.0099% of the original runtime, while the second, still costing 1% of the original runtime, now accounts for roughly 99% of the total. It is somewhat nonsensical to optimize an application on one node if the problem of interest needs to run on thousands of nodes.
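The arithmetic of this thought experiment can be checked in a few lines of Python. The numbers below are the hypothetical ones from the example (320,000 tasks at 32 tasks per node, a factor of 10,000 for the scalable part), not measurements of any real application:

```python
# Thought experiment: two operations, measured at 32 MPI tasks on one node.
base_tasks = 32
scalable = 0.99      # fraction of runtime that scales perfectly
serial = 0.01        # fraction that does not scale at all

tasks = 320_000                       # 10,000 nodes * 32 tasks/node
speedup = tasks / base_tasks          # 10,000x for the scalable part

scalable_time = scalable / speedup    # 0.000099 of the original runtime
serial_time = serial                  # unchanged
total = scalable_time + serial_time

print(f"scalable part: {100 * scalable_time / total:.2f}% of the new total")
print(f"serial part:   {100 * serial_time / total:.2f}% of the new total")
```

The scalable 99% all but vanishes, and the non-scaling 1% grows to about 99% of the remaining runtime, which is exactly why single-node profiles mislead.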
Recently, while working with the Eulerian Applications Project (EAP) at Los Alamos National Laboratory (LANL), we gathered runtime statistics from a production run of interest on 4,000 nodes of the Intel Knights Landing (KNL) partition, using 64 MPI tasks per node, or 262,144 MPI tasks in all. The bottleneck that degraded the application's performance at scale was an operation that took very little time when fewer than 100 MPI tasks were used; the only way to understand the application's performance was to investigate it at scale.
The performance tool we used to gather these statistics was CrayPAT™, part of the Cray Performance Tools suite, which is designed to operate at large scale. Performance was measured on production-scale runs of 500 to 4,000 Intel KNL nodes. The application was not scaling, and the advantage of CrayPAT was its ability to show that the operation consuming most of the time was a portion of the adaptive mesh refinement (AMR) in which MPI tasks determine their neighbors for the next time step. This operation was taking 65% of the time on 262,144 MPI tasks running over 4,000 Intel KNL nodes. The section of code involved was small, and it could be replaced with MPI-3 RMA (remote memory access) one-sided operations that essentially eliminated the time.
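The one-sided idea can be sketched conceptually. The toy model below is pure Python, not the actual EAP code or the MPI API: each "rank" deposits (puts) its updated neighbor information directly into a window owned by the ranks that need it, mimicking MPI_Put into an exposed MPI_Win, so no rank has to search globally for its neighbors. The ring topology and the `put` helper are invented for illustration:

```python
# Toy model of a one-sided neighbor update (pure Python, not real MPI).
# Assumption: each rank exposes a "window" that remote ranks may write
# into, mimicking MPI_Put into a window created with MPI_Win_create.

NRANKS = 8

# One exposed "window" per rank: incoming neighbor notifications land here.
windows = {rank: [] for rank in range(NRANKS)}

def put(origin, target, data):
    """One-sided write: origin deposits data in target's window;
    the target does not participate in the transfer."""
    windows[target].append((origin, data))

# After mesh refinement, each rank knows which ranks border its new cells
# (here a made-up ring topology) and notifies them directly.
for rank in range(NRANKS):
    new_neighbors = [(rank - 1) % NRANKS, (rank + 1) % NRANKS]
    for nbr in new_neighbors:
        put(rank, nbr, {"cells_shared_with": rank})

# Each rank reads its own window to learn its neighbors for the next
# time step; no global search or all-to-all exchange is needed.
neighbors_of = {r: sorted(origin for origin, _ in windows[r])
                for r in range(NRANKS)}
print(neighbors_of[0])  # rank 0 hears from ranks 1 and 7
```

The key property, which real MPI-3 RMA shares, is that only the rank holding the new information initiates communication; the receiving rank does no matching work, which is what removes the scaling bottleneck of a global neighbor search.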
The following chart shows the results of running the original and optimized application on a problem simulating an asteroid striking the earth.
The chart shows weak scaling over the range of 500 to 4,000 nodes, with 64 MPI tasks on each node. In weak scaling, the problem size grows in proportion to the number of MPI tasks, so the work per task stays constant; for perfect scaling, the time to solution would be a flat line. As the chart shows, the original code did not scale, while the optimized code scales very well.
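Weak-scaling curves like these are often summarized as a parallel efficiency relative to the smallest run. A minimal sketch, using made-up times rather than the measured EAP data:

```python
# Weak-scaling efficiency: the per-task problem size is held constant,
# so the ideal time to solution is flat and efficiency = T(base) / T(N).
# The times below are illustrative only, not the measured results.

def weak_scaling_efficiency(times_by_nodes):
    """Map {node_count: time} to {node_count: efficiency vs. smallest run}."""
    base_nodes = min(times_by_nodes)
    base_time = times_by_nodes[base_nodes]
    return {n: base_time / t for n, t in sorted(times_by_nodes.items())}

perfect = {500: 100.0, 1000: 100.0, 2000: 100.0, 4000: 100.0}
degraded = {500: 100.0, 1000: 130.0, 2000: 190.0, 4000: 310.0}

print(weak_scaling_efficiency(perfect))   # all 1.0: a flat line
print(weak_scaling_efficiency(degraded))  # efficiency falls as nodes grow
```

An efficiency near 1.0 at every node count corresponds to the flat line of perfect weak scaling; a falling efficiency is the signature of a non-scaling operation coming to dominate.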
Looking at the performance of an application at large scale is difficult without tools that operate at scale. In this example, the optimization, which yielded a speedup factor of 2.73 for the overall application, was not difficult to obtain once we knew which part of the application caused the poor scaling. The overall application was well over a million lines of Fortran; the code that was optimized was fewer than 100 lines.