Characterization of an Application for Hybrid Multi/Many-core Systems

In my last blog post, I illustrated a paradigm shift in chip technology and how node architecture must significantly change to support these trends. In this post, I will discuss a way to characterize applications for investigating approaches to moving forward into the world of multi- and many-core hybrid systems.

When porting or optimizing an application for the new generation of hybrid multi-/many-core systems, one must expose enough parallelism to exploit two levels of parallelism on the node. Whether the node is an accelerated node or a many-core node, a significant amount of threading on the node is a requirement for good performance. The days of using messaging across all the cores on a node are over. Additionally, the lower-level loops must be long enough to ensure a performance gain from vectorization.
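As a rough illustration (in C, with invented routine and array names; most production codes are Fortran, but the shape is the same), the two levels look something like this: an outer loop spread across the node's cores with OpenMP, and a long, unit-stride inner loop left for the compiler to vectorize.

    /* Minimal sketch of two levels of on-node parallelism (names invented):
       OpenMP threads across the outer loop, compiler vectorization of the
       inner, unit-stride loop.  Compile with OpenMP enabled (e.g. -fopenmp). */
    void update(double *restrict a, const double *restrict b,
                const double *restrict c, int nrows, int ncols)
    {
        /* Level 1: coarse-grained threading over the rows of the grid. */
        #pragma omp parallel for schedule(static)
        for (int j = 0; j < nrows; j++) {
            /* Level 2: long, unit-stride inner loop for the vector units. */
            for (int i = 0; i < ncols; i++) {
                a[j * ncols + i] = b[j * ncols + i] + 2.0 * c[j * ncols + i];
            }
        }
    }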

Before starting any work on refactoring an application, the user must understand the application in detail. Characterizing an application involves gathering instrumentation data from a series of computations and using that data to formulate an approach to refactoring the application to achieve the best performance on hybrid multi-/many-core systems.

First, some guidelines for gathering this data:

  1. The problems that are used in the instrumented runs must be important and large enough to represent a future science problem to be solved on the target system.
  2. The problem should be run long enough to identify the major computational trends.
  3. The problem should be run on a significant number of nodes to account for problem sizes at scale.

Second, what type of data needs to be collected?

  1. The instrumentation should focus on high-level looping structures.
    1. What is the minimum, maximum and average iteration count for each high-level loop? (A small, illustrative sketch of these statistics follows this list.)
    2. What arrays are read, written or read/written in the looping structures?
    3. Do the high-level loops do message passing?
    4. Can we steal some parallelism from the MPI?
  2. What portions of the application are exercised in the problems being addressed?
    1. It is important to understand that on a very large multidiscipline application, some large sections of code may not be utilized. This will be particularly true when trying to manage arrays — some arrays may never be used in certain computations.
    2. The initialization of the problem should not be included in the analysis unless it takes a significant percentage of the wall time in a long running solution.
    3. I/O must be considered along with the computation. Management of the arrays that are utilized in the computational section may affect how the I/O is managed.
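A tool such as CRAYPAT gathers the loop statistics mentioned in item 1 automatically; conceptually, though, they amount to nothing more than the following hand-rolled counters (an illustrative C sketch, not how the tool is implemented):

    /* Illustrative only: the per-loop statistics referred to above, that is,
       minimum, maximum and average trip count, kept as simple counters.      */
    #include <stdio.h>

    typedef struct {
        long min, max, total, visits;
    } loop_stats;

    static void record_trip_count(loop_stats *s, long n)
    {
        if (s->visits == 0 || n < s->min) s->min = n;
        if (s->visits == 0 || n > s->max) s->max = n;
        s->total += n;
        s->visits++;
    }

    static void report_loop(const char *name, const loop_stats *s)
    {
        printf("%s: min=%ld max=%ld avg=%.1f (entered %ld times)\n",
               name, s->min, s->max,
               (double)s->total / (double)s->visits, s->visits);
    }

Each time a high-level loop is entered, its trip count for that entry is recorded, and the summary is printed at the end of the run.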

Cray’s Programming Environment team has developed a suite of tools, including Cray Apprentice2, for gathering the statistics required for this analysis (see displays below). Among these tools are:

  • CRAYPAT™, which can instrument all the looping structures within an application and can display call tree information containing loops and routines.
  • Reveal, which can perform scoping analysis of high-level looping structures containing calls to subroutines and functions, providing whole-program analysis.

In addition, Cray is investigating memory tools to help users manage data layout.

[Figure 1: CrayPat call-tree display]

[Figure 2: Loop iteration counts]

Consider the example of output (Figure 1) from an instrumented run using CRAYPAT. The main routine, vhone, calls sweepy, which contains two loops. Within those loops, sweepy calls ppmlr, which calls remap, which in turn calls parabola. After returning from remap, ppmlr calls riemann, parabola, evolve and states. Finally, after the call to sweepy, there are some MPI calls. Figure 2 gives the iteration counts for these loops.

Once this data is collected, a thorough analysis is required to identify where threading can be used on the node and whether low-level looping structures can be vectorized. High-level threading will be necessary to scale to appropriate threading levels on the target architecture. Coarse granularity is required to amortize the overhead of instantiating a threaded region and to allow for sophisticated scheduling of threads to minimize load imbalance. This high-level threading will require a complex scoping analysis to classify each variable as either shared by all threads or private to each thread. The analysis may also identify loop-carried dependencies that must be resolved before the loop can be parallelized.
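To make the scoping step concrete, here is a small, invented C example of the kind of issue the analysis turns up: a scalar temporary that must be private to each thread, and a running sum that is a loop-carried dependency and has to be expressed as a reduction before the loop can be threaded.

    /* Invented example of a scoping problem.  "tmp" must be private to each
       thread; the accumulation into "total" is a loop-carried dependency,
       resolved here with a reduction.  "a", "b" and "n" remain shared.       */
    double smooth_and_sum(double *a, const double *b, int n)
    {
        double total = 0.0;
        double tmp;

        #pragma omp parallel for private(tmp) reduction(+:total)
        for (int i = 0; i < n; i++) {
            tmp = 0.5 * (a[i] + b[i]);   /* per-thread work variable          */
            a[i] = tmp * tmp;            /* independent writes across i       */
            total += tmp;                /* reduction removes the dependency  */
        }
        return total;
    }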

Next, low-level looping structures must be investigated for vectorization, and potentially for another level of threading to exploit hyper-threading of the cores. Analysis of the potential vector loops involves more than determining whether the compiler vectorizes the loop as written: in a majority of cases, loops that do not vectorize can be restructured so that they do. This is somewhat of a lost art, and much of the relevant guidance on vectorizing loops can be found in papers and books from the 1980s.
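As a simple, hypothetical illustration of that kind of restructuring (C, not taken from any particular application): a doubly nested loop whose inner index walks down the columns of a row-major array defeats profitable vectorization, while interchanging the loops makes the inner loop unit-stride, and most compilers will then vectorize it.

    /* Before: the inner loop strides through memory by "n" (a column walk in
       a row-major array), which the compiler will vectorize poorly or not at
       all.                                                                   */
    void scale_add_strided(double *a, const double *b, const double *c,
                           double s, int n)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                a[j * n + i] = b[j * n + i] + s * c[j * n + i];
    }

    /* After: interchanging the loops makes the inner loop unit-stride, so the
       compiler can generate contiguous vector loads and stores.              */
    void scale_add_unit_stride(double *a, const double *b, const double *c,
                               double s, int n)
    {
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
                a[j * n + i] = b[j * n + i] + s * c[j * n + i];
    }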

In Figure 2, we obtain the minimum, maximum and average trip counts of the loops. From this information we can quickly see that the loops within sweepy are excellent candidates for threading, and the lower-level loops are good candidates for vectorization. This characterization is a top-down approach, and it is paramount for moving to a multi-/many-core system.
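In rough, C-like pseudocode (the real VH-1 routines are Fortran, and the loop bounds and argument list here are invented), the top-down plan looks like this: the threading goes on sweepy's two loops, while ppmlr and the routines below it keep their low-level loops for vectorization.

    /* Structural sketch only.  The two loops in sweepy carry the
       coarse-grained threading; the low-level loops inside ppmlr and the
       routines it calls are left for vectorization.                          */
    void ppmlr(int i, int k);  /* calls remap, parabola, riemann, evolve, states */

    void sweepy(int nx, int nz)
    {
        #pragma omp parallel for collapse(2) schedule(static)
        for (int k = 0; k < nz; k++)
            for (int i = 0; i < nx; i++)
                ppmlr(i, k);
    }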


The Cray Compiler has a special symbol to indicate that a loop vectorizes:

[Image: mock compiler listing with a smiley-face vectorization marker]

…So whenever you get a smiley face with vector eyes, you know your loop vectorizes. (Actually this RFE is in the works; you may only get a V in the left column of the listing.)

Lastly, array management should be examined closely. In the next three to four years, if not sooner, memory hierarchies will require that data movement be minimized. This can be achieved only by organizing the memory in such a way that data can be kept local and/or prefetched prior to a major computational kernel.
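A common illustration of this kind of reorganization (generic, with invented field names) is converting an array of structures into a structure of arrays, so that the fields a kernel actually touches are contiguous, easy to prefetch, and unit-stride for the vector units:

    /* Array of structures: a kernel that needs only rho and p still drags
       u, v and w through the cache, and every access to one field is
       strided.                                                               */
    struct cell_aos { double rho, p, u, v, w; };

    /* Structure of arrays: each field is contiguous, so a kernel touching
       only rho and p streams exactly the data it needs, prefetches cleanly
       and vectorizes with unit-stride loads.                                 */
    #define NCELLS 100000
    struct cells_soa {
        double rho[NCELLS];
        double p[NCELLS];
        double u[NCELLS], v[NCELLS], w[NCELLS];
    };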

In my next post, I will discuss the basics of vectorization. The compiler applies rules to a loop to determine whether it can be vectorized, but some common programming practices defeat the compiler’s analysis, and those practices, more than anything else, are why most loops are not vectorized.

Comments

    ali says

    Good, but I am still not sure why we can’t design special MPI libraries that intelligently and efficiently handle communication between the cores on a node. Then the communication costs would be the same as with threads, and we wouldn’t need to change the code at all.

      says

      Great question.

      We already have those efficient libraries; that is the reason all-MPI has been working so well on nodes with 16-24 cores. There are other reasons why all-MPI will have issues going forward:
      1) First, all-MPI uses more memory, since all common data must be replicated, which is not the case with threading.
      2) All-MPI will not be portable, since you cannot run an all-MPI application on a system with an accelerator.
      3) All-MPI will overload the NIC (network interface controller), which has to keep track of all messages on and off the node to perform tag matching.
      4) MPI actually results in more memory transfers, even with efficient MPI libraries. Most users use MPI derived data types that move data to a buffer, and on the receiving end the data is unpacked. With threading you can use direct memory copies, thus reducing memory traffic.
      So the bottom line is that with our next generation of hybrid multi-/many-core systems, you are going to have to rewrite if you want performance.
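      To make point 4 concrete, here is a rough sketch with invented names: sending one strided column with an MPI derived datatype implies a pack on the sending side and an unpack on the receiving side, while a thread that shares the array can copy, or simply read, the column in place.

      #include <mpi.h>

      /* Invented names and sizes.  MPI path: one strided column, described
         by a derived datatype, is packed into a buffer on the send side and
         unpacked on the receive side, so the data makes extra trips through
         memory.                                                              */
      void send_column_mpi(const double *grid, int nrows, int ncols,
                           int j, int neighbor)
      {
          MPI_Datatype column;
          MPI_Type_vector(nrows, 1, ncols, MPI_DOUBLE, &column);
          MPI_Type_commit(&column);
          MPI_Send(&grid[j], 1, column, neighbor, 0, MPI_COMM_WORLD);
          MPI_Type_free(&column);
      }

      /* Threaded path: the "receiver" is a thread in the same address space,
         so it copies (or simply reads) the column in place, with no
         intermediate buffer and no pack/unpack step.                         */
      void copy_column_threaded(double *halo, const double *grid,
                                int nrows, int ncols, int j)
      {
          for (int i = 0; i < nrows; i++)
              halo[i] = grid[i * ncols + j];
      }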
