Is your application vectorizable? Today, users generally answer that question by compiling with and without vectorization turned on and then comparing the runtimes. While this is easy to do, it only answers a narrower question: “Is my application vectorizable with the current compiler?” The question that should be addressed is whether the important looping structures in the application can be restructured so that they vectorize. In other words, “What is the vectorizability of the application?”
In fact, we will not be dealing with the traditional vector instruction that existed on Cray systems delivered in the ’80s. What we have today are functional units that look more like the even earlier Illiac IV — how many of you remember that supercomputer?
The Illiac IV had a single instruction, multiple data (SIMD) functional unit. (An interesting note: for some time the Illiac would not work if the door to one of the processor units was closed. Once, during a DARPA review, someone closed the door and the machine crashed.) There was a control unit that decoded and issued each instruction, and there were 64 processors that all had to be doing exactly the same thing at the same time. The vector units in the early Crays were pipelined units, where the time to update a set of data n long was
Time for vector = startup time (length of pipeline) + n * result rate
while the time for the Illiac’s parallel instruction was:
Time for SIMD = ceiling(n/64) * time to do the operation
If n is 1 to 64, one operation must be performed. If n is 65 to 128, two operations must be performed. The SIMD functional units on the NVIDIA® accelerator, Intel® Haswell and Knights Landing are of length 32, four and eight, respectively. With the advent of Broadwell, the SIMD length of the Xeon® nodes will be eight. So within the next year the minimum performance enhancement from vectorization will be a factor of eight.
Can application developers afford to throw away this amount of performance? Short answer: No. You need to investigate the vectorizability of your application, which may lead you to use vector instead of SIMD because, well, it’s hard to say SIMDization.
The translator between the input program and the SIMD units on the target architecture is the compiler. As I like to say, “Never, ever trust the compiler to do the right thing.”
At a conference last month, a speaker said, “I expect the compiler can inline the function calls and generate vector instructions for my while loop.”
While the rules for vectorization of a looping structure are simple, delving into the details of the coding within a loop is extremely difficult. With so many edge conditions where vector instructions may generate incorrect answers, the compiler must be certain that vectorizing the loop will generate correct answers.
The situation is even worse for C and C++ because of the extensive use of pointers in those languages. In Fortran the compiler assumes that array references through different arguments do not overlap; in C and C++ it must assume that pointers might. Potential overlap of pointer references can result in loop-carried dependencies — the No. 1 inhibitor of vectorization in a loop.
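In C99 and later, the programmer can make the no-overlap promise explicitly with the restrict qualifier. A minimal sketch, with a hypothetical function name:

```c
#include <stddef.h>

/* Without restrict, the compiler must assume a, b, and c may overlap,
 * which creates a potential loop-carried dependency and can block
 * vectorization. With restrict (C99), the programmer promises the
 * three arrays are disjoint, so the loop is free to be vectorized. */
void add_arrays(size_t n, double *restrict c,
                const double *restrict a, const double *restrict b)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```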
Explaining loop-carried dependencies can be simplified with the following example:
DO I = 2, N-1
   A(I) = SQRT(A(I+K)**2 + B(I+K)**2)
ENDDO
If K is less than 0 this loop has loop-carried dependencies. If it is greater than or equal to 1, it does not have loop-carried dependencies and can be vectorized. We refer to I+K as an ambiguous subscript. Earlier in the routine there may even be an expression setting K, and it may be obvious to the user that it should be positive; however, the compiler may still reject the loop. The traditional approach to forcing the compiler to vectorize the loop is to place a !DIR$ ivdep prior to the loop.
While the detection of loop-carried dependencies is the task confronting the compiler, there can be numerous complications in the loop that inhibit the compiler from vectorizing the loop, such as:
- Subroutine and/or function calls within the loop
- Complex decisions that inhibit the compiler from performing dependency analysis
- Indirect addressing and striding that could result in poor performance
- Early EXIT or CYCLE from a loop
- In a multiple loop nest, the compiler needs to determine which of the loops to vectorize
The challenge now facing the application developer is to write looping structures in such a way that the compiler does the right thing. That speaker at the conference probably felt that the compiler’s ability to vectorize looping structures must be so advanced that nothing should inhibit a loop from being vectorized.
Unfortunately, there are limits on what the compiler can do even with good ol’ Fortran 77. And the latest language extensions in Fortran and C++ make the compiler’s vectorization task even more difficult.
Let’s re-examine the speaker’s quote in a little more detail:
“I expect the compiler can inline the function calls and generate vector instructions for my while loop.”
Let’s be frank: the compiler does not automatically inline subroutines and functions. There are flags the user can add to the compile command to force inlining; however, this is a very heavy hammer. The second troublesome part of the quote is the while loop, which implies an undetermined exit from the loop. This is an issue, since some compilers need to understand the extent of the loop prior to executing it.
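The while-loop problem, and one way to restructure around it, can be sketched in C (both routine names are hypothetical, and whether the second form actually vectorizes depends on the compiler):

```c
/* A while loop whose exit test depends on the data has a trip count the
 * compiler cannot determine before the loop runs, which inhibits
 * vectorization on compilers that need the loop extent up front.
 * Returns the index of the first element >= limit, or n if none. */
int first_at_or_above_while(const double *x, int n, double limit)
{
    int i = 0;
    while (i < n && x[i] < limit)   /* data-dependent trip count */
        i++;
    return i;
}

/* Countable restructuring: the trip count is n, known on entry, so the
 * search becomes a fixed-length scan with no early exit -- a form a
 * vectorizing compiler has a much better chance with. */
int first_at_or_above_counted(const double *x, int n, double limit)
{
    int first = n;
    for (int i = n - 1; i >= 0; i--)   /* fixed trip count, no EXIT */
        if (x[i] >= limit)
            first = i;                 /* last hit scanning backward
                                          is the first hit overall */
    return first;
}
```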
As I recommended to the speaker at the conference, application developers should look at vectorization as they do MPI and/or threading on the node. You wouldn’t want the compiler to automatically identify when and where to implement MPI, and I hope that you wouldn’t expect the compiler to automatically generate good threading on the node.
In my next post, I will examine some examples of important looping structures that defy compiler vectorization analysis and how they may be reposed to allow the compiler to generate vector instructions. Stay tuned, and feel free to contact me with any questions or suggestions for future blog posts: Email me at [email protected] or visit www.hybridmulticore.com.