Vectorizable Vs. Vectorizability: The Big Difference

Is your application vectorizable? Today, users generally answer that question by compiling with and without vectorization enabled and then comparing the runtimes. While this is easy to do, it is not sufficient, since it only answers the question: “Is my application vectorizable with the current compiler?” The question that should be addressed is whether the important looping structures in the application can be restructured so that they vectorize. In other words: “What is the vectorizability of the application?”

In fact, we will not be dealing with the traditional vector instructions that existed on the Cray systems delivered in the ’80s. What we have today are functional units that look more like the even earlier Illiac IV. How many of you remember that supercomputer?

The Illiac IV had a single instruction, multiple data (SIMD) functional unit. (An interesting note: for some time the Illiac would not work if the door to one of the processor units was closed. Once, during a DARPA review, someone closed the door and the machine crashed.) There was a control unit that decoded and issued each instruction, and then there were 64 processors that all had to be doing exactly the same thing at the same time. The vector units in the early Crays were pipelined units, where the time to update a set of data n long was:

            Time for vector = startup time (length of pipeline) + n * result rate

While the time for the Illiac’s parallel instruction was:

                        Time for SIMD = ⌈n/64⌉ * the time to do one operation

If n is 1 to 64, one operation must be done. If n is 65 to 128, two operations must be performed. The SIMD functional units on the NVIDIA® accelerator, Intel® Haswell and Knights Landing are of length 32, four and eight, respectively. With the advent of Broadwell, the SIMD length of the Xeon® nodes will be eight. So within the next year the minimum performance enhancement from vectorization will be a factor of eight.
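The difference between the two timing models can be sketched in a few lines of code (Python, purely for illustration; the startup time, result rate and operation time are invented parameters, not measurements of any real machine):

```python
import math

def pipeline_time(n, startup, result_rate):
    """Cray-style pipelined vector unit: pay the pipeline startup once,
    then produce one result every result_rate."""
    return startup + n * result_rate

def simd_time(n, width, op_time):
    """Illiac-style SIMD unit: every group of `width` elements costs one
    full operation, so lengths 64 and 65 behave very differently."""
    return math.ceil(n / width) * op_time

# With a width of 64, lengths 1..64 take one operation, 65..128 take two.
print(simd_time(64, 64, 1.0))   # 1.0
print(simd_time(65, 64, 1.0))   # 2.0
```

The ceiling in the SIMD model is the whole story: a single extra element past a multiple of the SIMD width costs one more full operation.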

Can application developers afford to throw away this amount of performance? Short answer: No. You need to investigate the vectorizability of your application, which may lead you to use vector instead of SIMD because, well, it’s hard to say SIMDization.

The translator between the input program and the SIMD units on the target architecture is the compiler. As I like to say, “Never, ever trust the compiler to do the right thing.”

At a conference last month, a speaker said, “I expect the compiler can inline the function calls and generate vector instructions for my while loop.”

While the rules for vectorizing a looping structure are simple, delving into the details of the code within a loop is extremely difficult. There are so many edge conditions under which vector instructions may generate incorrect answers that the compiler must be certain vectorizing the loop will produce correct results.

The situation is even worse for C and C++ because of the extensive use of pointers in those languages. In Fortran, the compiler is allowed to assume that dummy arguments do not overlap; in C and C++, it must assume that pointer references may overlap unless the programmer says otherwise (for example, with the C99 restrict qualifier). Potential overlap of pointer references can result in loop-carried dependencies, the No. 1 inhibitor of vectorization in a loop.
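A toy sketch of why overlap matters, using Python lists as stand-ins for C pointers (the copy loop itself is hypothetical):

```python
def shift_right(dst, src, n):
    # dst[i+1] = src[i] for each i: harmless when dst and src are separate
    # arrays, but a loop-carried dependency when they are the same memory.
    for i in range(n):
        dst[i + 1] = src[i]

a = [1, 2, 3, 4, 5]
b = [0] * 6
shift_right(b, a, 5)      # separate arrays: b becomes [0, 1, 2, 3, 4, 5]

c = [1, 2, 3, 4, 5, 6]
shift_right(c, c, 5)      # dst and src alias: each write feeds the next
                          # read, and c becomes [1, 1, 1, 1, 1, 1]
```

Because a C compiler generally cannot prove that `dst` and `src` point to different memory, it must assume the second, dependent behavior and refuse to vectorize.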

Loop-carried dependencies are easiest to explain with an example:

DO I = 2, N-1
   A(I) = SQRT(A(I+K)**2 + B(I+K)**2)
ENDDO

If K is less than zero, this loop has a loop-carried dependency. If K is greater than or equal to zero, it does not, and the loop can be vectorized. We refer to I+K as an ambiguous subscript. There may even be an expression earlier in the routine that sets K, and it may be obvious to the user that K must be positive; however, the compiler may still reject the loop. The traditional approach to forcing the compiler to vectorize the loop is to place a !DIR$ IVDEP directive immediately before it.
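The effect of the sign of K can be simulated directly. The sketch below (Python used as illustrative pseudocode for the Fortran loop above) runs the loop twice: once element by element, and once with SIMD-style semantics in which all right-hand sides are read before any element is written. The two agree when K is positive and disagree when K is negative, which is exactly the dependency the compiler worries about:

```python
import math

def run_sequential(A, B, K):
    """Scalar semantics: each iteration may see values written by earlier ones."""
    A = A[:]                        # work on a copy
    for i in range(1, len(A) - 1):  # Fortran's DO I = 2, N-1, zero-based
        A[i] = math.sqrt(A[i + K] ** 2 + B[i + K] ** 2)
    return A

def run_simd(A, B, K):
    """Vector semantics: all right-hand sides are read before any write."""
    new = A[:]
    for i in range(1, len(A) - 1):
        new[i] = math.sqrt(A[i + K] ** 2 + B[i + K] ** 2)
    return new

A = [3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
B = [1.0] * 6

print(run_sequential(A, B, 1) == run_simd(A, B, 1))    # True: no dependency
print(run_sequential(A, B, -1) == run_simd(A, B, -1))  # False: loop-carried
```

With K = -1, each iteration reads an element the previous iteration just wrote, so reading everything up front changes the answer; with K = +1, every read is of an element the loop has not yet touched.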

While detecting loop-carried dependencies is the central task confronting the compiler, numerous other complications within a loop can inhibit the compiler from vectorizing it, such as:

  • Subroutine and/or function calls within the loop
  • Complex decisions that inhibit the compiler from performing dependency analysis
  • Indirect addressing and striding that could result in poor performance
  • Early EXIT or CYCLE from a loop
  • Nested loops, where the compiler must determine which of the loops to vectorize
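To pick one item from the list, an early EXIT can often be removed by splitting the loop in two: a search loop that establishes the trip count, followed by a countable compute loop the compiler can vectorize. A sketch of the transformation (the data and the cutoff condition are invented for illustration):

```python
import math

data = [4.0, 9.0, 16.0, -1.0, 25.0]

# Original form: compute and exit in one loop; the early exit (EXIT/break)
# inhibits vectorization because the trip count is unknown in advance.
out1 = []
for x in data:
    if x < 0:
        break
    out1.append(math.sqrt(x))

# Rewritten: establish the trip count first, then run a countable loop.
stop = next((i for i, x in enumerate(data) if x < 0), len(data))
out2 = [math.sqrt(x) for x in data[:stop]]

print(out1 == out2)   # True: same answer, but the second loop is countable
```

The search loop is still not vectorizable, but it does almost no work; the expensive compute loop now has a known extent and no exit.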

The challenge now facing the application developer is to write looping structures in such a way that the compiler does the right thing. That speaker at the conference probably felt that the compiler’s ability to vectorize looping structures must be so advanced that nothing should inhibit a loop from being vectorized.

Unfortunately, there are limits on what the compiler can do, even with good ol’ Fortran 77. And the latest language extensions in Fortran and C++ make the compiler’s vectorization task even more difficult.

Let’s re-examine the speaker’s quote in a little more detail:

“I expect the compiler can inline the function calls and generate vector instructions for my while loop.”

Let’s be frank. The compiler does not automatically inline subroutines and functions. There are flags the user can put on the compile command to force inlining; however, this is a very heavy hammer. The second troublesome part of the quote is the while loop, which implies an undetermined exit from the loop. This is an issue, since some compilers need to know the extent of the loop prior to executing it.
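The while-loop half of the complaint can often be addressed the same way as an early exit: establish the loop extent before the work loop runs. A small sketch (the sentinel-terminated data, in the spirit of a C string, is hypothetical):

```python
data = [5, 7, 2, 0, 9, 9]   # 0 is the sentinel; values after it are ignored

# While form: the trip count is unknown until the loop has already run.
total1, i = 0, 0
while data[i] != 0:
    total1 += data[i]
    i += 1

# Counted form: find the extent first, then run a countable loop.
n = data.index(0)           # assumes the sentinel is present
total2 = sum(data[:n])

print(total1 == total2)     # True
```

The scan for the sentinel is cheap; once the extent is known, the summation loop is a counted loop the compiler can reason about and vectorize.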

As I recommended to the speaker at the conference, application developers should look at vectorization as they do MPI and/or threading on the node. You wouldn’t want the compiler to automatically identify when and where to implement MPI, and I hope that you wouldn’t expect the compiler to automatically generate good threading on the node.

In my next post, I will examine some examples of important looping structures that defy compiler vectorization analysis and how they may be reposed to allow the compiler to generate vector instructions. Stay tuned, and feel free to contact me with any questions or suggestions for future blog posts: Email me at [email protected] or visit www.hybridmulticore.com.

 

Comments


    Hi John,
    Glad to see your reference to the Illiac IV.

    A few others and I are attempting to recreate some of these old systems in software as modern emulators. So far we have the Burroughs B205, the B5500 and a partial B6700 functioning as emulators.

    I am hoping we can do the same for the Illiac IV.

    The main thing we would like to find though is software, either the B6700 support software for the Illiac or examples of software written for it – CFD routines or similar libraries.

    For the other machines we have had variable success in locating software, sometimes resorting to re-typing listings, but in the case of the B5500 one of us managed to locate and recover a complete set of 7-track system tapes.

    If you have or know of anyone with listings, tapes, photos or other information it would be much appreciated.

    thanks,
    nigel.

    ASK – assembler for the Illiac IV on the Burroughs B5500 and later B6700
    SSK – simulator for the Illiac IV on the Burroughs B5500 and later B6700
    CFD – a FORTRAN like language targeted at CFD
    Glypnir – ALGOL like language, likely not used at Ames
    VECTORAL – vector processing language used at Ames

    People contacted so far:
    John Day
    David Grothe
    jody Kravitz
    Gary Grossman
    Gary Feierbach
    April Gage
    Bob Rogallo
    Alan Wray


    Great to hear from you – it’s been 15 years. Going forward, the hardware will have better features to vectorize IF statements, and there will definitely be cases where performance gains can be obtained. If the IF is really sparse there could be a problem, as you indicate. It really comes down to the problem being addressed, which only the user knows.

    Zaphiris Christidis says

    Hey John, I see that you are still going strong. I could not agree with you more on your statements above. Compilers and tools have come a long way, but the application developers bear the responsibility to write code that can be easily digested by the compiler. But there are still difficulties, as we all know. I would not expect the compiler to understand and vectorize do-loops with conditionals. Too many factors are at play. The compiler has a picture of the code structure before execution, but it cannot foresee the different scenarios (or break-even points) in performance characteristics after execution. If things were that simple, there would not be any folks developing high-performance libraries.
