In a recent blog post, David Wallace discussed the importance of OpenACC and how it is impacting the industry as a whole. Here, I focus on the recently released OpenACC 2.0: its features and benefits, its primary purpose, and the discussions about the next version.
Why release a second version of the OpenACC specification?
Whenever something is done in a committee, compromises are made. In the case of OpenACC 1.0, most of the compromises were related to what the technical committee felt could reasonably be designed in the given timeframe. There were already four different mechanisms for programming accelerators provided by the initial four members of OpenACC. So the technical committee took the pieces that it liked from the two directive sets that looked the most alike, PGI’s accelerator directives and Cray’s OpenMP extensions for accelerators, and merged them into one cohesive document. In some cases, the committee decided to leave features out rather than confuse the reader with features that were not well thought out. In other cases, the committee had new ideas that it wanted to add but did not have time to finish. Finally, as with almost all specifications, there were bound to be errors in the document that had to be corrected. Those errors alone made at least a 1.1 release necessary.
Where did the additions to the spec come from?
As I mentioned earlier, there were several features that could not be completed for the initial release of the specification. Many of these features formed the core of the technical committee's initial working set for the next major release. Alongside that working set, all of the vendors were learning from using the directive set and from interacting with its users. Whenever a new feature was identified, it was added to the list of potential features for the next release.
What is the primary “purpose” of this release?
It is hard to say that there is just one primary purpose of the 2.0 release. Originally, the committee started working on a 1.1 release which was just a fix to the 1.0 release. The committee quickly discovered that there were several obvious additions that users needed very quickly. These additions fall into two primary categories – programming model “fixes” and user features.
If I had to name a single purpose for the release, it was to give programmers additional tools for writing codes that run well on the target architecture in a highly productive manner.
What was added to the spec?
Several features were added to the specification. Two very important features were driven by Nvidia® CUDA™ features. The first is support for separate compilation units. The other is nested, or dynamic, parallelism. Function calls have long been a cornerstone of any significant programming language. However, because of limitations in some accelerator programming models, function calls were not supported in the 1.0 specification. The expectation was that once compilers had exploited inlining to its utmost potential, implementations would be forced to deal with the lack of linkers for some targets. Some work was started on this path, and then suddenly the problem became less important, as all of the architectures that mattered, at least to the technical committee, suddenly had linkers. Dynamic parallelism is a cornerstone for parallel algorithms that work on unbalanced data sets. The initial release of OpenACC intentionally ignored this set of problems. In 2.0, the technical committee was pressed to rethink this position.
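As a minimal sketch of what function-call support looks like to the programmer, the example below marks a helper with the routine directive and calls it from an accelerated loop. The function names, the seq level of parallelism, and the copy clause are assumptions chosen for illustration, not taken from any particular code. In a real project the helper could sit in its own compilation unit; here both functions share one file so the sketch stays self-contained. A compiler without OpenACC support ignores the pragmas and runs the loop on the host.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper. The routine directive asks the compiler to also
// generate an accelerator version of this function so that it can be
// called from device code; "seq" says the body itself runs sequentially.
#pragma acc routine seq
float scale_and_shift(float x, float scale, float shift)
{
    return x * scale + shift;
}

// Applies the helper to every element of data[0:n] inside a parallel loop.
void transform(float* data, std::size_t n, float scale, float shift)
{
    assert(data != nullptr || n == 0);

    #pragma acc parallel loop copy(data[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = scale_and_shift(data[i], scale, shift);
    }
}
```

Without the routine directive, a 1.0-era compiler would have had to inline the helper into the loop body to compile this for the accelerator.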
Were there features that were not driven by an architecture?
Three more features come most readily to mind: unstructured data lifetimes, loop tiling, and enhanced asynchronous execution. Unstructured data lifetimes are critical for C++ programmers who use classes to “hide” data. When data is compartmentalized, it is not always possible to use structured data construct directives to move objects to the accelerator. Many programmers complained that they had to “expose” the internal data structures of their classes to use the existing data constructs. Unstructured data lifetimes allow programmers to place the directive that moves an object to the accelerator in a constructor, and the directive that moves the object off the accelerator in the destructor. This allows the programmer to move objects at a high level to exploit data locality without destroying the structure of the program.
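The constructor/destructor pattern described above can be sketched as follows, using the enter data and exit data directives. The Field class and its methods are hypothetical and exist only for illustration. The sketch refers to the buffer through local pointer copies in the data clauses, which is a conservative assumption about how compilers handle class members in clauses. Without OpenACC support, the pragmas are ignored and the class behaves like an ordinary host container.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical class that keeps its buffer private.
class Field {
public:
    explicit Field(std::size_t n) : n_(n), data_(new double[n]())
    {
        double* d = data_;
        // Begin the device lifetime of the buffer; no structured
        // data region around the object's uses is required.
        #pragma acc enter data create(d[0:n])
    }

    ~Field()
    {
        double* d = data_;
        const std::size_t n = n_;
        // End the device lifetime before freeing the host memory.
        #pragma acc exit data delete(d[0:n])
        delete[] data_;
    }

    // Copying is disabled to keep host and device lifetimes one-to-one.
    Field(const Field&) = delete;
    Field& operator=(const Field&) = delete;

    // Sets every element on the device copy of the buffer.
    void fill(double value)
    {
        double* d = data_;
        const std::size_t n = n_;
        #pragma acc parallel loop present(d[0:n])
        for (std::size_t i = 0; i < n; ++i) {
            d[i] = value;
        }
    }

    // Returns one element, refreshing it from the device first.
    double at(std::size_t i)
    {
        assert(i < n_);
        double* d = data_;
        #pragma acc update host(d[i:1])
        return d[i];
    }

private:
    std::size_t n_;
    double* data_;
};
```

Code that uses Field never sees the buffer or a data directive: constructing the object places it on the accelerator, and letting it go out of scope removes it.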
Loop tiling was added to the spec to provide the programmer with an accepted mechanism for exploiting complex block architectures. Very quickly after all three vendors released OpenACC-compliant compilers, it was discovered that some vendors allowed programmers to exploit multiple instances of the same level of parallelism. This was usually done at the gang level to exploit the multi-dimensional thread blocks available on Nvidia architectures. Unfortunately, this was an extension to the specification that not all vendors adopted. In discussions, the technical committee decided to adopt a natural ordering constraint: only one level each of gang, worker, and vector parallelism may be expressed. To provide the power of the multi-dimensional thread blocks, the tile clause was added to loops so that compilers could either exploit multi-dimensional thread blocks or utilize cache blocking and vectorization on loop nests in “new,” more powerful ways.
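A minimal sketch of the tile clause on a tightly nested pair of loops is shown below. The transpose function and the 32 by 32 tile size are assumptions picked for illustration; a suitable tile size depends on the target. How the tiles are mapped (thread blocks, cache blocks, or something else) is left to the compiler, and a compiler without OpenACC support simply runs the plain loop nest.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical example: out (cols x rows) receives the transpose of
// in (rows x cols); both are stored in row-major order.
void transpose(const float* in, float* out, std::size_t rows, std::size_t cols)
{
    assert(in != out);

    // tile(32, 32) asks the compiler to split the two tightly nested
    // loops below into 32 x 32 blocks of iterations.
    #pragma acc parallel loop tile(32, 32) copyin(in[0:rows*cols]) copyout(out[0:rows*cols])
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            out[j * rows + i] = in[i * cols + j];
        }
    }
}
```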
Asynchronous execution between the host and the accelerator was always seen as important. However, while porting several codes, it was found that a few very minor additions to the specification could extract the last bit of performance that the directive version of a code needed to reach the desired performance relative to CUDA versions of the same code.
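For readers unfamiliar with asynchronous execution in OpenACC, the general pattern looks roughly like the sketch below. It shows only the basic async clause and wait directive, not the specific 2.0 additions, which this post does not enumerate. The function name and the queue numbers 1 and 2 are assumptions for illustration. Without OpenACC support the pragmas are ignored and the two loops run one after the other on the host.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical example: scales two independent arrays on two
// asynchronous queues so that their transfers and kernels may overlap.
void scale_both(float* a, float* b, std::size_t n, float factor)
{
    assert(a != b);

    // Work on queue 1; the host continues without waiting.
    #pragma acc parallel loop async(1) copy(a[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        a[i] *= factor;
    }

    // Independent work on queue 2.
    #pragma acc parallel loop async(2) copy(b[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        b[i] *= factor;
    }

    // Block the host until all outstanding asynchronous work is done.
    #pragma acc wait
}
```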
Will there be another version?
The technical committee has already identified several features that it is working on for the next release. The largest feature, driven primarily by Cray, is “deep copy”: the movement of complex pointer-based arrays and structures. It has been identified as an important feature and will likely dictate when the next major release is available. Due to the size and complexity of this feature, it may take a significant amount of time to complete the design work.
When will the next version be released?
A maintenance release was actually released several months ago. This release was numbered 2.0a and contained only clarifications and corrections to the specification. Features for the next major release are currently being discussed; when they are ready for release is for the technical committee to decide. However, the underlying theme for the next release will be productivity and ease of use.
James Beyer, Cray Compiler Engineer