In 2003 Cray signed a contract with Oak Ridge National Laboratory (ORNL) for the installation of a Cray® X1™ supercomputer. This contract led to the creation of a program that continues to support Cray research and development worldwide.
The agreement with ORNL included funding for a group of Cray experts to help researchers in the U.S. Department of Energy’s (DOE) Office of Science port and optimize their applications for the new system. This group was called the Cray Supercomputing Center of Excellence (COE).
The mission of the Center of Excellence was multifaceted:
- Assist the DOE’s Office of Science researchers in porting their applications from their existing IBM system.
- Train the researchers and members of ORNL’s Scientific Computing group in how best to utilize the system.
- Refactor important applications to improve their performance.
- Communicate DOE requirements to Cray R&D to ensure that future software releases satisfy ORNL’s requirements.
- Give the DOE early experience on future generations of systems by porting and optimizing their applications on early in-house prototypes.
The COE was extremely successful, and researchers used the initial Cray X1 installation to solve many of the DOE’s most challenging scientific problems. For the next 10 years, Cray’s Center of Excellence at ORNL assisted the DOE Office of Science researchers in moving from the Cray X1 through the Cray® XT™ series of systems. The upgrades culminated in the “Jaguar” system, at the time the world’s fastest supercomputer, and “Titan,” a Cray® XK7™, the first massively parallel system to employ graphics processing units. It too had a turn as the world’s fastest supercomputer.
ORNL earned renown as a premier leadership computing facility. Along the way, members of Cray’s COE at Oak Ridge contributed to two projects that earned the Gordon Bell Prize, which recognizes outstanding achievement in HPC.
Centers of Excellence expand
The success of this first effort at providing Cray experts at customer sites resulted in the creation of COEs in Edinburgh, Scotland, and in Seoul, South Korea. Those were joined by COEs at the National Energy Research Scientific Computing Center (NERSC) in Berkeley, Calif., and at Los Alamos National Laboratory in New Mexico. COEs at research labs such as these provide Cray expertise to a wider range of the HPC community.
The Cray Centers of Excellence provide an important exchange of expertise between customers and the company’s R&D team. One example comes from work our COE team performed with ORNL researchers on the Parallel Ocean Program (POP). An ORNL researcher noticed that the application was not scaling as well as desired and that the poor scaling was due to load imbalance in a computation that should have been balanced. After much testing and discussion with Cray engineers, we found that the operating system was causing delays in some processing elements. We then used this one application to identify several operating system processes that were stealing cycles from the application and thus introducing load imbalance. This work led to the elimination of several OS processes, producing an operating system that gave the application priority for using the processors.
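To illustrate why a few stolen cycles matter at scale, here is a minimal sketch of a bulk-synchronous computation in which every rank has identical work but one rank per step is delayed. It is not POP and not Cray's diagnostic method; the function names, the round-robin delay pattern, and the numbers are all assumptions chosen for illustration.

```python
def step_time(rank_times):
    """In a bulk-synchronous step, every rank waits for the slowest one."""
    return max(rank_times)


def imbalance(rank_times):
    """Load imbalance as max/mean - 1 (0.0 means perfectly balanced)."""
    mean = sum(rank_times) / len(rank_times)
    return max(rank_times) / mean - 1.0


def simulate(num_ranks, num_steps, work, delay):
    """Total run time when each rank does `work` seconds per step and one
    rank per step (round-robin, a hypothetical pattern) loses `delay`
    seconds to an operating system process."""
    total = 0.0
    for step in range(num_steps):
        times = [work] * num_ranks
        times[step % num_ranks] += delay  # only one rank is disturbed
        total += step_time(times)         # but every rank pays for it
    return total


ideal = simulate(num_ranks=1024, num_steps=100, work=1.0, delay=0.0)
noisy = simulate(num_ranks=1024, num_steps=100, work=1.0, delay=0.1)
print(f"ideal: {ideal:.1f} s, with OS delays: {noisy:.1f} s")
```

In this toy model each rank is delayed in at most one step out of 100, yet the whole run slows by the full delay on every step, which is why the computation looks imbalanced even though the work itself is evenly divided.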
Prior to the delivery of the Jaguar system, members of the COE employed a new approach for optimizing the performance of HPL (High-Performance Linpack), the benchmark program used for submitting systems to the TOP500 list. Auto-tuning was used to arrive at the best matrix-blocking factors and runtime parameters for the HPL run. This effort achieved an extremely high percentage of peak on the largest x86 massively parallel system of the day. Because the matrix was so large, the total run took 27 wall-clock hours.
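The general idea behind auto-tuning a blocking factor can be sketched as an empirical search: run the same kernel with each candidate block size, time it, and keep the fastest. The example below is a toy under stated assumptions, not HPL and not the COE's tuner. It uses a pure-Python blocked matrix multiply as a stand-in kernel, and `blocked_matmul`, `autotune`, and the candidate list are hypothetical names and values.

```python
import time


def blocked_matmul(a, b, n, nb):
    """Multiply two n x n matrices (lists of lists) using nb x nb blocks."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for kk in range(0, n, nb):
            for jj in range(0, n, nb):
                for i in range(ii, min(ii + nb, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + nb, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + nb, n)):
                            row_c[j] += aik * row_b[j]
    return c


def autotune(n, candidates, repeats=3):
    """Time each candidate block size; return the fastest and all timings."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for nb in candidates:
        best = float("inf")
        for _ in range(repeats):  # keep the best of several runs to damp noise
            start = time.perf_counter()
            blocked_matmul(a, b, n, nb)
            best = min(best, time.perf_counter() - start)
        timings[nb] = best
    return min(timings, key=timings.get), timings


best_nb, timings = autotune(n=32, candidates=[4, 8, 16, 32])
print("fastest block size on this machine:", best_nb)
```

The winning block size depends on the machine and the problem size, which is the point of measuring rather than guessing. A production tuner would search several parameters together and use short trial runs, since a full-scale run of the kind described above is far too expensive to repeat.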
In another project, the COE refactored S3D, a combustion code and one of ORNL’s principal applications, to run on the new Titan system using the newly released OpenACC compiler. The work, conducted over two years prior to the Titan installation, allowed Cray’s developers to test and debug the recently developed tools and the new OpenACC implementation. Thanks to the COE’s access to a real-world OpenACC application, Cray’s programming environment for Titan was production quality on delivery.
Thanks to partnerships like these with our customers, Cray’s COEs continue to inform our work in advancing HPC technology for all users.