A few months back I wrote a blog post, “Extreme Scaling in CAE Applications,” which showed results for CFD applications scaling to over 10,000 cores. This post focuses on the scaling performance of explicit structural analysis applications (often referred to as “crash simulation” applications). Scaling of the explicit structural codes has received increased emphasis and has improved significantly. Because these codes are important across several industries, this work merits a separate post.
The most popular explicit structural analysis codes are Abaqus/Explicit, LS-DYNA®, PAM-CRASH®, and RADIOSS®. These applications are heavily used in the automotive industry for crash and safety analysis. They are also important in the aerospace industry (e.g., bird strike), the consumer products industry (e.g., “drop test”), the materials industry (e.g., metal forming), and across the defense segment. Indeed, the explicit codes are by far the largest consumers of high-performance computing (HPC) cycles in the computer-aided engineering (CAE) segment.
The explicit codes first became widely used in the mid-1980s on Cray vector computers, when they were shown to be a highly effective solution for crash simulation (hence the label “crash codes”). It’s worth reviewing the use of explicit codes in the automotive industry to understand their importance. The dramatic growth of HPC in the automotive industry has been driven primarily by the growth in crash and safety simulation. Initially, the logic was that computer simulation would reduce the number of physical crash tests and hence justify the investment in computing. Crash and safety simulation has certainly produced a huge positive ROI.
However, I’d say the main benefit is that crash and safety simulation has enabled auto companies to meet increasingly demanding safety requirements and to produce much safer vehicles that save thousands of lives every year. But there continue to be new and increasingly complex crash test requirements, combined with pressure to reduce vehicle weight and compete in a worldwide market. Crash and safety simulation is now a core technology for the auto companies, and HPC requirements continue to grow both in capacity, to support larger workloads, and in capability, to run the most computationally challenging simulations. Meeting the growing demand for improved HPC performance of explicit applications requires improved parallel scaling.
Performance of Explicit Structural Analysis Applications
Explicit structural codes are computationally demanding. Thus, there is a constant push to improve their performance, which requires the codes to fully leverage the HPC hardware architecture. In 1986, DYNA3D® was a practical solution for impact analysis because it was well written for the contemporary vector architectures. Developers of explicit codes were quick to implement shared memory parallel (SMP) technology. The move to a distributed parallel programming model proved more challenging, but eventually running the MPI version of the explicit codes became the norm for large simulations, and hence distributed memory systems (e.g., clusters) became the standard architecture for crash simulation.
Although the MPI version has become pervasive over the past 10 years, the number of cores per simulation has been relatively slow to increase. One reason for this slow growth in MPI scaling is that faster processor frequencies provided increased performance without having to use more cores. Another reason is that explicit code scaling was perceived to be poor past a few dozen cores. Now that processor frequencies have plateaued, there is strong interest in scaling to several hundred and even thousands of cores. Driven by this interest and by the demand for larger and more complex simulations, the explicit code developers have stepped up their efforts to improve scaling, and recent performance testing is producing impressive results.
Latest Performance Results
The initial MPI implementation of the crash codes gave a significant improvement in parallel performance. The SMP version would typically scale well to four- to eight-way parallel, but the MPI version would go to dozens of cores, providing a large boost in performance over the SMP option. To scale to hundreds of cores, Amdahl’s Law requires that well over 99 percent of the computations scale efficiently. Explicit applications have grown into very large, general purpose programs with thousands of simulation options. Thus, it has been an ongoing effort over the years to improve the scaling throughout the application. This has been especially challenging in the “contact” area, but the complexity of the codes in general makes scaling to thousands of cores a project that requires a team effort.
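To make the Amdahl’s Law point concrete, here is a minimal Python sketch of the standard formula, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that scales and n is the core count. The function names and the sample values of p and n are illustrative assumptions, not measurements from any of the codes discussed here, and the formula ignores communication cost, so real results would be lower.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup on `cores` cores when `parallel_fraction` of the work scales perfectly."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def required_parallel_fraction(target_speedup: float, cores: int) -> float:
    """Parallel fraction needed to reach `target_speedup` on `cores` cores.

    Obtained by solving 1 / ((1 - p) + p / n) = S for p.
    """
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / cores)


if __name__ == "__main__":
    # Illustrative values only (assumptions, not benchmark data).
    for p in (0.95, 0.99, 0.999):
        for n in (64, 512, 4096):
            s = amdahl_speedup(p, n)
            print(f"p={p:.3f} cores={n:5d} speedup={s:8.1f} efficiency={s / n:6.1%}")
```

With 99 percent of the work scaling, the speedup can never exceed 100 no matter how many cores are used, which is why scaling efficiently to hundreds of cores needs a parallel fraction well above 99 percent.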
Recently, Cray has been working with Livermore Software Technology Corporation (LSTC) to test and improve the scaling performance of the LS-DYNA program on thousands of cores. The goal is not to scale an artificial test case but rather to run real production models 10 to 100 times faster. This requires running large, complex simulations and profiling the performance on thousands of cores. This process identifies the parallel bottlenecks and highlights which parts of the application need to improve. This cooperative effort between LSTC and Cray has produced impressive results.
The following figure shows the performance for a crash simulation model scaling to 4,096 cores on the Cray® XC30™ system with the Aries interconnect and 2.6 GHz Intel® “Sandy Bridge” processors (ref. “LS-DYNA Scalability on Cray Supercomputers”, Jason Wang, LSTC and Ting-Ting Zhu, Cray, presented at a 2013 LS-DYNA conference). The graph shows that crash simulation can scale to 4,000 or more cores using the Hybrid parallel option (i.e., combined MPI and SMP parallelism). Considering that most crash simulations use fewer than 100 cores, this represents significant potential for improved performance and, ultimately, for better product designs.
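As a rough illustration of what “combined MPI and SMP” means in terms of job layout, the sketch below lists the ways a fixed core count can be split into MPI ranks times SMP threads per rank. The helper name `hybrid_layouts` is hypothetical, and the value of 16 cores per node is an assumption chosen for illustration; neither comes from the referenced paper, and this is not how LS-DYNA itself selects a layout.

```python
def hybrid_layouts(total_cores: int, cores_per_node: int) -> list[tuple[int, int]]:
    """Return (mpi_ranks, threads_per_rank) pairs that use exactly `total_cores`.

    Threads of one rank share memory, so each rank must fit inside a single
    node and ranks must pack a node evenly.
    """
    layouts = []
    for threads in range(1, cores_per_node + 1):
        fits_node = cores_per_node % threads == 0
        fills_job = total_cores % threads == 0
        if fits_node and fills_job:
            layouts.append((total_cores // threads, threads))
    return layouts


if __name__ == "__main__":
    # 16 cores per node is an assumed value for illustration only.
    for ranks, threads in hybrid_layouts(4096, 16):
        print(f"{ranks:5d} MPI ranks x {threads:2d} SMP threads = {ranks * threads} cores")
```

Moving toward more threads per rank reduces the number of MPI ranks that have to exchange data, which is one common motivation for a hybrid scheme at large core counts; the best split for a given model has to be found by testing.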
A second example is from the aerospace industry. This is a work in progress, and at this point I can’t give details on the model. However, I can say that it is a large model with tens of millions of elements and that it includes contact. Below are some preliminary results for scaling to over 11,000 cores.
Cray is looking to continue this effort for a broad range of simulations and all of the widely used explicit applications. Again, the goal is to make scaling to thousands of cores the norm for large production impact simulations. Cray is reaching out to the automotive and aerospace communities to get additional models for testing at large core counts. This cooperation between the application developers, key users and Cray is a proven method for enhancing performance and driving simulation to the next level.