I always enjoy working with our oil and gas customers. Like many of our research and scientific customers, they face computational problems that are challenging to solve and that directly impact their business. Cray works closely with these customers to advance full-waveform inversion (FWI), reverse time migration (RTM) and other seismic workloads.
Substantial challenges loom for this sector, as problem sizes, resolution requirements and business pressures to reduce computing costs all continue to increase. The historical answer to seismic computing challenges has been raw improvement in the hardware subsystems; however, we are at a point where several conventional performance indicators are stagnating or declining:
- Single core/thread performance is no longer improving.
- The ratio of memory bandwidth to compute (bytes per flop) is not improving.
- Memory capacity per core is shrinking, especially for general-purpose graphics processing units (GPGPUs) and coprocessors.
- Interconnect performance ratios are flattening out.
To address these challenges, suppliers like Intel and NVIDIA are relentlessly increasing the parallelism of their parts: more cores on conventional microprocessors and more threads on accelerators and coprocessors. And as a system supplier, Cray delivers performance increases by putting more accelerators and coprocessors in its systems.
This poses a significant challenge for developers. It means that “time to solution” improvements will likely come through increased application parallelism: task parallelism in distributed memory to scale an application across multiple coprocessors or multiple compute nodes; task parallelism in shared memory to utilize more than 200 logical cores; and data parallelism to employ the 512-bit vector units found in parts like the Intel® Xeon Phi™ coprocessor.
One of the challenges a developer will face is the orchestration of data movement, on and off an accelerator or coprocessor and between nodes. However, by taking advantage of different types of parallelism — Message-Passing Interface between nodes, threading on a node, SIMDization on an accelerator and streaming on the host and the accelerator — it should be possible to overlap computation with data movement.
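The overlap idea can be sketched as a double-buffered pipeline: while chunk i is being computed, chunk i+1 is already being transferred. The sketch below is illustrative only; `transfer` and `compute` are hypothetical stand-ins for a host-to-device copy and a kernel launch, not a real accelerator API.

```python
# Hypothetical sketch of overlapping data movement with computation via
# double buffering. "transfer" and "compute" stand in for a host-to-device
# copy and a kernel; the names are illustrative, not a real API.
from concurrent.futures import ThreadPoolExecutor

def transfer(chunk):
    # Stand-in for a host-to-accelerator copy (e.g., over PCIe).
    return list(chunk)

def compute(buf):
    # Stand-in for a kernel: here, the sum of squares of the chunk.
    return sum(x * x for x in buf)

def pipelined(chunks):
    """Overlap the transfer of chunk i+1 with the computation of chunk i."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(transfer, chunks[0])  # prefetch the first chunk
        for i in range(len(chunks)):
            buf = pending.result()                    # wait for the in-flight copy
            if i + 1 < len(chunks):
                # Start the next copy before computing, so copy and compute overlap.
                pending = copier.submit(transfer, chunks[i + 1])
            results.append(compute(buf))
    return results

print(pipelined([[1, 2], [3, 4], [5]]))  # [5, 25, 25]
```

The same pattern applies at every level of the hierarchy: asynchronous MPI messages between nodes, and asynchronous streams between a host and its accelerator.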
Increasing the use of accelerators and coprocessors also introduces a challenge for IT staffs charged with limiting or managing the costs of computing. Datacenter managers are incentivized to deliver energy-efficient supercomputing, and often focus on tracking power usage effectiveness (PUE) as the measure of a system's energy efficiency (or "greenness"). I would argue that for computational workloads, such as those found in the oil and gas exploration workflow, "energy to result" is a more effective measure than PUE. The formula for energy to result is quite simple:
Energy to result = (power draw per node) × (number of node hours) × PUE
Why measure energy to result? Most organizations have to date focused on PUE, but in practice the first two factors, power draw per node and number of node hours, have a greater impact. Imagine a scenario where I have two systems available to handle an identical workload. System A is configured with power-efficient nodes (conventional CPUs), while System B is configured with the same number of performance-optimized nodes (accelerators/coprocessors). The energy to result is considerably less on System B, so which is the energy-efficient, or "greener," system? I would argue that scalability, the ability of System B to complete the workload in a shorter period of time, makes it the more energy-efficient one. We see this result on Cray® XC40™ systems quite often: our systems consume less energy to complete a job than a similarly configured cluster, because the scalability of the workload matters far more than the PUE. The Aries network on the XC40 system has a big impact on application performance and scaling.
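To make the comparison concrete, here is the formula worked through in a few lines of Python. All of the specific figures (power draws, node count, runtimes, PUE) are invented for illustration; only the formula itself comes from the text.

```python
# Worked example of the "energy to result" formula. Every number here is
# hypothetical; only the formula (power per node x node hours x PUE) is
# taken from the text.

def energy_to_result(watts_per_node, node_hours, pue):
    """Energy to result in kilowatt-hours."""
    return watts_per_node * node_hours * pue / 1000.0

# System A: 100 power-efficient CPU nodes at 400 W each, 10 hours to finish.
energy_a = energy_to_result(watts_per_node=400, node_hours=100 * 10, pue=1.1)

# System B: 100 accelerator nodes at twice the power, but done in 3 hours.
energy_b = energy_to_result(watts_per_node=800, node_hours=100 * 3, pue=1.1)

print(f"System A: {energy_a:.0f} kWh, System B: {energy_b:.0f} kWh")
# System A: 440 kWh, System B: 264 kWh
```

Even though each System B node draws twice the power, finishing in less than a third of the node hours gives it the lower energy to result, which is the point of the argument above.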