When the Met Office chose Cray to supply three large XC40 supercomputers, Cray’s CEO, Peter Ungaro, made a bold statement. He said, “You will be installing the largest operational systems in the world. There will be problems at scale that you won’t have anticipated. Partnership with Cray will allow you to access our deep expertise and solve these problems.” Recent ambitious upgrades to the Met Office’s forecasting codes have given us the chance to test this claim.
Operational weather forecasting is unlike many other areas of science because the computer models must run to a strict time schedule. A forecast that takes too long to run isn’t available in time for customers to make decisions, and so is worthless. The Met Office’s global forecast model has just 1 hour in which to calculate the next week’s weather. In September 2017, the resolution of this model was increased from 17km to 10km. This upgrade improved many aspects of forecast skill, but it came with a huge increase in computational cost, which had to be absorbed without increasing the time the model took to complete.
Initial runs of the upgraded model, on a lightly loaded XC40 system, suggested that we should be able to run the model in 55 minutes. Unfortunately, when it was implemented in our pre-operational suite, on a busy system alongside all the other components of our suite of forecasts, we saw significant variability in runtimes: some runs took up to 74 minutes. This was clearly unacceptable for an operational model, so we started to analyse the problem in greater depth. We discovered that while most model time-steps took just over 1 second, a significant number took over 20 seconds. Moreover, by re-running the job with some other work stopped, we were able to deduce that this large forecast was being hit by congestion on the Aries interconnect that links the thousands of nodes in the supercomputer.
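The kind of analysis described above can be sketched very simply: given a log of per-time-step wall-clock durations, flag the spikes and estimate how much runtime they cost relative to a normal step. This is an illustrative sketch only; the function names, the 20-second threshold, and the timing data below are invented for the example, not Met Office code or data.

```python
import statistics

def find_slow_steps(durations, threshold=20.0):
    """Indices and durations of time-steps slower than `threshold` seconds."""
    return [(i, d) for i, d in enumerate(durations) if d > threshold]

def congestion_excess(durations, threshold=20.0):
    """Seconds of runtime attributable to slow steps, relative to the median step."""
    median = statistics.median(durations)
    return sum(d - median for _, d in find_slow_steps(durations, threshold))

# Illustrative timings: mostly ~1.1 s steps with a couple of 20+ s spikes.
steps = [1.1] * 200
steps[50] = 23.4
steps[120] = 27.9

print(find_slow_steps(steps))            # [(50, 23.4), (120, 27.9)]
print(round(congestion_excess(steps), 1))  # 49.1
```

A handful of such spikes per run is enough to turn a 55-minute forecast into a 74-minute one, which is why attributing the excess time to specific steps was the first step in the diagnosis.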
It was at this point that we got in contact with Cray; after all, who knows the performance characteristics of Aries better than its designer? Working with our local Cray analyst and the Met Office system administrators, we found better ways to schedule this large job to maximise interconnect performance and reduce the impact of other work. The Met Office HPC Optimisation Team also developed code changes to reduce the cost of communications. Finally, a suggestion from Cray’s Aries guru, Duncan Roweth, led us to investigate different routing options for MPI traffic. The result of all this was a much more reliable runtime, of around 54 minutes, which allowed us to go live with the new model.
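For readers wanting to experiment with the same class of tuning: on Cray XC systems, the routing behaviour of Cray MPT traffic over Aries can be steered through environment variables set in the job script. The variable name and values below are quoted from memory of Cray’s `intro_mpi` man page and should be treated as an assumption; the executable name is a placeholder. Always verify against the documentation on your own system before relying on them.

```shell
#!/bin/bash
# Hypothetical job-script fragment (not the Met Office's actual script).
# Check `man intro_mpi` on your Cray system for the authoritative
# variable names and values.

# Adaptive routing modes (ADAPTIVE_0..ADAPTIVE_3) bias traffic differently
# between minimal and non-minimal paths; a deterministic in-order mode
# also exists. Which mode helps depends on the job's communication pattern
# and what else is running on the machine.
export MPICH_GNI_ROUTING_MODE=ADAPTIVE_3

aprun -n 1024 ./forecast_model   # placeholder executable
```

The point of the exercise is not any single setting, but that routing choices which are optimal on a quiet machine can behave badly under contention, so they are worth re-testing under realistic load.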
This wasn’t the end of the effort, however. In the subsequent months, Duncan suggested how the same routing options could be applied to I/O traffic to reduce the impact of other jobs. The Met Office HPC Optimisation Team also created a raft of new code optimisations targeted at the high node counts we were now using. Put together, these changes have enabled us to reduce the number of nodes used for the forecast by 25% and improve the runtime to 47 minutes. Met Office scientists are already deciding which of the forecast improvements in development can best make use of the capacity this has freed!
None of this would have been possible had Cray not been so responsive to our problems and so willing to work closely with us to solve them. With the Met Office and Cray working as true partners, more was achieved than either could have managed alone, and the Met Office’s performance expectations were not just met but substantially exceeded.