(This is the first in a series of three posts. The second will discuss critical workflows in oil and gas and the life sciences while the last will speculate about machine learning techniques to optimize such workflows).
A bit of background
In the late ’80s, U.S. federal government agencies became convinced that a substantial effort to fund R&D in high performance computing (HPC) was required to address so-called “grand challenges” — fundamental problems in science and engineering that are ambitious (requiring some advances in science and technology) but achievable. In 1992, the Office of Science and Technology Policy (OSTP) released a recommendation proposing investments in HPC systems, technology and algorithms, a national research and education network, and basic research. The funding effort was successful and the program long-lasting. The term “grand challenges” stuck, too: It became a meme within the HPC community and, more generally, within the science, technology, engineering and mathematics (STEM) communities.
Fast forward to today
The definition of an HPC grand challenge has broadened, and the approach toward “solving” the challenge has changed too. No longer is the program solely focused on building up advanced modeling and simulation capabilities in top-down fashion. For example, a major objective of President Obama’s recent Executive Order on creating a National Strategic Computing Initiative (commonly called the “U.S. exascale initiative”) is to increase the “coherence between the technology base used for modeling and simulation and that used for data analytic computing” and to create “an enduring national HPC ecosystem by employing a holistic approach.”
This puts HPC right at the intersection of two big trends in today’s computing (this time driven by industry and commerce, not by policymakers): cloud computing (as a community resource) and big data analytics (BDA). These two profound trends will impact “HPC-as-is” but, if done right, may in return also benefit from core competencies developed by the HPC community. This post will focus on BDA and “critical workflows” for a new (but not totally new) way of doing HPC.
So what exactly is big data analytics (for HPC)?
Big data analytics for HPC goes beyond simple operations such as min, max, average and the like, just as most business intelligence applications go beyond queries that can be expressed in a MapReduce framework. People call it “dense” analytics for its high processing density, but it is much more. For instance, in bioinformatics, an article by Orly Alter and Gene H. Golub can serve as a textbook example of what truly constitutes BDA for HPC: a large-scale (whole-genome), data-driven (that is, no a priori model assumptions) approach, involving some serious HPC processing (centered on a generalized SVD), to discover patterns (genelets) and operations (such as cellular replication)…. WordCount it isn’t.
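As a toy illustration of this kind of data-driven pattern discovery, here is a minimal NumPy sketch. It uses the plain SVD as a stand-in for the generalized SVD of the Alter–Golub approach, and the matrix, its dimensions and the random data are invented purely for the example — a real whole-genome dataset is orders of magnitude larger:

```python
import numpy as np

# Hypothetical gene-expression-style matrix: rows = genes, columns = arrays
# (experimental conditions). Values are random stand-ins for real measurements.
rng = np.random.default_rng(0)
expression = rng.standard_normal((100, 6))

# Plain SVD as a stand-in for the generalized SVD: the columns of U carry
# gene-side patterns, the rows of Vt carry condition-side patterns, and the
# singular values rank the patterns by how much of the data they capture.
U, s, Vt = np.linalg.svd(expression, full_matrices=False)

# Fraction of the data each pattern explains -- no model assumed a priori.
explained = s**2 / np.sum(s**2)
print(explained[:3])
```

The point of the sketch is the workflow shape, not the statistics: the patterns fall out of the decomposition of the data itself, with no model assumed up front.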
And what is a critical workflow?
The term “workflow” originated in industrial America as a management concept: an orchestrated and repeatable pattern of business activity enabled by the systematic organization of resources (see Wikipedia for more detail). The translation to IT deployment is obvious. (One can only imagine how daunting that will be for exascale-size systems, software and facilities! Engaging the supply chain, ramping up manufacturing, and developing rigorous, repeatable build-and-test-and-rebuild processes to meet exascale service level agreements will require considerable expertise and no small measure of ingenuity.)
But this is not quite what is meant by critical workflows for HPC and analytics applications. For every historical HPC grand challenge application, there is now a critical dependency on a series of other processing and analysis steps, data movement and communications that goes well beyond the pre- and post-processing of yore. It is iterative, sometimes synchronous (in situ) and generally more on an equal footing with the “main” application. (In fact, there may no longer even be a main application.) There is a generous amount of quality assurance as well as validation involved. Input data can be massive and is typically sensor data. In such workflows the data is always noisy, the models are always incomplete and the task is never truly done: Data gets reprocessed and analyzed de novo, sometimes years later. Hence, a thorough understanding of the acquisition and processing history is essential.
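One way to picture that emphasis on acquisition and processing history is a minimal provenance sketch. The dataset, step names and parameters below are all hypothetical — the only point is that every step records what was done, with which parameters, and when, so the data can be reprocessed de novo years later:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    """One processing or analysis step applied to a dataset."""
    step: str
    parameters: dict
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

@dataclass
class Dataset:
    """Sensor data plus the processing history needed for later reanalysis."""
    name: str
    history: list = field(default_factory=list)

    def apply(self, step: str, **parameters):
        # Append a record before/while the step runs, so the history is
        # complete even if the data is revisited years later.
        self.history.append(ProvenanceRecord(step, parameters))

survey = Dataset("seismic_survey_2015")  # hypothetical names throughout
survey.apply("denoise", method="bandpass", low_hz=5, high_hz=60)
survey.apply("stack", fold=48)
print([r.step for r in survey.history])  # → ['denoise', 'stack']
```

A production workflow system would persist these records alongside the data; the sketch only shows why the history is part of the dataset, not an afterthought.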
Characteristics of a critical workflow
In the cyber/software security world, such workflows would be categorized as a “large attack surface where multiple vectors can attack.” In security, that is obviously not a good thing; in workflows, it goes to the heart of the matter. What is meant by this “large attack surface”? First, the user input fields of a workflow are many: From scripting to plain-text keyboard entry to job submission parameters and model descriptions, virtually every aspect of the workflow’s execution can be controlled or altered. Second and third, (communication) protocols and interfaces are mixed; they range from lightweight REST to the very heavyweight Message Passing Interface (MPI) for interprocessor communication. Finally, by the very definition of a workflow, there is considerable use of software functionality to provide services. In fact, each step in the workflow can be considered a service to the other steps.
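The “each step is a service to the other steps” idea can be sketched in a few lines. The registry, step names and payloads below are invented for illustration, and a real deployment would put REST, a message queue or MPI underneath this common interface:

```python
from typing import Callable, Dict

# Hypothetical registry: every workflow step is published behind one minimal
# "service" interface (dict in, dict out), so steps can call one another
# regardless of the transport that actually carries the call.
services: Dict[str, Callable[[dict], dict]] = {}

def service(name: str):
    """Decorator that registers a workflow step under a service name."""
    def register(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        services[name] = fn
        return fn
    return register

@service("ingest")
def ingest(payload: dict) -> dict:
    # A real step would validate, QA and annotate the incoming sensor data.
    return {"samples": payload["raw"], "qa_passed": True}

@service("analyze")
def analyze(payload: dict) -> dict:
    # A step may itself call another step -- each is a service to the others.
    data = services["ingest"](payload)
    return {"mean": sum(data["samples"]) / len(data["samples"])}

print(services["analyze"]({"raw": [1.0, 2.0, 3.0]}))  # → {'mean': 2.0}
```

The same composability that makes this attractive as a workflow design is what widens the surface: every registered entry point is something a user (or attacker) can drive.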
In the next blog post, I’ll try to make these high-level statements on BDA and workflows a lot more specific by discussing emerging workflows in seismic processing and in bioinformatics.