(This is the last of three blog entries in this series. The first one introduced big data analytics and critical workflows. The second post discussed critical workflows in oil and gas and in life sciences. This one will speculate about machine learning techniques to optimize such workflows.)
Machine learning (ML), at first blush, has to be one of the lesser Olympians in today’s IT pantheon: a quick Google search reveals “big data” to be Zeus (863 million entries), closely followed by “analytics” (632 million), while “cloud computing” is trailing at 161 million entries, and machine learning a distant fourth at 59 million entries.
ML is typically defined as those algorithms (and the study thereof) that can discover patterns in data and that get better at it the more data you throw at them. It’s really some sort of automated analytics and statistics on big data. Since those two are the heavy hitters on Google, perhaps this lesser Olympian deserves to be upgraded to Athena, the goddess of intelligent activity.
HPC system telemetry is a source of big data. Modern processing technologies – from processors to chipsets to NICs to disk controllers – are heavily instrumented and monitored, typically for failure-related purposes but also for performance, management and power. Put very large quantities of those components together in a supercomputer, and the aggregated telemetry is quite impressive. At Cray, the hardware supervisory system (HSS) is truly a system-within-a-supercomputer (hardware and software, including firmware) and is used to manage and monitor Cray® XC™ systems by collecting that telemetry data in logs and responding to signals. Combining this with application logs will lead to more intelligent scheduling of (independent) applications and higher job throughput. This, however, may not be enough for optimizing workflows with complex dependencies between constituent apps.
Recall, in the first post in this series, I borrowed a term from cyber/software security and defined workflows in terms of a low-attack surface with many attachments – which implied four characteristics: many user input fields, combined with mixed protocols and interfaces and blocks of software functionalities as a service. In security, this multitude and variety of attachments is a very bad thing. For instance, it has been noted (R. Colbaugh and K. Glass at http://www.security-informatics.com/content/1/1/9) that for such a surface even limited observations at the attachments can result in a fairly thorough understanding of the overall workings of what could otherwise be a very complex system. Worse, those systems also tend to be reachable or controllable; that is, the system can be driven to any desired end state via possibly limited inputs. Those features, bad as they are in security, could nevertheless be very appealing for workflow characterization and optimization. They suggest that it may be possible to do so by measuring or instrumenting at just a few attachments and to get to (near-) optimal results by changing just a few settings. Of course, for workflows, this is a hypothesis.
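To make the two control-theory notions above a bit more concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that a workflow can be approximated by a small linear model x(t+1) = A·x(t) + B·u(t) with measurements y(t) = C·x(t); the three-stage pipeline, the matrices and all function names are hypothetical and are not taken from any real workflow or tool. The sketch applies the standard Kalman rank conditions: the model is controllable if [B, AB, A²B, …] has full rank, and observable if [C; CA; CA²; …] does.

```python
from fractions import Fraction


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def rank(M):
    """Rank via Gaussian elimination, using exact fractions to avoid round-off."""
    M = [[Fraction(v) for v in row] for row in M]
    r = 0
    for c in range(len(M[0])):
        if r == len(M):
            break
        pivot = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, len(M)):
            f = M[i][c] / M[r][c]
            M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
    return r


def is_controllable(A, B):
    """Kalman test: [B, AB, ..., A^(n-1)B] must have rank n."""
    n = len(A)
    blocks, cur = [], B
    for _ in range(n):
        blocks.append(cur)
        cur = matmul(A, cur)
    stacked = [sum((blk[i] for blk in blocks), []) for i in range(n)]
    return rank(stacked) == n


def is_observable(A, C):
    """Kalman test: [C; CA; ...; CA^(n-1)] must have rank n."""
    n = len(A)
    rows, cur = [], C
    for _ in range(n):
        rows.extend(cur)
        cur = matmul(cur, A)
    return rank(rows) == n


# Hypothetical three-stage pipeline: each stage keeps half of its backlog
# and hands the other half to the next stage.
h = Fraction(1, 2)
A = [[h, 0, 0],
     [h, h, 0],
     [0, h, h]]
B = [[1], [0], [0]]        # a single "knob": the feed rate into stage 1
C_LAST = [[0, 0, 1]]       # a single probe on the last stage
C_FIRST = [[1, 0, 0]]      # a single probe on the first stage

if __name__ == "__main__":
    print("controllable from stage 1:", is_controllable(A, B))
    print("observable from last stage:", is_observable(A, C_LAST))
    print("observable from first stage:", is_observable(A, C_FIRST))
```

In this toy model, one knob at the head of the pipeline and one probe at its tail are enough to steer and to reconstruct all three stages, while a probe at the head alone is not. Real workflows are neither this small nor linear, so this only illustrates what the hypothesis would mean, not evidence for it.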
How would one investigate or prove this hypothesis? By linking the workflow concept to control theory and to cybersecurity, we get two possible immediate benefits: a body of active R&D and software tools from two well-established fields.
- Active R&D. Machine learning’s goal – applied to control theory – is to achieve high performance on very complex systems (think power or chemical plants, reactors, etc.) with limited observations while still guaranteeing safety. One approach goes by the name of “reinforcement learning and apprenticeship learning.” For workflows, it’s tempting to translate this to supervised learning on best-practice workflows. For the two example workflows discussed in the previous posts – bioinformatics and seismic processing – such best-practice workflows exist (for instance, the Broad Institute’s GATK: https://www.broadinstitute.org/gatk/). One can speculate whether emerging workload managers such as YARN can implement optimization policies derived from such supervised learning on best-practice workflows.
- Software tools. Cybersecurity and software security are grave concerns, and many companies are bringing tools for monitoring, detection, prevention and response to the market that could be repurposed for workflow analysis. One of many such tools is Microsoft’s Attack Surface Analyzer, which reports the changes in a Windows system’s attack surface that result from installing new business applications and software. This tool allows you to take a “snapshot” of security-related information on a system. At the very least, such tools provide a useful catalog of all protocols, interfaces and pipes in a system, and additional methodologies and metrics for such tools have been documented as well (Manadhata, P. K., & Wing, J. M. (2010). An Attack Surface Metric. IEEE Transactions on Software Engineering).
I introduced “critical workflow” as a natural successor to the grand challenges meme for HPC: where grand challenges are remarkable entrees, workflows are thoughtful, well-crafted menus. Combine that with the anticipated size of, and long and diverse customer lists for, the upcoming exascale “restaurant,” and the need for some serious optimization of those workflows (well beyond running the best and most efficient kitchen, to exhaust the analogy) becomes evident. The tools and methodologies to do so, borrowed from the more established fields of control theory and software security, may already exist: applying emerging machine learning algorithms to system and application data.