(This is the second in a series of three blog entries. The first post introduced big data analytics and critical workflows; this one discusses critical workflows in oil and gas and in the life sciences; and the last will speculate about machine learning techniques to optimize such workflows.)
In the first post, I defined workflows in terms of a large attack surface, which implies four characteristics: many user input fields, mixed protocols, mixed interfaces, and blocks of software functionality organized as services to one another. In addition, critical workflows are those without which R&D or engineering doesn’t get done. I also gave an example of big data analytics (BDA) and noted an article on the analysis of genome-scale data that described large-scale, data-driven, HPC processing applied to the discovery of patterns and operations. Let’s make the link between the two concepts a bit clearer, at least as seen in two different application spaces: bioinformatics and seismic processing.
In bioinformatics, a closer look at the Alter and Golub article reveals eight direct data sources (mostly messenger RNA, or mRNA, expression data) stored as tab-delimited text files. That sounds straightforward, but it isn’t. First, genetic assay analysis is a fairly involved process by itself, typically involving multiple software modules. Going from raw data to tab-delimited text files took some work!
Second, any analysis such as Alter and Golub’s will be complemented by others. Many analysis toolkits are out there (most of them on the Web) that cover the general areas of differential analysis, class prediction and discovery, pathway analysis and the like. It’s safe to assume that researchers will, in fact, try to infer relevance by comparing the results they get from different toolkits. (This is not unlike, say, El Niño weather predictions, which typically compare multi-member forecasts, not only for better statistics but also for understanding which “physics module” each model is slightly better at.) Finally, in a neat twist, most of the analysis leverages annotated data repositories, with the annotation itself the result of previous analyses and possibly updated by the current one.
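To make the starting point of such an analysis concrete, here is a minimal Python sketch of reading a tab-delimited expression table. The layout (a header row of sample names, then one row per gene) and the names `read_expression_table`, `geneA` and `geneB` are assumptions made for illustration; they are not taken from the Alter and Golub data, whose exact file layout is not described here.

```python
import csv
import io


def read_expression_table(handle):
    """Parse a tab-delimited expression table.

    Assumed layout (hypothetical): the first row holds a label followed by
    sample names; each later row holds a gene identifier followed by one
    numeric value per sample. Empty cells are kept as None.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene, raw_values = row[0], row[1:]
        table[gene] = [float(v) if v.strip() else None for v in raw_values]
    return samples, table


# Made-up example data, for illustration only.
EXAMPLE = "gene\tt0\tt1\tt2\ngeneA\t0.12\t-0.40\t\ngeneB\t1.05\t0.98\t0.77\n"

samples, table = read_expression_table(io.StringIO(EXAMPLE))
print(samples)
print(table["geneA"])
```

Even in this toy form, the missing value in the first gene's row hints at why getting from raw assay output to a clean table is real work: every downstream toolkit has to agree on how such gaps are handled.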
Hence, BDA in bioinformatics scores four out of four to qualify as a critical workflow: many user inputs, mixed protocols, mixed interfaces, and diverse software modules in service and support of each other. The merging of processing with analytics in complex workflows will only accelerate in the near future: precision medicine informatics will become the biggest driver of new bioinformatics tools and methods (it’s even becoming a required study program at some medical schools).
Seismic processing has long been associated with HPC (and big data). Seismic exploration companies have always acquired and produced huge volumes of data and consumed vast amounts of processing cycles. In a previous post I speculated that the oil and gas industry might be the first industry to make it to exascale, closely behind the government-funded R&D centers. (That was before the price of oil dropped precipitously, but even so, the industry continues to expand its processing capabilities quickly.) It has always been understood, too, that seismic processing contains many processing steps, software modules and I/O schemes that can be separately tuned, optimized and organized in support of the overall processing approach. The effort culminates in a grand challenge processing finale, today typically a spectral migration method. Furthermore, during the last few years, two new developments have broadened and deepened this already complex signal processing workflow and given it a pronounced analytics flavor:
- Processing capabilities have merged with new acquisition and binning technologies to the point where modern surveys can produce almost direct sensor evidence for fracture characterization.
- Full waveform inversion is usually classified as a nonlinear data fitting procedure, but, in combination with migration, it falls into the broader category of data-driven model building. As it can be recast in terms of optimal substructures, it could benefit from dynamic programming methods.
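To give a feel for what "nonlinear data fitting" means here, the following Python sketch fits a single reflector depth to travel times with Gauss-Newton least squares. It is emphatically not full waveform inversion: the one-layer geometry, the constant velocity, the function names and all numbers are assumptions chosen to keep the example tiny. It only shows the basic loop of predicting data from a model, comparing with observations, and updating the model.

```python
import math


def travel_time(offset, depth, velocity):
    """Two-way reflection time for a flat reflector in a constant-velocity layer."""
    return math.sqrt(offset ** 2 + 4.0 * depth ** 2) / velocity


def fit_depth(offsets, observed, velocity, depth0, iterations=20):
    """Estimate reflector depth by Gauss-Newton least squares (one unknown)."""
    depth = depth0
    for _ in range(iterations):
        numerator = 0.0
        denominator = 0.0
        for x, t_obs in zip(offsets, observed):
            path = math.sqrt(x ** 2 + 4.0 * depth ** 2)
            residual = t_obs - path / velocity          # observed minus predicted
            jacobian = 4.0 * depth / (velocity * path)  # d(travel time)/d(depth)
            numerator += jacobian * residual
            denominator += jacobian * jacobian
        depth += numerator / denominator                # Gauss-Newton update
    return depth


# Hypothetical, noise-free synthetic survey (all values are made up).
VELOCITY = 2000.0      # m/s, assumed known
TRUE_DEPTH = 1000.0    # m, the value the fit should recover
OFFSETS = [0.0, 500.0, 1000.0, 1500.0, 2000.0]  # m
OBSERVED = [travel_time(x, TRUE_DEPTH, VELOCITY) for x in OFFSETS]

estimated_depth = fit_depth(OFFSETS, OBSERVED, VELOCITY, depth0=500.0)
print(round(estimated_depth, 3))
```

A real inversion replaces the single depth with millions of model parameters and the closed-form travel time with a wave-equation solver, which is where the HPC demand comes from.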
As in the bioinformatics case, we can identify in seismic processing the same four characteristics of a critical workflow. Compared with bioinformatics, seismic processing pushes data volume considerably harder, but its data is less varied.
Having shown the complexity of workflows in two application domains, I’ll use the final post to discuss how machine learning techniques, in conjunction with best practices, can reduce that complexity substantially.