We recently had the pleasure of helping Jason Roszik and his colleagues at the University of Texas MD Anderson Cancer Center develop a high-throughput architecture to support their work in identifying combination therapies for cancer. This work sits at the interface of some major technology, processing and clinical trends, and it was quite an eye-opener, as well as a motivation, for us on how to use Cray-developed systems and processing technologies to build a useful and productive high-throughput IT architecture.
The first trend, of course, is next-generation sequencing (NGS). Costs are going down and sequencing throughput is going up dramatically, to the point where today’s NGS companies state they can process tens of human genomes a day for about $1,000 or less each. Perhaps the days of routine genomic analysis on clinical samples are not all that far off. Secondly, clinical information databases are growing, being consolidated and increasingly being linked to electronic health record (EHR) systems. Alongside these two trends, on the cancer research side, there is good and continuing progress in understanding molecular signaling pathways. However, there is also the disappointing observation that monotherapy seldom leads to a cure in oncology, and hence a major challenge in cancer drug discovery today is to identify effective combination therapies. Together, these trends point to an as-yet-unmet need for a productive, collaborative and user-friendly IT architecture that not only can handle massive amounts of diverse data, some of it residing on the web, but also has some nifty discovery-like tools and technologies built in.
To rise to the first part of the challenge (consolidated databases and processing), we chose Spark™, a user-friendly distributed system architecture for efficient processing of large and diverse data. Spark bundles a variety of tools and techniques, originally developed for at-scale distributed processing, that are relevant in a clinical setting. These include meta-scheduling of multiple dependent applications for optimized queries (including SQL and graph) and workflows, and rapid in-memory processing of large data volumes, which greatly reduces time-consuming data prepping. Spark also supports several visualization and analysis tools and languages that are frequently used in biomedical research, such as the R programming language. Spark is flexible with respect to data import/export and data management, and it also has tools for visualizing results, while making the work accessible to collaborators in an open and reproducible framework.
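To make the idea of an in-memory SQL query over consolidated data a little more concrete, here is a minimal sketch. It does not use Spark: Python's built-in sqlite3 module stands in for a distributed SQL engine so that the example runs anywhere. The table names, column names, sample IDs and gene labels are all made up for illustration and are not taken from the MD Anderson work.

```python
import sqlite3


def mutated_gene_counts(variants, clinical):
    """Count distinct samples per (tumor type, gene) with an in-memory SQL join.

    sqlite3 is a stand-in for a distributed SQL engine; the schema below is
    hypothetical and exists only for this illustration.
    """
    con = sqlite3.connect(":memory:")  # the database lives only in RAM
    con.execute("CREATE TABLE variants (sample_id TEXT, gene TEXT)")
    con.execute("CREATE TABLE clinical (sample_id TEXT, tumor_type TEXT)")
    con.executemany("INSERT INTO variants VALUES (?, ?)", variants)
    con.executemany("INSERT INTO clinical VALUES (?, ?)", clinical)
    rows = con.execute(
        """
        SELECT c.tumor_type, v.gene, COUNT(DISTINCT v.sample_id) AS n_samples
        FROM variants AS v
        JOIN clinical AS c ON c.sample_id = v.sample_id
        GROUP BY c.tumor_type, v.gene
        ORDER BY c.tumor_type, v.gene
        """
    ).fetchall()
    con.close()
    return rows


# Made-up rows: sequencing-derived variant calls and clinical annotations.
VARIANTS = [("S1", "GENE_A"), ("S1", "GENE_B"), ("S2", "GENE_A"), ("S3", "GENE_B")]
CLINICAL = [("S1", "TYPE_X"), ("S2", "TYPE_X"), ("S3", "TYPE_Y")]

if __name__ == "__main__":
    for tumor_type, gene, n_samples in mutated_gene_counts(VARIANTS, CLINICAL):
        print(tumor_type, gene, n_samples)
```

The point of the sketch is only the shape of the work: two differently sourced tables are loaded once, then joined and aggregated without an intermediate export step. A Spark deployment would express the same query over distributed data.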
For the second part of the challenge (discovery), we looked at our own Cray Graph Engine (CGE). Viewing and mining data in a graph database, in addition to relational databases or key-value table stores, can uncover valuable and previously hidden relationships. Specifically, the Cray Graph Engine is a triplestore for the Resource Description Framework (RDF), a standard model that allows data interchange on the semantic web. It leverages Cray high-performance hardware, parallel software and graph libraries to accelerate graph analytics for very large graphs. In practice, this was a multi-step procedure. First, we used CGE to identify and depict relevant pathway targets by organizing our results as a graph and labeling each of the candidate gene pairs with their respective pathway categories. We were then able to query and visualize this network to illuminate how the candidate genes are connected in terms of their pathway categorizations. Subsequently, we combined our results with two publicly available datasets, HUGO and Reactome, into a single graph built from all three data sources (two public, one private). Finally, using CGE’s SPARQL interface to query and update this integrated graph, we were able to label each of our candidate genes with its associated top-level Reactome pathway category.
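The merge-and-label steps can be sketched in a few lines of plain Python, with triples held as (subject, predicate, object) tuples in a set. This is a toy under stated assumptions, not CGE or SPARQL: every predicate name (candidatePairWith, hasId, inPathway, hasParent, topLevelPathway) and every gene, ID and pathway label is hypothetical. The sketch also assumes that a gene-naming source links symbols to IDs and that each pathway has at most one parent.

```python
def merge_graphs(*sources):
    """Union several collections of (subject, predicate, object) triples."""
    graph = set()
    for triples in sources:
        graph |= set(triples)
    return graph


def objects(graph, subject, predicate):
    """Return every object o such that (subject, predicate, o) is in the graph."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def top_level(graph, pathway):
    """Follow hasParent links upward until a pathway with no parent is reached."""
    seen = set()
    while pathway not in seen:  # guard against accidental cycles
        seen.add(pathway)
        parents = objects(graph, pathway, "hasParent")
        if not parents:
            break
        pathway = min(parents)  # toy rule: a single parent is assumed
    return pathway


def label_candidates(graph):
    """Add a topLevelPathway triple for every gene that appears in a candidate pair."""
    genes = set()
    for s, p, o in graph:
        if p == "candidatePairWith":
            genes.update((s, o))
    labels = set()
    for gene in genes:
        for gene_id in objects(graph, gene, "hasId"):
            for pathway in objects(graph, gene_id, "inPathway"):
                labels.add((gene, "topLevelPathway", top_level(graph, pathway)))
    return graph | labels


# Hypothetical data: one private result set and two stand-ins for public sources.
RESULTS = {("GENE_A", "candidatePairWith", "GENE_B")}
NAMING = {("GENE_A", "hasId", "ID:1"), ("GENE_B", "hasId", "ID:2")}
PATHWAYS = {
    ("ID:1", "inPathway", "PW_LEAF"),
    ("PW_LEAF", "hasParent", "PW_MID"),
    ("PW_MID", "hasParent", "PW_TOP_X"),
    ("ID:2", "inPathway", "PW_TOP_Y"),
}

if __name__ == "__main__":
    graph = label_candidates(merge_graphs(RESULTS, NAMING, PATHWAYS))
    for gene in ("GENE_A", "GENE_B"):
        print(gene, sorted(objects(graph, gene, "topLevelPathway")))
```

In a triplestore the same labeling would typically be written as a single query and update over the integrated graph rather than as Python loops. The sketch is meant only to show why the three sources need to share identifiers before the labels can be derived.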
As to how all this was used in practice and to what research effect, we urge the reader to consult the paper. We derived quite a bit of professional and personal satisfaction out of being able to put together an IT architecture that leverages our company’s R&D efforts and contributes meaningfully to important research in oncology.
Thanks to Cray solutions architect Matt Gianni for providing valuable input for this blog article.