There’s a lot of hype around big data in healthcare and the life sciences. But big data is here to stay. Information is what drives the entire industry.
When I worked in big pharma, I learned that the product of a pharmaceutical company is not a pill, it’s a label. And a label is just a specific assemblage of information, carefully distilled from terabytes of data collected and analyzed over the course of many years by many intelligent people. To compete, companies have to be very good at turning data into information, and information into knowledge. The stakes couldn’t be higher, because every day millions of patients rely on the quality of this data and the strength of the analyses done by researchers.
Analyzing big data is difficult. The data volumes are enormous and diverse, and they come from all over the place in every format imaginable — from massive text files to spreadsheets to “flat file databases” to large biomedical image files. We’re getting better and faster at collecting all of this data, sometimes because we have to (due to regulations) and sometimes just because we can (using wearable devices, for example). All of these factors make analyzing big data a daunting task. And to top it all off, the data scientists you hire to analyze big data need both computer science skills and domain expertise. They’re hard to find, and their time is extremely valuable.
What makes it harder is that the high-performance computing (HPC) environment in healthcare and life sciences is anything but stagnant. As big data grows bigger and technologies such as Spark™, Hadoop® and graph databases become ubiquitous, HPC is no longer exclusively for the research parts of the organization; it often needs to be readily available to everyone across the enterprise. Powerful tools like deep learning can and should be leveraged throughout the organization, but they can require more computing power than conventional compute architectures can provide.
Compounding these problems is the fact that the rate of innovation in healthcare and life sciences easily outpaces the rate at which IT can refresh infrastructure and update best practices. Many of these advances have significant impacts on IT. Today’s next-generation sequencing (NGS) workflows, for example, are as much computer science as they are biology. IT departments are expected to build efficient solutions for NGS and everything else that will last for years, for researchers who, through no fault of their own, usually can’t accurately articulate their needs beyond a few months. IT has a demanding job, one that gets bigger every day, and the ground continues to shift.
Cray has just announced a new platform, Urika®-GX, designed to help IT professionals and the researchers they support respond to the biggest data challenges in a data environment that is constantly changing in size, scope and complexity. The Urika-GX platform packs Cray’s supercomputing architecture into an industry-standard 42U 19-inch rack and comes pre-integrated with Hadoop and Spark. It also leverages Cray’s Aries™ interconnect to power the Cray Graph Engine (CGE), which can compute over very large semantic graphs 10 to 100 times faster than current graph solutions.
We are particularly excited about Urika-GX system’s extraordinary graph capability. As of June 2016, graph databases are still the fastest-growing type of database. One of the many reasons for this increasing popularity is the growing awareness that graph databases are much better than relational databases at representing relationships between entities (for more information on graph databases, see Graph Databases 101). They can be used to represent, in a very natural way, “everything you know” about any domain, like cancer biology, for example, and they can be used further to detect patterns of relationships or interactions between entities. Much of this can be very difficult or even impossible to do in relational databases.
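To make that contrast concrete, here is a minimal sketch in plain Python (with invented gene names and relations, not any particular graph database’s API) of why traversal-style questions are natural on a graph: each hop is a direct lookup on an adjacency index, whereas the relational equivalent typically needs one self-join per hop.

```python
from collections import defaultdict

# Hypothetical toy data: labeled edges as (subject, relation, object)
# triples, the natural unit of a graph store.
edges = [
    ("TP53", "regulates", "MDM2"),
    ("MDM2", "inhibits", "TP53"),
    ("TP53", "regulates", "CDKN1A"),
    ("CDKN1A", "inhibits", "CDK2"),
    ("EGFR", "activates", "KRAS"),
]

# Adjacency index: following a relationship is one dict lookup,
# no matter how large the graph grows.
adj = defaultdict(list)
for subj, rel, obj in edges:
    adj[subj].append((rel, obj))

def neighbors_within(start, hops):
    """Return every entity reachable from `start` in at most `hops` steps."""
    seen, frontier = set(), {start}
    for _ in range(hops):
        frontier = {obj for node in frontier for _, obj in adj[node]} - seen
        seen |= frontier
    return seen
```

Calling `neighbors_within("TP53", 2)` walks the edges two hops out; asking the same question of a relational table of edges would mean chaining self-joins, with every additional hop adding another join.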
CGE on the Urika-GX platform lets researchers build graphs containing many tens of billions of relationships, integrated from all manner of sources, and interrogate those graphs flexibly and quickly to gain insights. Graphs have been used in life sciences for decades to represent biochemical and biological pathways at the molecular and cellular level. Today, they are being used for everything from cancer genomics and cell morphology to analysis of patient electronic health records to cybersecurity, with more applications coming every day. On conventional architectures, computation slows dramatically as the graphs get bigger, even when you add nodes to the cluster. With CGE on the Urika-GX system, graphs scale smoothly to deliver results quickly.
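As an illustration of what a “pattern of relationships” query looks like (again in plain Python with invented pathway data, not CGE’s actual query interface), the following sketch finds two-step feedback loops, where one entity acts on another and that entity acts back on the first:

```python
# Hypothetical toy pathway edges as (subject, relation, object) triples.
edges = [
    ("TP53", "activates", "MDM2"),
    ("MDM2", "inhibits", "TP53"),
    ("EGFR", "activates", "KRAS"),
    ("KRAS", "activates", "RAF1"),
]

# A feedback loop is the pattern: A -> B and B -> A.
pairs = {(s, o) for s, _, o in edges}
feedback_loops = sorted(
    (a, b) for a, b in pairs if (b, a) in pairs and a < b
)
print(feedback_loops)  # [('MDM2', 'TP53')]
```

Graph engines express this kind of structural pattern directly and match it across the whole graph at once; at tens of billions of edges, that matching is exactly the workload that bogs down on conventional clusters.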
All of that power, optimized for Hadoop and Spark as well as large graphs in the same machine, is a game changer for big data analytics. The Urika-GX platform will have your data scientists spending less time waiting and more time producing the insights you need for success.
To learn more, check out this video series covering how supercomputing and analytics solutions can help you make the most of your life sciences IT investment.