New Computational Tools Leverage Hadoop and Spark for NGS Workflows

Life sciences research continues to evolve rapidly. While structural biology, drug discovery and materials science are still important, we’re seeing an increasing focus on analytics and the more effective use of data. The race to understand patients and diseases at the molecular level to achieve precision medicine is fueling this shift.

We recently wrote about precision medicine and the role genomics is playing in improving cancer care. Next-generation sequencing (NGS) technologies are driving explosive growth in the use of genomics for research related to human disease, agriculture and evolutionary sciences, with new developments aimed at greater accuracy, faster results and lower costs.

With technologies evolving this quickly, life science organizations are struggling to keep their conventional compute infrastructure up to date. The industry challenge is summarized by Chris Dagdigian of The BioTeam: “. . . [T]oday’s Bio-IT professionals have to design, deploy, and support IT infrastructures with life cycles measured over several years, in the face of an innovation explosion where major laboratory and research enhancements arrive on the scene every few months.”

At Cray, we are meeting the increased use of analytics and big data within the life science research community by incorporating supercomputing technologies into analytics solutions. Additionally, we are seeking out a few strategic partners who leverage advances in information technology such as Apache Spark™ and Hadoop®. Historically, bioinformatics codes have not leveraged Hadoop or Spark, but a few companies, namely Lumenogix and BioDatomics, have developed wrappers that let existing bioinformatics codes run in these analytics environments. Wrapping a bioinformatics code eliminates the need to rewrite it before moving it to Spark or Hadoop, and it provides a convenient way to incorporate updates and changes to the underlying tool. Another key benefit of working with these partners is the ability to capture relevant metadata associated with NGS analysis, which lets researchers and analysts repeat experiments and gives them the information they need to compare results across different analytical runs.
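To make the wrapping idea concrete, here is a minimal Python sketch. It is not how Lumenogix or BioDatomics implement their wrappers, and every name in it (run_wrapped, run_batch, STAND_IN_TOOL) is hypothetical. It assumes the tool reads a sample on standard input and writes its result to standard output. The stand-in "tool" is a one-line Python command that counts input lines, and a local thread pool takes the place of the cluster. On Spark, a mechanism such as RDD.pipe could plausibly fill that role.

```python
import hashlib
import json
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def run_wrapped(sample_id, command, input_bytes):
    """Run an unmodified command-line tool and record provenance metadata."""
    started = time.time()
    proc = subprocess.run(command, input=input_bytes, capture_output=True, check=False)
    return {
        "sample_id": sample_id,
        "command": command,  # exact invocation, so the run can be repeated
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),  # identifies the input
        "exit_code": proc.returncode,
        "elapsed_seconds": round(time.time() - started, 3),
        "output": proc.stdout.decode(),
    }


# Hypothetical stand-in for a real bioinformatics executable: counts input lines.
STAND_IN_TOOL = [
    sys.executable,
    "-c",
    "import sys; print(len(sys.stdin.read().splitlines()))",
]


def run_batch(samples, command=STAND_IN_TOOL, workers=4):
    """Process several samples in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(run_wrapped, sample_id, command, data)
            for sample_id, data in samples.items()
        ]
        return [future.result() for future in futures]


if __name__ == "__main__":
    samples = {"sampleA": b"ACGT\nTTGA\n", "sampleB": b"GGCC\n"}
    for record in run_batch(samples):
        print(json.dumps(record, indent=2))
```

Because the wrapper only knows the command line, swapping in a newer version of the tool means changing the command list, not the wrapper. The metadata record stored with each result is what would let an analyst rerun the same step or compare two runs.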

Now the cool stuff. We’re working with these partners and others, including Intel, to test NGS workflows on Cray’s Urika-XA™ extreme analytics platform. The Urika-XA platform is optimized for Spark and Hadoop environments, with over 1,500 cores, fast SSD storage at the node level, a POSIX-compliant parallel file system and 6 TB of RAM, all in one cabinet. This unique architecture enables researchers to run their NGS workflows, perform reanalysis and then go on to perform any other analysis or annotation that can run on Hadoop or Spark while limiting data movement and still maintaining a small footprint in the data center. Our initial results are exciting: We’ve increased the number of samples that can be processed in parallel and reduced the time it takes to process both exome and whole-genome samples.

Join us next week at the annual Bio-IT conference, where we’ll discuss how high-performance computing technologies and analytics are applied to NGS workflows.
