Next-generation sequencing (NGS) describes the modern nucleotide sequencing technologies that allow analysis of genetic material with unprecedented speed and efficiency. Its advent is shifting genome assembly from a problem of laboratory-based chemistry to one well suited to high performance computing (HPC). In simple terms, NGS involves breaking up long DNA or RNA molecules into millions of small fragments (50 to 200 nucleotides) and sequencing them. The resulting sequences, called “reads,” are then assembled into larger fragments called contigs.
The process of taking genetic material, processing it on a sequencer, passing it to an HPC system for assembly, and outputting digital information in a form useful for research is called a “workflow,” the end-to-end flow of genetic information. Across this workflow there are bottlenecks, where an individual step can dramatically slow down the whole workflow.
The success of the new technologies, in particular sequencers that generate reads faster, has come at a price. These sequencers produce reads that are too short (< 150 bp) for commonly used assemblers based on overlap-layout-consensus algorithms.
Instead, de Bruijn graph-based assemblers have proven to be successful at assembling short reads. Taken one step further, distributed memory parallelism can substantially improve the performance and resource utilization of NGS workflows. Cray is working closely with the Broad Institute to optimize one of the modules in Trinity, an open source application for de novo transcriptome reconstruction from RNA-seq data.
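To make the de Bruijn graph idea concrete, here is a minimal sketch in Python. It is not how Trinity or any production assembler is implemented. The function names, the toy reads and the choice of k = 4 are all assumptions made for illustration, and the sketch ignores reverse complements, sequencing errors and branching in the graph.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every length-k substring (k-mer) of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mers that follow it.

    Every k-mer contributes one edge: its first k-1 bases point to its
    last k-1 bases.
    """
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while each node has exactly one successor.

    Simplification: only out-degree is checked, and the walk stops if it
    would revisit a node.
    """
    contig = start
    node = start
    seen = {start}
    while node in graph and len(graph[node]) == 1:
        (nxt,) = graph[node]
        if nxt in seen:
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Three overlapping toy reads (hypothetical data, far shorter than real reads).
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous(graph, "ATG")
print(contig)  # ATGGCGTGCAAT
```

The point of the sketch is that reads are never compared with each other directly. Each read is broken into k-mers, and overlaps emerge from shared (k-1)-mer nodes, which is why the approach copes with very large numbers of short reads.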
Trinity is freely available and combines three independent software modules (Inchworm, Chrysalis and Butterfly) that are applied sequentially to process large volumes of RNA-seq reads. It partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at a given gene or locus, then processes each graph independently to extract full-length splicing isoforms and tease apart transcripts derived from paralogous genes.
At the upcoming Intelligent Systems for Molecular Biology (ISMB) conference we’ll review some commonly used NGS workflows and highlight opportunities for improving their performance. Along with sharing our findings on the use of distributed memory parallelism, we’ll highlight our project to parallelize the Broad Institute’s Inchworm module in the Trinity RNA-Seq program, enabling efficient scaling to thousands of processors. We’ll focus on Inchworm, the first part of the code, which was parallelized using MPI, while keeping in mind the whole structure of this workflow.
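One common way to spread k-mer work across distributed memory is to give every k-mer an owning process chosen by a hash, so that all copies of a k-mer end up on the same process. The sketch below simulates that pattern in a single Python process. It is not the parallel Inchworm code and it does not use MPI. The "ranks" are plain list entries, the exchange step stands in for what an MPI program would typically do with an all-to-all collective, and every name and the toy data are assumptions made for illustration.

```python
import zlib
from collections import Counter


def owner_rank(kmer, n_ranks):
    """Pick the rank that owns a k-mer, using a hash that is stable across runs."""
    return zlib.crc32(kmer.encode("ascii")) % n_ranks


def local_buckets(reads, k, n_ranks):
    """On one rank: split the k-mers of the local reads into one bucket per owner."""
    buckets = [[] for _ in range(n_ranks)]
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            buckets[owner_rank(kmer, n_ranks)].append(kmer)
    return buckets


def simulate_distributed_count(reads_per_rank, k):
    """Simulate the exchange: each rank receives its buckets from every rank and counts them."""
    n_ranks = len(reads_per_rank)
    outgoing = [local_buckets(reads, k, n_ranks) for reads in reads_per_rank]
    counts = []
    for dest in range(n_ranks):
        received = Counter()
        for src in range(n_ranks):
            received.update(outgoing[src][dest])
        counts.append(received)
    return counts


# Hypothetical input: three simulated ranks, each holding a slice of the reads.
reads_per_rank = [["ATGGCGT", "GGCGTGC"], ["GTGCAAT"], ["ATGGCGT"]]
per_rank_counts = simulate_distributed_count(reads_per_rank, k=4)

# Merging the per-rank tables reproduces the counts a serial pass would give.
merged = sum(per_rank_counts, Counter())
print(merged["GGCG"])  # 3
```

Because ownership depends only on the k-mer itself, no rank needs a copy of the whole k-mer table, and the memory needed per process shrinks as more processes are added. The cost is the communication in the exchange step, which is where a real implementation would spend most of its tuning effort.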
Beyond Trinity and Inchworm, we at Cray believe efficient distributed memory parallel implementations can improve most types of bioinformatics workflows, and that the gains justify the effort.