A recent article in Nature, “The Power of Petabytes”, by Michael Eisenstein, reviews how exponentially increasing life science data exceeds our present ability to process and make sense of it. Yet even as they grow without bound, data sets are often still too small to support convincing conclusions.
Computation is one obvious problem. “The computation scales linearly with respect to the number of people,” says Marylyn Ritchie, a genomics researcher at Pennsylvania State University in State College. “But as you add more variables, it becomes exponential as you start to look at different combinations.” To efficiently harness growing computing resources, researchers will need to leverage scalable algorithmic approaches and system architectures.
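To make the scaling in that quote concrete, here is a minimal sketch that counts how many variable combinations an analysis would have to consider. The function names and the idea of capping the interaction order are illustrative assumptions, not something taken from the Nature article.

```python
from math import comb


def interaction_tests(n_variables: int, max_order: int) -> int:
    """Count the subsets of variables of size 1..max_order.

    Each subset stands for one combination that would need to be tested.
    """
    return sum(comb(n_variables, k) for k in range(1, max_order + 1))


def all_subsets(n_variables: int) -> int:
    """Count every non-empty subset of the variables: 2**n - 1."""
    return 2 ** n_variables - 1


if __name__ == "__main__":
    for n in (10, 20, 40):
        print(n, interaction_tests(n, 2), all_subsets(n))
```

Adding people multiplies the work per combination by a constant factor, while each added variable doubles the number of possible subsets, which is why limiting the interaction order is usually the only practical option.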
Another challenge in life sciences is data storage. As with computation, many of the distinctive problems of genome analysis do not arise until data reaches the petabyte scale now common in the field. In my own experience, storage is indeed a significant problem or, more accurately, a set of problems.
While prices per terabyte continue to drop for hard drives and tape, one problem is building usable and reliable storage volumes that can exceed 100 TB each. As the average size of BAM files, a de facto standard for sequence data, climbs past 300 GB, traditional file systems are increasingly inadequate as reliable repositories. Until now, it was marginally acceptable to live with a large number of smaller file systems, but large files force ever larger file systems.
As individual hard drives increase in size, more and more elaborate underlying software methods are required to guarantee data integrity. RAID5 was long the standard way of grouping disks to allow reconstruction after a failure. It is now effectively obsolete for large drives, because the chance of a second, fatal error during reconstruction is too high given individual drive sizes. One popular replacement, RAID6, adds another drive’s worth of redundancy, but is itself reaching the end of its practical life. The next step in data protection is erasure coding, which allows reliability to increase with the scale and size of the data set, assuming one has the drives, interconnect, and compute power needed for the encoding.
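The rebuild risk can be estimated with a back-of-the-envelope model. The sketch below assumes that a single-parity rebuild must read every surviving drive in full, that unrecoverable read errors are independent, and that they occur at a rate of one per 10^14 bits read. That rate is a commonly quoted spec-sheet figure used here as an assumption; real drives may behave quite differently, so treat the numbers as illustrative only.

```python
from math import expm1, log1p


def rebuild_failure_probability(
    n_drives: int,
    drive_tb: float,
    ure_per_bit: float = 1e-14,  # assumed spec-sheet error rate, not a measurement
) -> float:
    """Estimate the chance of at least one unrecoverable read error (URE)
    while rebuilding a single-parity array after one drive has failed.

    Simplified model: every bit on the n_drives - 1 surviving drives is
    read once, and each bit fails independently with probability
    ure_per_bit.
    """
    bits_read = (n_drives - 1) * drive_tb * 1e12 * 8
    # 1 - (1 - p)**bits_read, computed stably for very small p
    return -expm1(bits_read * log1p(-ure_per_bit))


if __name__ == "__main__":
    for size_tb in (1, 4, 10):
        p = rebuild_failure_probability(n_drives=8, drive_tb=size_tb)
        print(f"8 x {size_tb} TB: {p:.1%} chance of a URE during rebuild")
```

Under these assumptions the estimated risk climbs quickly with drive size, which is the effect described above. A second parity drive changes the arithmetic but not the trend.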
When we start talking about petabyte-plus file systems and their use in research, not only do we have enormous files, we also have lots of duplicate data created during analysis. A large research file system is something like an untidy basement: things are often hard to find, and just as often one is overwhelmed by everything else in there.
Traditional file systems lack the tools to keep track of their contents in real time. While an administrator can often figure out who is using what resources, it is mostly a forensic activity requiring custom scripts to be run and interpreted. Administrators may be able to allocate file system volumes from a pool of raw disk, but they have little knowledge of which files are active and which are inert duplicates. The file formats, BAM for example, are often envelopes around deeper-level abstractions: the file itself is actually a file system of sorts.
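As an illustration of the kind of custom forensic script meant here, the following sketch walks a directory tree and reports files whose contents are byte-for-byte identical. It is a minimal example under simple assumptions (a local POSIX-style tree, SHA-256 as the fingerprint, hypothetical function names), not a tool suited to petabyte-scale volumes, where hashing everything would itself be a major job.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root: str) -> dict:
    """Map each content digest to the sorted paths that share it.

    Files are grouped by size first so that only files with a possible
    twin are actually read and hashed.
    """
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and special files
            by_size[os.path.getsize(path)].append(path)

    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for path in paths:
            by_digest[file_digest(path)].append(path)

    return {d: sorted(p) for d, p in by_digest.items() if len(p) > 1}
```

Calling `find_duplicates("/path/to/project")` returns only the groups with more than one member. Deciding which copy is the active one still takes knowledge the file system does not have.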
So we have an increasing torrent of data pouring into the storage complex, and it is only marginally possible to save it reliably. Once stored, data is often difficult to find, and more so once a file slips out of immediate active use.
At present, workflows, at least in research, are not standardized in any meaningful way. And they can hardly be, given the free-flowing nature of research. Genomics and proteomics data can come cleanly off sequencers, be assembled and dumped to disk, but what happens after that?
One can argue for new file systems built from the ground up for much larger data types than those handled well by today’s most common file systems, such as NTFS or ext4. Perhaps this is what’s driving the shift to object file systems these days?
On the surface, object file systems like Ceph or Swift trade the complexity (and functionality) of conventional file systems for simplicity of storage and retrieval. Many of these also emphasize their ability to tag files with arbitrary metadata, allowing each file to be described in much more detail.
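The trade-off can be sketched with a toy in-memory model: a flat namespace of keys, whole-object put and get, and free-form metadata that can be queried. This is a conceptual illustration only. The class and method names are invented for the example and do not correspond to the Ceph or Swift APIs.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    data: bytes
    metadata: dict = field(default_factory=dict)


class ToyObjectStore:
    """Hypothetical flat key/value store with per-object metadata tags."""

    def __init__(self) -> None:
        self._objects = {}

    def put(self, key: str, data: bytes, **metadata) -> None:
        # Objects are written whole; there is no seek or partial update.
        self._objects[key] = StoredObject(data, dict(metadata))

    def get(self, key: str) -> bytes:
        return self._objects[key].data

    def find(self, **criteria) -> list:
        """Return the sorted keys whose metadata matches every criterion."""
        return sorted(
            key
            for key, obj in self._objects.items()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        )


if __name__ == "__main__":
    store = ToyObjectStore()
    store.put("run1/s1.bam", b"...", sample="S1", assay="wgs")
    store.put("run1/s2.bam", b"...", sample="S2", assay="rnaseq")
    print(store.find(assay="wgs"))
```

What is given up is everything a conventional file system adds on top: directories, in-place edits, locking. What is gained is that a description such as sample or assay travels with the object instead of living in a file name.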
It is often overlooked, though, that these systems, for the sake of their own simplicity of operation, are frequently layered on top of existing file systems, which in turn are layered on top of existing RAID systems for data integrity.
How do we solve these problems? We know how to solve the individual levels quite well (hardware, data integrity, file systems), but how does one reliably integrate them into a whole? And, probably most importantly, how does one do it in a way that enables effective use of this data rather than adding yet another patch to a heritage structure?