Storage for genomic data and sequencing
Need to store billions of small files efficiently? No problem with Qumulo.
Next-generation sequencing (NGS) has increased the storage requirements for genomic data dramatically.
As sequencers become more advanced, they produce more data. At the same time, greater efficiency has reduced costs, which means that more organizations can afford to do more sequencing.
Sequencers produce so much data that it’s not uncommon for a single lab to generate more than a billion files in a year. Globally, sequence data doubles approximately every seven months and is outstripping YouTube, Twitter, and astronomy in terms of storage growth.
To keep up, IT administrators are under pressure to find ways to expand and manage their storage infrastructure.
Legacy storage systems, which are based on 15- or even 20-year-old designs, cannot meet the demands of modern NGS workflows. IT organizations are now forced to use different solutions for different parts of their NGS workflows to compensate for the inefficiencies in their legacy systems. Multiple systems add complexity, which translates into higher maintenance costs. Multiple systems can also create data silos, so that one group of researchers may not be able to access data another team is using. That lack of collaboration can slow research results, which can delay the time it takes for a product to get to market.
Raw NGS data coming from a sequencer consists of many small TIFF files, each about 1 KB in size. These large numbers of small files slow down the performance of legacy storage systems. When that happens, compute resources are starved of data and researchers cannot get their results in real time. Slowing down highly paid researchers is not only expensive but can also impact time to market.
Small files make up the bulk of an NGS data set, but legacy systems store them inefficiently because they rely on mirroring, which wastes storage space. Wasted space translates into higher costs, both in the number of disks IT must buy and in infrastructure costs such as rack space, power and cooling.
NGS organizations can end up storing billions of files. Legacy storage systems can’t give IT administrators the visibility they need to manage so many assets. Legacy systems use separate, off-cluster appliances that rely on obsolete methods, such as sequential tree walks, to gather data. These methods cannot produce results in a reasonable amount of time when an organization stores so many assets: it can take days or weeks to get answers to simple questions, long past the point when those answers are of any use.
NGS organizations are looking to the cloud for two reasons. One is that, with its scalable, on-demand resources, the cloud is the logical answer when an organization needs extra compute power for a demanding or unexpected project. The other is that many NGS organizations share data and collaborate on projects with researchers all over the world, and the cloud is one way to make data easily accessible. The challenge is that legacy file storage vendors either have no cloud solution or offer versions of their products that have been patched to make them “cloud ready.” The cloud solutions that do exist suffer from poor performance, lack of protocol support and complexity.
Data Sheet: Qumulo for Genomic Sequencing
Qumulo is the file storage system for NGS.
Qumulo’s file system is an ideal solution for storing, managing and accessing genomic sequencing data. It handles small files efficiently, and its support of SMB, NFS, FTP and REST means that all phases of the genomic analysis pipeline can use the same Qumulo cluster. Qumulo is a modern, file storage system that can scale to billions of files and that runs in the data center and the public cloud.
Qumulo’s file system handles small files, such as TIFF and BCL, as efficiently as large ones. With Qumulo, researchers can perform their analyses in real time, which translates into cost efficiencies and faster time to market.
Qumulo makes 100% of user-provisioned capacity available for file storage, in contrast to legacy scale-up and scale-out NAS systems, which recommend using only 70% to 80% of usable capacity.
Each time customers add a node to a Qumulo cluster, it scales linearly in both capacity and performance. There is no practical limit to the number of files Qumulo can store.
Qumulo’s real-time visibility and control provide information about what’s happening in the storage system, down to the file level. System administrators can apply quotas in real time.
Cloud and on-prem
Continuous replication means you can easily transfer data from your on-prem Qumulo cluster to your Qumulo cluster in AWS, perform your computations, and then transfer the results back to the on-prem storage.
Genomic data storage: NGS workflow
Here is an example workflow for doing NGS on premises:
In this example, the DNA sequencers generate many small BCL (base call) files, which contain unordered DNA sequence fragments. A demultiplexing process assembles the BCL files into a FASTQ file, a text file that stores the combined reads from the BCL files along with their corresponding quality scores.
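To make the FASTQ output concrete, here is a minimal Python sketch that parses one four-line FASTQ record and decodes its Phred+33 quality string. The parser and the example read are invented for illustration; they are not part of any sequencer’s or Qumulo’s software.

```python
# Minimal illustration of the FASTQ format produced by demultiplexing.
# Each record is four lines: identifier, sequence, separator, quality string.
# The quality string encodes one Phred score per base (ASCII code minus 33).

def parse_fastq_record(lines):
    """Parse a single 4-line FASTQ record into its parts."""
    header, sequence, _separator, quality = lines
    # Phred+33 encoding: score = ord(char) - 33
    scores = [ord(c) - 33 for c in quality]
    return {"id": header[1:], "seq": sequence, "qual": scores}

# A made-up example record (not real sequencer output):
record = parse_fastq_record([
    "@READ_1",
    "GATTACA",
    "+",
    "IIIIHH#",   # 'I' = Phred 40, 'H' = 39, '#' = 2
])
print(record["id"])    # READ_1
print(record["qual"])  # [40, 40, 40, 40, 39, 39, 2]
```

Each Phred score gives the estimated probability that the corresponding base call is wrong, which is why quality scores travel alongside the sequence data through the rest of the pipeline.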
The compute farm performs alignment and variant calling. In alignment, sequence fragments are quality checked, preprocessed and aligned to a reference genome. A BAM file is a binary file that stores this alignment data. Variant calling looks for differences between the data and the reference genome. Results are stored in a VCF file.
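As a rough illustration of what variant calling produces, the following Python sketch parses the mandatory leading columns of one VCF data line. The parsing function and the sample line are hypothetical examples, not output from a real pipeline.

```python
# Minimal illustration of a VCF data line, which records one variant
# (a difference between the sample and the reference genome).
# Fields are tab-separated; only the first six mandatory columns are shown.

def parse_vcf_line(line):
    """Split the leading mandatory VCF columns into a dict."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual = fields[:6]
    return {
        "chrom": chrom,       # chromosome name
        "pos": int(pos),      # 1-based position on the chromosome
        "id": vid,            # variant identifier ('.' if none)
        "ref": ref,           # base(s) in the reference genome
        "alt": alt,           # base(s) observed in the sample
        "qual": float(qual),  # Phred-scaled call quality
    }

# A made-up variant: the sample has a G where the reference has an A.
variant = parse_vcf_line("chr1\t12345\t.\tA\tG\t99.0\tPASS\t.")
print(variant["ref"], "->", variant["alt"])  # A -> G
```

Downstream analysis tools read these records to find, for example, patients whose tumors carry a specific mutation.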
Once these data stores are ready, they can be used for application-specific analysis, which researchers perform for their own projects. For example, a researcher might be working on a targeted therapy for patients whose tumors have a specific gene mutation. Researchers may use all of the generated data contained in the BAM and VCF files.
Qumulo provides a central file storage system that is suited for all types of genomic data. Qumulo has industry-leading small-file efficiency and the throughput to handle all phases of the workflow.
Genomic data storage: NGS workflow on AWS
Here is a workflow example that shows how to perform analysis in the cloud with Qumulo for AWS and EC2 spot instances.
Qumulo enables workflows that span on-premises data centers and the cloud. In this example, the Qumulo cloud cluster on AWS and the local Qumulo cluster are part of the same storage fabric because of continuous replication, which keeps both clusters in sync. An organization can take advantage of EC2 spot instances to keep costs down.
“Our research organization falls between the cracks for most storage vendors, with giant imaging sets and millions of tiny genetic sequencing scraps. Finding a system that reasonably handled all our complex workflows was difficult, and in the end only Qumulo was the right fit.”
Bill Kupiec — IT Manager, Department of Embryology, Carnegie Institution for Science