Genomic data and sequencing
Store billions of small files efficiently with high-performance storage.
High performance for genomics workloads.
Qumulo’s file system is an ideal solution for storing, managing and accessing genomic sequencing data. It handles small files efficiently, and its support of SMB, NFS, FTP and REST means that all phases of the genomic analysis pipeline can use the same Qumulo cluster.
Scales to billions of files
Complete REST API
Legacy storage isn’t enough.
Next-generation sequencing (NGS) has increased the storage requirements for genomic data dramatically.
As sequencers become more advanced, they produce more data. Efficiency gains have also reduced costs, which means that more organizations can afford to do more sequencing.
Sequencers produce so much data that it’s not uncommon for a single lab to generate more than a billion files in a year. Globally, sequence data doubles approximately every seven months and is outstripping YouTube, Twitter, and astronomy in terms of storage growth.
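To make the doubling figure concrete, here is a minimal sketch of what a seven-month doubling time implies for capacity planning. It assumes steady exponential growth, which real data sets will only approximate. The function names and the 100 TB starting volume are hypothetical and used only for illustration.

```python
def projected_volume(initial: float, months: float, doubling_months: float = 7.0) -> float:
    """Return the volume after `months`, assuming it doubles every `doubling_months`."""
    return initial * 2 ** (months / doubling_months)


def annual_growth_factor(doubling_months: float = 7.0) -> float:
    """Return the factor by which the volume grows over 12 months."""
    return 2 ** (12 / doubling_months)


if __name__ == "__main__":
    start_tb = 100.0  # hypothetical starting volume in terabytes
    for year in range(1, 4):
        volume = projected_volume(start_tb, 12 * year)
        print(f"Year {year}: about {volume:,.0f} TB")
    print(f"Growth per year: about {annual_growth_factor():.1f}x")
```

Under these assumptions, a seven-month doubling time works out to slightly more than a threefold increase every year.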
To keep up, IT administrators are under pressure to find ways to expand and manage their storage infrastructure.
Legacy storage systems, which are based on 15- or even 20-year-old designs, cannot meet the demands of modern NGS workflows. IT organizations are forced to use different solutions for different parts of their NGS workflows to compensate for the inefficiencies of their legacy systems. Multiple systems add complexity, which translates into higher maintenance costs. They can also create data silos, so that one group of researchers may be unable to access data another team is using. Reduced collaboration can lengthen the time it takes to get results, which in turn can delay a product’s arrival on the market.
Qumulo Storage for Genomic Sequencing
Qumulo’s file system is an ideal solution for storing, managing and accessing genomic sequencing data.
Qumulo’s file system handles small files, such as TIFF and BCL, as efficiently as large ones. With Qumulo, researchers can perform their analyses in real time, which translates into cost efficiencies and faster time to market.
Each time customers add a node to a Qumulo cluster, they scale linearly in both capacity and performance. There is no practical limit to the number of files Qumulo can store.
Qumulo makes 100% of user-provisioned capacity available for file storage, in contrast to legacy scale-up and scale-out NAS that only recommend using 70% to 80% of usable capacity.
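The effect of a utilization ceiling on how much capacity must be provisioned can be sketched with simple arithmetic. The function name and the 700 TB data set below are hypothetical. The 70% and 80% ceilings come from the comparison above, and the calculation assumes the ceiling is the only overhead.

```python
def provisioned_capacity_needed(data_tb: float, max_utilization: float) -> float:
    """Return the capacity to provision so that `data_tb` fits under the utilization ceiling."""
    if not 0 < max_utilization <= 1:
        raise ValueError("max_utilization must be in the range (0, 1]")
    return data_tb / max_utilization


if __name__ == "__main__":
    data_tb = 700.0  # hypothetical data set size in terabytes
    for ceiling in (0.7, 0.8, 1.0):
        needed = provisioned_capacity_needed(data_tb, ceiling)
        print(f"{ceiling:.0%} ceiling: provision about {needed:,.0f} TB")
```

In this illustration, a 70% ceiling means provisioning roughly 1,000 TB to hold 700 TB of data, while a system that can be filled completely needs only the 700 TB.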
Qumulo’s real-time visibility and control provide information about what’s happening in the storage system, down to the file level. System administrators can apply quotas in real time.
Cloud and on-prem
Continuous replication means you can easily transfer data from your on-prem Qumulo cluster to your Qumulo cluster in AWS, perform your computations, and then transfer the results back to the on-prem storage.
Mixed protocol support
Support of SMB, NFS, FTP and REST means that all phases of the genomic analysis pipeline can use the same Qumulo cluster.
How it Works
Genomic data storage: NGS workflow
Here is an example workflow for doing NGS on premises:
In this example, the DNA sequencers generate many small BCL files containing base calls, which are unordered DNA sequence fragments. Demultiplexing assembles the BCL files into a FASTQ file, which is a text file that stores the combined output of the BCL files along with corresponding quality scores.
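To illustrate what a FASTQ file holds, here is a minimal reader sketch. It assumes the common four-line record layout (an identifier line starting with "@", the sequence, a separator line starting with "+", and a quality string) and the Phred+33 quality encoding. The names `FastqRecord` and `parse_fastq` are hypothetical, and production pipelines would normally rely on an established library.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def parse_fastq(handle: TextIO, phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence and quality lengths differ in {header!r}")
        # Each quality character encodes a Phred score as ASCII code minus the offset.
        scores = [ord(char) - phred_offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


if __name__ == "__main__":
    sample = "@read1\nGATTACA\n+\nIIIIII#\n"  # hypothetical single-read file
    for record in parse_fastq(io.StringIO(sample)):
        print(record.read_id, record.sequence, record.qualities)
```

Because each record is only a few short lines of text, per-read or per-tile outputs in a sequencing run tend to be small and numerous, which is the access pattern the storage has to handle.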
The compute farm performs alignment and variant calling. In alignment, sequence fragments are quality checked, preprocessed and aligned to a reference genome. A BAM file is a binary file that stores this alignment data. Variant calling looks for differences between the data and the reference genome. Results are stored in a VCF file.
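The idea behind variant calling can be shown with a deliberately naive sketch that reports single-base mismatches between one already-aligned read and the reference. This is a toy illustration only: real variant callers combine evidence from many reads with statistical models, and also handle insertions and deletions. The function name, the chromosome label, and the sequences are hypothetical. The output tuples loosely mirror the CHROM, POS, REF, and ALT columns of a VCF file, with 1-based positions.

```python
from typing import List, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, 1-based position, ref base, alt base)


def naive_snvs(reference: str, read: str, start: int, chrom: str = "chr1") -> List[Variant]:
    """Report single-base mismatches for a read aligned at 0-based offset `start`."""
    if start < 0 or start + len(read) > len(reference):
        raise ValueError("Read does not fit within the reference at this offset")
    variants: List[Variant] = []
    for offset, base in enumerate(read):
        ref_base = reference[start + offset]
        if base != ref_base:
            # Convert the 0-based index to a 1-based position.
            variants.append((chrom, start + offset + 1, ref_base, base))
    return variants


if __name__ == "__main__":
    reference = "ACGTACGT"  # hypothetical reference fragment
    read = "GTTC"           # hypothetical read aligned at offset 2
    for chrom, pos, ref, alt in naive_snvs(reference, read, start=2):
        print(chrom, pos, ref, alt)
```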
Once these data stores are ready, they can be used for application-specific analysis, which is done by researchers for their own projects. For example, a researcher might be working on a targeted therapy for patients whose tumor has a specific gene mutation. Researchers may use all the data contained in the BAM and VCF files.
Qumulo provides a central file storage system that is suited for all types of genomic data. Qumulo has industry-leading small file efficiency and has the throughput to handle all phases of the workflow.
Genomic data storage: NGS workflow on AWS
Here is a workflow example that shows how to perform analysis in the cloud with Qumulo for AWS and EC2 spot instances.
Qumulo enables workflows that span on-premises data centers and the cloud. In this example, the Qumulo cloud cluster on AWS and the local Qumulo cluster are part of the same storage fabric because of continuous replication, which keeps both clusters in sync. An organization can take advantage of EC2 spot instances to keep costs down.
“Our research organization falls between the cracks for most storage vendors, with giant imaging sets and millions of tiny genetic sequencing scraps. Finding a system that reasonably handled all our complex workflows was difficult, and in the end only Qumulo was the right fit.”
Bill Kupiec, IT Manager, Department of Embryology, Carnegie Institution for Science