Storage for Genomic Data and Sequencing
The ability to extract genetic information, the biological code of all life, has undergone a dramatic transformation in the past decade.
The new techniques are collectively referred to as “next-generation sequencing” or NGS. Compared to the traditional, first-generation sequencing method (“Sanger sequencing”), NGS has higher throughput of genetic sequences, automated production and drastically lower cost.
To put this in context, it took the Human Genome Project more than a decade and close to three billion dollars to sequence the first human genome. Using NGS, an entire human genome can be sequenced within a single day for around $1,000.
The consequence of NGS has been a rapid expansion in the amount of genomic data collected and in the variety of applications that use this data. Today, genetic sequencing serves as the foundation for:
- Primary life sciences research (universities, institutes)
- Diagnostics (clinical uses)
- Drug discovery (pharmaceutical companies)
- Biomarker discovery (primarily drug companies)
- Personalized medicine (heredity, etc)
- Agriculture and animal research
The benefits of NGS are concrete. For example, 10% of cancer is hereditary. Because of NGS, people can simply arrange with their doctors to have a test that determines if they (and by extension, their family members) are at risk for certain types of cancer. Newborns commonly receive genetic testing. These tests look for genetic defects that can be treated to prevent death or disease in the future. Adults can be tested to determine whether they are carriers for diseases such as cystic fibrosis, Tay-Sachs disease (a fatal disease resulting from the improper metabolism of fat), or sickle cell anemia.
NGS has increased the storage requirements for genomic data dramatically. As sequencers become more advanced, they produce more data. Also, efficiency has reduced costs, which means that more organizations can do more sequencing. Sequencers produce so much data that it’s not uncommon for a single lab to generate more than a billion files in a year. Globally, sequence data doubles approximately every seven months and is outstripping YouTube, Twitter, and astronomy in terms of storage growth. To keep up, IT administrators are under pressure to find ways to expand and manage their storage infrastructure.
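To make the seven-month doubling figure concrete, here is a minimal Python sketch of the arithmetic. It assumes growth stays at a steady doubling rate, which real labs will not match exactly. The function names and the starting sizes are hypothetical and chosen only for illustration.

```python
import math


def projected_size(initial_tb: float, months: float,
                   doubling_months: float = 7.0) -> float:
    """Projected data size after `months`, assuming steady doubling."""
    return initial_tb * 2 ** (months / doubling_months)


def months_until_full(initial_tb: float, capacity_tb: float,
                      doubling_months: float = 7.0) -> float:
    """Months until data growing at the doubling rate reaches capacity_tb."""
    if initial_tb <= 0 or capacity_tb < initial_tb:
        raise ValueError("need 0 < initial_tb <= capacity_tb")
    return doubling_months * math.log2(capacity_tb / initial_tb)


if __name__ == "__main__":
    # Hypothetical lab that holds 100 TB of sequence data today.
    print(f"after 12 months: {projected_size(100, 12):.0f} TB")
    print(f"months until 800 TB is full: {months_until_full(100, 800):.0f}")
```

Doubling every seven months works out to a little more than tripling each year (a factor of about 3.3), so a system sized at eight times today's data would fill in about 21 months under this assumption.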
Legacy storage systems, which are based on 15- or even 20-year-old designs, cannot meet the demands of modern NGS workflows. IT organizations are now forced to use different solutions for different parts of their NGS workflows to compensate for the inefficiencies in their legacy systems. Multiple systems add complexity, which translates into higher maintenance costs. Multiple systems can also create data silos, so that one group of researchers may not be able to access data another team is using. Lack of collaboration can lengthen the time it takes to get results, which can delay getting a product to market.
Raw NGS data coming from a sequencer consists of many small TIFF files, each about 1 KB in size. The large number of small files slows down legacy storage systems. When this happens, the compute resources are starved of data and researchers cannot get their results in real time. Slowing down highly paid researchers is not only expensive but can also delay time to market.
Small files make up the bulk of an NGS data set but legacy systems store them inefficiently because they rely on mirroring, which wastes storage space. Wasted storage space translates into higher costs, both in terms of the number of disks IT must buy and in infrastructure costs such as rack space, power and cooling.
NGS organizations can end up storing billions of files. Legacy storage systems can’t give IT administrators the visibility they need to manage so many assets. Legacy systems use separate, off-cluster appliances that rely on obsolete methods to gather data. These methods are sequential processes, such as tree walks, which cannot produce results in a reasonable amount of time when an organization is storing so many assets. It can take days or weeks to get answers to simple questions, long past the point when those answers can be of any use.
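The cost of a tree walk is easy to see in code. The sketch below is a generic Python illustration, not the method used by any particular storage product. The function name `walk_usage` is hypothetical. To answer "how many files are under this directory, and how much space do they use?" it has to visit every entry one at a time.

```python
import os
from typing import Tuple


def walk_usage(root: str) -> Tuple[int, int]:
    """Return (file_count, total_bytes) under root by visiting every file."""
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # One metadata lookup per file; lstat does not follow symlinks.
                total_bytes += os.lstat(path).st_size
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
            file_count += 1
    return file_count, total_bytes
```

The work grows in direct proportion to the number of files, so a question that takes a moment on a few thousand files requires billions of metadata lookups on an NGS archive.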
NGS organizations are looking to the cloud for two reasons. One is that, with its scalable, on-demand resources, the cloud is the logical answer when an organization needs extra compute power for a demanding or unexpected project. The other is that many NGS organizations share data and collaborate on projects with researchers all over the world. The cloud is one way to make data easily accessible. The challenge is that legacy file storage vendors either have no cloud solution or offer versions of their products that have been patched to make them “cloud ready.” Problems with the cloud solutions that do exist include poor performance, lack of protocol support and complexity.
QF2 is the file storage system for NGS
Qumulo File Fabric (QF2) is an ideal solution for storing, managing and accessing genomic sequencing data. It handles small files efficiently, and its support of SMB, NFS, FTP and REST means that all phases of the genomic analysis pipeline can use the same QF2 cluster. QF2 is a modern file storage system that can scale to billions of files and that runs in the data center and the public cloud.
QF2 handles small files, such as TIFF and BCL, as efficiently as large ones. With QF2, researchers can perform their analyses in real time, which translates into cost efficiencies and faster time to market.
QF2 makes 100% of user-provisioned capacity available for file storage, in contrast to legacy scale-up and scale-out NAS systems, whose vendors recommend using only 70% to 80% of usable capacity. Efficient use of disk space decreases the data footprint and saves not just on the cost of the storage system but also on infrastructure costs.
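The effect of a fill-level guideline on purchasing can be worked out directly. The Python sketch below is only arithmetic on the percentages quoted above. The 500 TB data set and the function name are hypothetical.

```python
def capacity_needed(data_tb: float, max_fill: float) -> float:
    """Capacity to provision so that data_tb fits within the fill guideline."""
    if not 0 < max_fill <= 1:
        raise ValueError("max_fill must be greater than 0 and at most 1")
    return data_tb / max_fill


if __name__ == "__main__":
    data_tb = 500.0  # hypothetical data set size
    for label, max_fill in [("70% guideline", 0.70),
                            ("80% guideline", 0.80),
                            ("100% usable", 1.00)]:
        needed = capacity_needed(data_tb, max_fill)
        print(f"{label}: provision {needed:.0f} TB")
```

Under these assumptions, storing 500 TB takes about 714 TB of capacity at a 70% guideline and 625 TB at 80%, compared with 500 TB when all provisioned capacity can be used.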
Real-time visibility and control
QF2’s real-time visibility and control provide information about what’s happening in the storage system, down to the file level, no matter how many files are in the system. System administrators can apply quotas in real time. The capacity explorer and capacity trends tools give IT the information it needs to plan sensibly for the future and avoid wasting money on overprovisioning. QF2 is so simple to set up and manage that once senior staff define the configuration, day-to-day management can be done by junior staff.
Cloud and on-prem
Organizations that want to move some of their genomic analysis workloads to the cloud can take advantage of QF2 for AWS. QF2 has the highest performance of any cloud offering and is the only file storage system in the cloud with a full set of enterprise features, such as multi-protocol support and real-time visibility.
QF2 uses continuous replication to move data where it’s needed, when it’s needed. Continuous replication creates a copy of the data in a directory on your primary cluster and transfers it to a directory on a second, target cluster. Continuous replication is always running (unless you configure it not to). QF2 takes your latest changes and replicates them without you needing to worry about it.
Continuous replication means you can easily transfer data from your on-prem QF2 cluster to your QF2 cluster in AWS, perform your computations, and then transfer the results back to the on-prem storage.
Each time customers add a node to a QF2 cluster, they scale up linearly, both in terms of capacity and performance. There is no practical limit to the number of files QF2 can store.
Genomic data storage: NGS workflow
Here is an example workflow for doing NGS on premises:
In this example, the DNA sequencers are generating many small BCL files or base calls, which are unordered DNA sequence fragments. A process of demultiplexing assembles BCL files into a FASTQ file, which is a text file that stores the combined output results of the BCL files along with corresponding quality scores.
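For readers unfamiliar with FASTQ, the following Python sketch shows what one record looks like and how its quality scores can be decoded. It assumes the common four-line record layout and Phred+33 quality encoding. The names `FastqRecord` and `read_fastq` are hypothetical, and production pipelines would normally rely on an established parser instead.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text (assumes Phred+33 qualities)."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch: " + header)
        # Phred+33: each quality character is the score plus ASCII offset 33.
        scores = [ord(ch) - 33 for ch in quality]
        yield FastqRecord(header[1:], sequence, scores)


if __name__ == "__main__":
    sample = io.StringIO("@read1\nGATTACA\n+\nIIIIII#\n")
    for record in read_fastq(sample):
        print(record.read_id, record.sequence, record.qualities)
```

Each record carries a read identifier, the base sequence, and one quality character per base, which is why FASTQ files are much larger than the sequence text alone.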
The compute farm performs alignment and variant calling. In alignment, sequence fragments are quality checked, preprocessed and aligned to a reference genome. A BAM file is a binary file that stores this alignment data. Variant calling looks for differences between the data and the reference genome. Results are stored in a VCF file.
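The idea behind variant calling can be illustrated with a deliberately simplified Python sketch. It compares a single, already aligned read with the reference and reports single-base differences. Real variant callers weigh many overlapping reads, base qualities and insertions or deletions, so this is a toy model only. The function name `find_mismatches` and the sequences are hypothetical.

```python
from typing import List, Tuple


def find_mismatches(reference: str, read: str,
                    offset: int) -> List[Tuple[int, str, str]]:
    """Compare an aligned read with the reference.

    `offset` is the 0-based reference index where the read starts.
    Returns (position, ref_base, alt_base) tuples with 1-based positions,
    the convention used by the POS column of a VCF file.
    """
    if offset < 0 or offset + len(read) > len(reference):
        raise ValueError("read does not fit inside the reference")
    mismatches = []
    for i, base in enumerate(read):
        ref_base = reference[offset + i]
        # 'N' means the sequencer could not call the base; ignore it.
        if base != "N" and base != ref_base:
            mismatches.append((offset + i + 1, ref_base, base))
    return mismatches


if __name__ == "__main__":
    reference = "GATTACAGATTACA"
    read = "TACGGAT"  # aligned at reference index 3
    print(find_mismatches(reference, read, 3))
```

In this example the read differs from the reference at one position, where the reference has an A and the read has a G. A VCF file records the same kind of information (position, reference base, alternate base) for every variant found across the genome.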
Once these data stores are ready, they can be used for application-specific analysis, which is done by researchers for their own projects. For example, a researcher might be working on a targeted therapy for patients whose tumor has a specific gene mutation. Researchers may use all the data contained in the BAM and VCF files.
QF2 provides a central file storage system that is suited for all types of genomic data. QF2 has industry-leading small file efficiency and has the throughput to handle all phases of the workflow.
Genomic data storage: NGS workflow on AWS
Here is a workflow example that shows how to perform analysis in the cloud with QF2 for AWS and EC2 spot instances.
QF2 enables workflows that span on-premises data centers and the cloud. In this example, the QF2 cloud cluster on AWS and the local QF2 cluster are part of the same storage fabric because of continuous replication, which keeps both clusters in sync. An organization can take advantage of EC2 spot instances to keep costs down.
Our research organization falls between the cracks for most storage vendors, with giant imaging sets and millions of tiny genetic sequencing scraps. Finding a system that reasonably handled all our complex workflows was difficult, and in the end only QF2 was the right fit.
Bill Kupiec — IT Manager, Department of Embryology Carnegie Institution for Science
Case study: Carnegie Science
Find out how the Department of Embryology tackles volume and variety of research data with QF2
Video: Driving research with QF2
See how the Scientific Computing and Imaging Institute at the University of Utah uses QF2 to power their research.