Genomic Sequencing

The ability to extract genetic information, the biological code of all life, has undergone a dramatic transformation in the past decade. The new techniques are collectively referred to as “next-generation sequencing” or NGS. Compared to the traditional, first-generation sequencing method (“Sanger sequencing”), NGS offers higher throughput of genetic sequences, automated production and drastically lower cost.

To put this in context, it took the Human Genome Project 13 years and close to three billion dollars to sequence the first human genome. Using NGS, an entire human genome can be sequenced within a single day for around $1000.

The consequence of NGS has been a rapid expansion in the amount of genomic data collected and in the variety of applications that use this data. Today, genetic sequencing serves as the foundation for:

  • Primary life sciences research (universities, institutes)
  • Diagnostics (clinical uses)
  • Drug discovery (pharmas)
  • Biomarker discovery (primarily drug companies)
  • Personalized medicine (heredity, etc)
  • Agriculture and animal research

The benefits of NGS are concrete. For example, up to 10% of cancers are hereditary. Because of NGS, people can simply arrange with their doctors to have a test that determines whether they (and by extension, their family members) are at risk for certain types of cancer. Newborns commonly receive genetic testing that looks for genetic defects that can be treated to prevent death or disease in the future. Adults can be tested to determine whether they are carriers for diseases such as cystic fibrosis, Tay-Sachs disease (a fatal disease resulting from the improper metabolism of fat), or sickle cell anemia.

NGS has increased the storage requirements for genomic data dramatically. As sequencers become more advanced, they produce more data. Greater efficiency has also reduced costs, which means that more organizations can do more sequencing. Sequencers now produce so much data that it’s not uncommon for a single lab to generate more than a billion files in a year. Globally, sequence data doubles approximately every seven months and is outstripping YouTube, Twitter, and astronomy in terms of storage growth. To keep up, IT administrators are under pressure to find ways to expand and manage their storage infrastructure.

Legacy storage systems, which are based on 15- or even 20-year-old designs, cannot meet the demands of modern NGS workflows. IT organizations are now forced to use different solutions for different parts of their NGS workflows to compensate for the inefficiencies in their legacy systems. Multiple systems add complexity, which translates into higher maintenance costs. Multiple systems can also create data silos, so that one group of researchers may not be able to access data another team is using. Lack of collaboration can lengthen the time it takes to get results, which can delay a product's arrival on the market.

Performance Challenges

Raw NGS data coming from a sequencer consists of many small TIFF files, each about 1 KB in size. Large numbers of small files slow down legacy storage systems. When this happens, the compute resources are starved of data and researchers cannot get their results in real time. Slowing down highly paid researchers is not only expensive but can also delay time to market.

Efficiency Challenges

Small files make up the bulk of an NGS data set but legacy systems store them inefficiently because they rely on mirroring, which wastes storage space. Wasted storage space translates into higher costs, both in terms of the number of disks IT must buy and in infrastructure costs such as rack space, power and cooling.

Visibility Challenges

NGS organizations can end up storing billions of files. Legacy storage systems can't give the visibility into the storage system IT administrators need to manage so many assets. Legacy systems use separate, off-cluster appliances that rely on obsolete methods to gather data. These methods are sequential processes, such as tree walks, which cannot produce results in a reasonable amount of time when an organization is storing so many assets. It can take days or weeks to get answers to simple questions, long past when those answers can be of any use.
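To illustrate why a sequential tree walk scales poorly, here is a minimal Python sketch of one. It is not how any particular storage product gathers data, and the function name `usage_by_top_level_dir` is hypothetical. It answers a simple question (how much capacity does each top-level directory use?) by visiting every file one at a time.

```python
import os
from collections import defaultdict


def usage_by_top_level_dir(root):
    """Sum file sizes under each top-level directory of `root`.

    This is a plain sequential tree walk: every file is visited once,
    so the running time grows linearly with the number of files.
    """
    totals = defaultdict(int)
    root = os.path.abspath(root)
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # Files directly in `root` are grouped under "."
        top = "." if rel == "." else rel.split(os.sep)[0]
        for name in filenames:
            try:
                totals[top] += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # The file may have been removed while the walk was running.
                continue
    return dict(totals)
```

Because each file costs at least one metadata lookup, a walk over billions of small files can run for a very long time, and the answer may already be stale when it arrives.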

Cloud Challenges

NGS organizations are looking to the cloud for two reasons. One is that, with its scalable, on-demand resources, the cloud is the logical answer when an organization needs extra compute power for a demanding or unexpected project. The other is that many NGS organizations share data and collaborate on projects with researchers all over the world. The cloud is one way to make data easily accessible. The challenge is that legacy file storage vendors either have no cloud solution or offer versions of their products that have been patched to make them "cloud ready." Problems with the cloud solutions that do exist include poor performance, lack of protocol support and complexity.

QF2 is the file storage system for NGS

Qumulo File Fabric (QF2) is an ideal solution for storing, managing and accessing genomic sequencing data. It handles small files efficiently, and its support of SMB, NFS, FTP and REST means that all phases of the genomic analysis pipeline can use the same QF2 cluster. QF2 is a modern file storage system that can scale to billions of files and that runs in the data center and the public cloud.



Performance

QF2 handles small files, such as TIFF and BCL, as efficiently as large ones. With QF2, researchers can perform their analyses in real time, which translates into cost efficiencies and faster time to market.



Cost

QF2 makes 100% of user-provisioned capacity available for file storage, in contrast to legacy scale-up and scale-out NAS systems, whose vendors recommend using only 70% to 80% of usable capacity. Efficient use of disk space decreases the data footprint and saves not just on the cost of the storage system but also on infrastructure costs.
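As a rough illustration of this arithmetic, the sketch below estimates how much capacity must be provisioned to hold a data set when only a fraction of that capacity is recommended for use. The function name and the 500 TB data set are hypothetical. The 70%, 80% and 100% figures come from the paragraph above.

```python
def capacity_to_provision(data_tb, usable_fraction):
    """Return the capacity (in TB) needed to store `data_tb` of data
    when only `usable_fraction` of the capacity should be filled."""
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in the range (0, 1]")
    return data_tb / usable_fraction


# Hypothetical 500 TB data set at three recommended fill levels.
for fraction in (0.7, 0.8, 1.0):
    needed = capacity_to_provision(500, fraction)
    print(f"{fraction:.0%} usable: {needed:.0f} TB to provision")
```

Under these assumptions, a 500 TB data set needs about 714 TB of capacity at a 70% fill limit and 625 TB at 80%, compared with 500 TB when all provisioned capacity can be used.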



Real-time visibility and control

QF2's real-time visibility and control provide information about what's happening in the storage system, down to the file level, no matter how many files are in the system. System administrators can apply quotas in real time. The capacity explorer and capacity trends tools give IT the information it needs to plan sensibly for the future and avoid wasting money on overprovisioning. QF2 is so simple to set up and manage that once senior staff define the configuration, day-to-day management can be done by junior staff.



Cloud and on-prem

Organizations that want to move some of their genomic analysis workloads to the cloud can take advantage of QF2 for AWS. QF2 has the highest performance of any cloud offering and is the only file storage system in the cloud with a full set of enterprise features, such as multi-protocol support and real-time visibility.

QF2 uses continuous replication to move data where it's needed, when it's needed. Continuous replication creates a copy of the data in a directory on your primary cluster and transfers it to a directory on a second, target cluster. Continuous replication is always running (unless you configure it not to). QF2 takes your latest changes and replicates them without you needing to worry about it.

Continuous replication means you can easily transfer data from your on-prem QF2 cluster to your QF2 cluster in AWS, perform your computations, and then transfer the results back to the on-prem storage.

NGS workflow

Here is an example workflow for doing NGS on premises.

[Diagram: example on-premises NGS workflow]

In this example, the DNA sequencers generate many small BCL (base call) files, which hold unordered DNA sequence fragments. A process called demultiplexing assembles the BCL files into a FASTQ file, which is a text file that stores the combined output of the BCL files along with the corresponding quality scores.
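To make the FASTQ format concrete, here is a minimal Python sketch of a reader. It assumes the common four-line record layout (an `@` header, the sequence, a `+` separator, and a quality string) and Phred+33 quality encoding. Both are assumptions about the data, and the function name `read_fastq` is hypothetical. Production pipelines would normally use an established library instead.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, phred_scores) from four-line FASTQ records.

    Assumes unwrapped records and Phred+33 quality encoding.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # skip blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Phred+33: quality score = ASCII code of the character minus 33.
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores


example = io.StringIO("@read1\nGATTACA\n+\nIIIIII#\n")
for read_id, sequence, scores in read_fastq(example):
    print(read_id, sequence, scores)
```

Each base in the sequence line has one quality character in the matching position, which is why the two lines must be the same length.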

The compute farm performs alignment and variant calling. In alignment, sequence fragments are quality checked, preprocessed and aligned to a reference genome. A BAM file is a binary file that stores this alignment data. Variant calling looks for differences between the data and the reference genome. Results are stored in a VCF file.
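The idea behind variant calling can be sketched with a toy example. The Python code below is a deliberately simplified illustration, not the method used by any real variant caller, which would rely on statistical models, base qualities and indel handling. It assumes reads are already aligned and given as (0-based start position, sequence) pairs. The function name `call_variants` and the `min_depth` parameter are hypothetical.

```python
from collections import Counter, defaultdict


def call_variants(reference, aligned_reads, min_depth=2):
    """Report positions where a strict majority of aligned reads
    disagree with the reference base (substitutions only).

    Returns tuples of (1-based position, reference base,
    observed base, supporting reads, total depth).
    """
    pileup = defaultdict(Counter)
    for start, sequence in aligned_reads:
        for offset, base in enumerate(sequence):
            position = start + offset
            if 0 <= position < len(reference):
                pileup[position][base] += 1

    variants = []
    for position in sorted(pileup):
        counts = pileup[position]
        depth = sum(counts.values())
        base, support = counts.most_common(1)[0]
        if depth >= min_depth and base != reference[position] and support * 2 > depth:
            # Positions are reported 1-based, as in VCF files.
            variants.append((position + 1, reference[position], base, support, depth))
    return variants


reference = "ACGTACGTAC"
reads = [(0, "ACGTAC"), (2, "GTTCGT"), (3, "TTCGTA")]
print(call_variants(reference, reads))
```

In this made-up data, two of the three reads covering position 5 show a T where the reference has an A, so that position is the only one reported.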

Once these data stores are ready, they can be used for application-specific analysis, which is done by researchers for their own projects. For example, a researcher might be working on a targeted therapy for patients whose tumor has a specific gene mutation. Researchers may use all the data contained in the BAM and VCF files.

QF2 provides a central file storage system that is suited to all types of genomic data. QF2 has industry-leading small file efficiency and the throughput to handle all phases of the workflow.



NGS workflow on AWS

Here is a workflow example that shows how to perform analysis in the cloud with QF2 for AWS and EC2 spot instances.

[Diagram: example NGS workflow on AWS]

QF2 enables workflows that span on-premises data centers and the cloud. In this example, the QF2 cloud cluster on AWS and the local QF2 cluster are part of the same storage fabric because of continuous replication, which keeps both clusters in sync. An organization can take advantage of EC2 spot instances to keep costs down.

"Our research organization falls between the cracks for most storage vendors, with giant imaging sets and millions of tiny genetic sequencing scraps. Finding a system that reasonably handled all our complex workflows was difficult, and in the end only QF2 was the right fit."
Bill Kupiec, IT Manager, Department of Embryology
Carnegie Institution for Science

More resources

Case study: Carnegie Science

Find out how the Department of Embryology tackles volume and variety of research data with QF2.

Video: Driving research with QF2

See how the Scientific Computing and Imaging Institute at the University of Utah uses QF2 to power its research.

QF2 Technical Overview

QF2 is designed to meet today’s requirements of scale and data mobility. It is the world’s first universal-scale file storage system.

QF2 on AWS Data Sheet

QF2 is a highly scalable file storage system that runs in the data center and in AWS. Find out more by downloading the QF2 on AWS data sheet (PDF, 2 pages).

Taneja Group on QF2 on AWS

Find out what technology analysts from the Taneja Group think of QF2's ability to expand on-premises file storage systems to the cloud (PDF, 6 pages).
