Data Storage for Genomic Sequencing

February 14, 2018

Authored by:

Qumulo Team

A genome is the entire set of hereditary instructions for building, running, and maintaining an organism, and for passing life on to the next generation. Genomic sequencing determines the order of the DNA nucleotides, or bases, in a genome: the As, Cs, Gs, and Ts that make up an organism’s DNA. The human genome is made up of over 3 billion of these genetic letters.

Learn more: Qumulo data storage for genomic sequencing

Genomic sequencing has undergone a dramatic transformation in the past decade. New techniques have been developed that are collectively referred to as “next-generation sequencing” or NGS. Compared to first-generation sequencing (“Sanger sequencing”), NGS offers much higher throughput of genetic sequences, automated production, and drastically lower cost. Using NGS, an entire human genome can be sequenced in a single day. In contrast, first-generation techniques required over a decade to deliver the final draft of a single human genome. Estimates for how much it cost to map that first genome go as high as 3 billion dollars. Today, it would cost around $1,000.

Why does genomic sequencing matter?

Better, faster and cheaper genomic sequencing means that its impact on our lives is much greater. Researchers now are able to compare large stretches of DNA from different individuals quickly and cheaply. Such comparisons can yield an enormous amount of information about the role of inheritance in susceptibility to disease and in response to environmental influences. In addition, the ability to sequence the genome more rapidly and cost-effectively creates vast potential for diagnostics and therapies.

More concrete examples are the types of genetic tests that are becoming routine. Many people have genetic carrier tests to check for disorders that they can pass on to their children. Other tests can determine hereditary risks for certain types of cancers.

What does this mean for data storage for genomic sequencing?

Aside from the “This is so cool, I have to tell you about it” factor, why am I blogging about genomic sequencing?

DNA fragments from biological samples are read by machines called sequencers. The whole genome can’t be sequenced all at once because the methods we have today can only handle short stretches of DNA at a time. Consequently, sequencers produce lots and lots of small files. The raw image files are usually TIFF files, about 1 KB apiece, with a total of 2-5 TB per sample.

Data storage should be fast and efficient

Any machine that produces so many small files is going to need a storage system that has great performance and that stores and protects small files efficiently. Techniques such as mirroring can waste a lot of disk space. Wasted disk space means companies have to buy more storage, use up more rack space and pay more for infrastructure costs such as power and cooling.
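To make the cost of mirroring concrete, here is a rough Python sketch of the arithmetic. The function names, the 100 TB figure, and the 6+2 parity layout used for comparison are illustrative assumptions, not numbers from this post and not a description of how any particular product protects data.

```python
def usable_fraction_mirroring(copies: int) -> float:
    """Fraction of raw capacity that holds unique data when each block is stored `copies` times."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 / copies


def usable_fraction_erasure(data_blocks: int, parity_blocks: int) -> float:
    """Fraction of raw capacity that holds unique data in a data+parity stripe (hypothetical layout)."""
    if data_blocks < 1 or parity_blocks < 0:
        raise ValueError("need at least one data block and a non-negative parity count")
    return data_blocks / (data_blocks + parity_blocks)


def raw_capacity_needed_tb(data_tb: float, usable_fraction: float) -> float:
    """Raw capacity to buy in order to store `data_tb` of unique data."""
    return data_tb / usable_fraction


if __name__ == "__main__":
    data_tb = 100.0  # assumed amount of sequencing data to protect
    schemes = {
        "2-way mirroring": usable_fraction_mirroring(2),
        "3-way mirroring": usable_fraction_mirroring(3),
        "6+2 parity stripe (assumed)": usable_fraction_erasure(6, 2),
    }
    for name, fraction in schemes.items():
        raw_tb = raw_capacity_needed_tb(data_tb, fraction)
        print(f"{name}: {raw_tb:.1f} TB raw for {data_tb:.0f} TB of data")
```

Under these assumptions, mirroring needs two to three times the raw capacity of the data it protects, and every extra terabyte also costs rack space, power, and cooling.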

Qumulo is way more efficient at representing and protecting small files than legacy scale-out NAS, typically requiring one third the storage capacity and half the protection overhead.

I/O always matters

The process of refining the raw data—transforming the fragmented rough draft into a long, continuous final product without breaks or errors—is called finishing. Finishing involves different types of analyses, including hooking all the individual reads together into the proper order, checking for mistakes and gaps, and looking for differences between the final result and a reference genome. All these steps produce different types of files and all these steps require excellent I/O performance for fast analysis.

Fast I/O matters if there are lots of researchers on the other end of the workflow who are using the finished data for their own projects. Downstream researchers want to do their work in real time, not wait around because their own compute resources are starved of data.

Qumulo provides two times the price performance compared to legacy storage systems.

Storage should scale to billions of files

Very few organizations have just one sequencer. They have rows of them, all producing TBs of data a day. Even just a few sequencers can produce over a billion files a year, taking up 1-2 PB of storage. Different stages of the analyses are also stored for different amounts of time. While the raw TIFF files may only be stored for a few weeks, the other types of files may be stored for years. Huge volumes of data mean that the file storage must scale easily and, even better, adding a node should add not only capacity but also performance.
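The figures above also give a feel for the average file size and the sustained file-creation rate a storage system has to absorb. Here is a quick back-of-the-envelope sketch, assuming decimal units (1 PB = 10^9 MB) and files created evenly over a 365-day year. The function names are made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year
MB_PER_PB = 1e9                     # decimal units: 1 PB = 10**9 MB


def average_file_size_mb(total_pb: float, file_count: int) -> float:
    """Average file size in MB for `file_count` files occupying `total_pb` petabytes."""
    return total_pb * MB_PER_PB / file_count


def sustained_creates_per_second(files_per_year: int) -> float:
    """File creations per second if the yearly total were spread evenly."""
    return files_per_year / SECONDS_PER_YEAR


if __name__ == "__main__":
    files_per_year = 1_000_000_000  # "over a billion files a year"
    for total_pb in (1, 2):         # "1-2 PB of storage"
        size_mb = average_file_size_mb(total_pb, files_per_year)
        print(f"{total_pb} PB across a billion files: about {size_mb:.0f} MB per file")
    rate = sustained_creates_per_second(files_per_year)
    print(f"Sustained creation rate: about {rate:.1f} files per second")
```

Real workloads are bursty rather than even, so peak rates will be higher than this average suggests.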

With Qumulo, you can use any mix of large and small files and store as many files as you need. There is no practical limit with Qumulo’s advanced file-system technology. Many Qumulo customers have data footprints in excess of a billion files.

Visibility and control are crucial

When you have billions of files in a storage system, you need a way to manage them. Sequential techniques such as tree walks don’t work anymore. Getting information about the data can take days or even weeks, which means it’s useless.
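To see why a tree walk stops being practical, consider a minimal Python sketch of one. It visits every file, one at a time, to add up sizes, so the cost grows with the number of files. The helper names and the per-second rates in the estimate are assumptions chosen for illustration, not measurements of any system.

```python
import os
import tempfile


def walk_usage(root: str) -> tuple[int, int]:
    """Return (file_count, total_bytes) under `root` by visiting every file sequentially."""
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                # One metadata lookup per file: this is what makes the walk slow at scale.
                total_bytes += os.lstat(os.path.join(dirpath, name)).st_size
                file_count += 1
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return file_count, total_bytes


def estimated_walk_days(file_count: int, stats_per_second: float) -> float:
    """Days needed to visit `file_count` files at an assumed metadata lookup rate."""
    return file_count / stats_per_second / 86_400


if __name__ == "__main__":
    # Tiny demonstration on a throwaway directory with three 1 KB files.
    with tempfile.TemporaryDirectory() as root:
        for i in range(3):
            with open(os.path.join(root, f"tile_{i}.tif"), "wb") as handle:
                handle.write(b"\0" * 1024)
        print(walk_usage(root))

    # Hypothetical lookup rates; real rates depend on the storage and the network.
    for rate in (10_000, 1_000):
        days = estimated_walk_days(1_000_000_000, rate)
        print(f"1 billion files at {rate} lookups/s: about {days:.1f} days")
```

At the assumed rates, a single pass over a billion files takes somewhere between one and twelve days, and the answer describes the file system as it was when the walk started.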

Qumulo gives real-time visibility into the data, so administrators can quickly answer questions such as where the I/O hotspots are and take immediate action.

Multi-protocol support

Many sequencers send their data to storage over SMB but many researchers access the data over NFS. A storage system needs to support multiple protocols. Qumulo supports SMB, NFS, FTP and REST.

Moving to the cloud

Organizations are looking to the cloud to give them more compute resources for their analyses. They’re hampered because many of the options for file storage in the cloud have poor scalability and performance.

Qumulo Cloud Q for AWS has the highest performance of any file storage in the cloud, as well as being the most scalable. Unlike other options, performance and capacity can be scaled independently.

Qumulo uses continuous replication to move data where it’s needed, when it’s needed. Qumulo takes your latest changes and replicates them without you needing to worry about it. Continuous replication means you can easily transfer data from your on-premises Qumulo cluster to your Qumulo cluster in AWS, perform your analyses, and then transfer the results back to the on-premises storage.

Try the best data storage for genomic sequencing today

If you’re in a research group or company that’s doing genomic sequencing, make sure you ask the right questions before you buy a file storage system.

If you’re interested in learning more about how the Qumulo architecture can save you money while giving you capacity and scalability, read the Qumulo File Data Architecture Technical Guide.
