A genome is the entire set of hereditary instructions for building, running, and maintaining an organism, and for passing life on to the next generation. Genomic sequencing determines the order of the DNA nucleotides, or bases (the As, Cs, Gs, and Ts), that make up an organism's DNA. The human genome is made up of over 3 billion of these genetic letters.
Genomic sequencing has undergone a dramatic transformation in the past decade. New techniques have been developed that are collectively referred to as “next-generation sequencing” or NGS. Compared to first-generation sequencing (“Sanger sequencing”), NGS offers much higher throughput, automated production and drastically lower cost. Using NGS, an entire human genome can be sequenced in a single day. In contrast, first-generation techniques required over a decade to deliver the final draft of a single human genome. Estimates of what it cost to map that first genome go as high as $3 billion. Today, it would cost around $1,000.
Better, faster and cheaper genomic sequencing means that its impact on our lives is much greater. Researchers now are able to compare large stretches of DNA from different individuals quickly and cheaply. Such comparisons can yield an enormous amount of information about the role of inheritance in susceptibility to disease and in response to environmental influences. In addition, the ability to sequence the genome more rapidly and cost-effectively creates vast potential for diagnostics and therapies.
More concrete examples include the genetic tests that are becoming routine. Many people take genetic carrier tests to check for disorders that they could pass on to their children. Other tests can determine hereditary risks for certain types of cancers.
Aside from the “This is so cool, I have to tell you about it” factor, why am I blogging about genomic sequencing?
The DNA fragments from biological samples are read by machines called sequencers. The whole genome can't be sequenced all at once because the methods we have today can only handle short stretches of DNA at a time. Consequently, those sequencers produce lots and lots of small files. The raw image files are usually TIFF files, about 1KB apiece, with a total of 2-5TB per sample.
Any machine that produces so many small files is going to need a storage system that has great performance and that stores and protects small files efficiently. Techniques such as mirroring can waste a lot of disk space. Wasted disk space means companies have to buy more storage, use up more rack space and pay more for infrastructure costs such as power and cooling.
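To make the small-file problem concrete, here is a minimal sketch of the arithmetic. It assumes a generic file system that allocates space in fixed-size blocks and protects data by keeping whole mirrored copies. The 4096-byte block size, the two-copy mirror, and the function names are illustrative assumptions, not a description of how QF2 or any particular product lays out data.

```python
import math

def raw_bytes_needed(file_size: int, block_size: int = 4096, copies: int = 2) -> int:
    """Raw disk bytes used by one file stored in whole blocks and mirrored.

    block_size and copies are hypothetical defaults chosen for illustration.
    """
    blocks = max(1, math.ceil(file_size / block_size))
    return blocks * block_size * copies

def overhead_ratio(file_size: int, block_size: int = 4096, copies: int = 2) -> float:
    """Raw bytes consumed per byte of actual file data."""
    return raw_bytes_needed(file_size, block_size, copies) / file_size

# A ~1KB image occupies one full block per copy.
print(raw_bytes_needed(1_000))       # 8192
print(overhead_ratio(1_000))         # 8.192
# A 1GB file pays almost exactly the mirroring factor and nothing more.
print(overhead_ratio(1_000_000_000))
```

Under these assumptions, a 1KB file consumes roughly eight times its own size in raw capacity, while a large file consumes about twice its size. That gap is why small-file efficiency matters so much for sequencer output.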
QF2 is way more efficient at representing and protecting small files than legacy scale-out NAS, typically requiring one third the storage capacity and half the protection overhead.
The process of refining the raw data—transforming the fragmented rough draft into a long, continuous final product without breaks or errors—is called finishing. Finishing involves different types of analyses, including hooking all the individual reads together into the proper order, checking for mistakes and gaps, and looking for differences between the final result and a reference genome. All these steps produce different types of files and all these steps require excellent I/O performance for fast analysis.
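As a rough illustration of those three steps, the sketch below joins reads by greedy overlap merging, treats reads that cannot be joined as a gap, and lists positions where the result differs from a reference. It is a toy example on made-up twelve-base sequences. Production pipelines use far more sophisticated assemblers and aligners, and all function names here are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # the remaining pieces do not overlap: there is a gap
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

def differences(seq: str, ref: str):
    """(position, reference base, sample base) wherever the two differ."""
    return [(i, r, s) for i, (s, r) in enumerate(zip(seq, ref)) if s != r]

reference = "ACGTTGCAAGCT"                 # made-up reference sequence
reads = ["ACGTTG", "TTGCAT", "CATGCT"]     # made-up overlapping reads
contigs = greedy_assemble(reads)
print(contigs)                             # ['ACGTTGCATGCT']
print(differences(contigs[0], reference))  # [(8, 'A', 'T')]
```

Even this toy version compares every read against every other read on each pass, which hints at why the real analyses are so demanding of compute and I/O.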
Fast I/O matters if there are lots of researchers on the other end of the workflow who are using the finished data for their own projects. Downstream researchers want to do their work in real time, not wait around because their own compute resources are starved of data.
QF2 provides twice the price/performance of legacy storage systems.
Very few organizations have just one sequencer. They have rows of them, all producing TBs of data a day. Even just a few sequencers can produce over a billion files a year, taking up 1-2PB of storage. Different stages of the analyses are also stored for different amounts of time. While the raw TIFF files may only be stored for a few weeks, the other types of files may be stored for years. Huge volumes of data mean that the file storage must scale easily and, even better, that adding a node should add not only capacity but also performance.
With QF2, you can use any mix of large and small files and store as many files as you need. There is no practical limit with Qumulo’s advanced file-system technology. Many Qumulo customers have data footprints in excess of a billion files.
When you have billions of files in a storage system, you need a way to manage them. Sequential techniques such as tree walks don’t work anymore. Getting information about the data can take days or even weeks, which makes that information useless.
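For readers who haven't run into one, a tree walk looks roughly like the sketch below: it visits every directory and asks the file system about every file, one at a time. This is a generic Python example built on the standard library, and the function name walk_usage is made up for illustration.

```python
import os

def walk_usage(root: str):
    """Visit every file under root in turn; return (file_count, total_bytes)."""
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                total += os.lstat(path).st_size  # one metadata lookup per file
            except OSError:
                continue  # file vanished or is unreadable; skip it
            count += 1
    return count, total
```

Calling walk_usage on a directory returns the number of files beneath it and their combined size. The work grows with the number of files: at an assumed 10,000 metadata lookups per second, a billion files would take about 100,000 seconds, or more than a day, for a single pass.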
QF2 gives real-time visibility into the data, so administrators can quickly answer questions such as where the I/O hotspots are and act on the answers immediately.
Many sequencers send their data to storage over SMB, but many researchers access the data over NFS, so a storage system needs to support multiple protocols.
QF2 supports SMB, NFS, FTP and REST.
Organizations are looking to the cloud to give them more compute resources for their analyses. They’re hampered because many of the options for file storage in the cloud have poor scalability and performance.
QF2 for AWS has the highest performance of any file storage in the cloud, as well as being the most scalable. Unlike other options, performance and capacity can be scaled independently.
QF2 uses continuous replication to move data where it’s needed, when it’s needed. QF2 takes your latest changes and replicates them without you needing to worry about it. Continuous replication means you can easily transfer data from your on-premises QF2 cluster to your QF2 cluster in AWS, perform your analyses, and then transfer the results back to the on-premises storage.
If you’re in a research group or company that’s doing genomic sequencing, make sure you ask the right questions before you buy a file storage system.
Contact Qumulo and find out how QF2 can fit into your sequencing workflows.
If you’re interested in learning more about how the QF2 architecture can save you money while giving you capacity and scalability, read the Qumulo File Fabric Technical Overview.
Tony works to help organizations in the Great Lakes region solve the challenge of unstructured data growth, focusing on life science and automotive use cases.