Genomic sequencing has undergone a dramatic transformation in the past decade, driven by a family of techniques collectively referred to as “next-generation sequencing,” or NGS. As NGS continues to evolve, the storage and data management systems that support these growing capacities must evolve just as rapidly.
IT administrators are under pressure to find ways to increase efficiency within their storage infrastructures.
Machines called sequencers read the DNA fragments extracted from biological samples. Compared with first-generation sequencing, next-generation sequencing offers much higher throughput of genetic sequences, automated production, and drastically lower cost. Using NGS, an entire human genome can be sequenced in a single day.
As sequencers have become more advanced and cost effective, the number of studies continues to grow, and more data is produced. These sequencers can produce billions of small files, so the file system that manages these massive capacities of small files needs to be fast, easily scalable, and efficient in how it stores and protects data, in order to meet research budgets and support new research projects.
Helping Progenity Deliver Diagnostic Tests and Information Faster
Progenity, Inc. is a biotech company that provides clinicians with complex molecular and specialized diagnostic tests for women’s health, reproductive medicine, and oncology.
Over the years, the company’s work in genetic sequencing has generated more than a billion files. According to David Meiser, Solutions Architect for Linux and Windows applications at Progenity, “That pace is accelerating. Within two years, we might have another billion files.”
“One problem that was always present was that there was significant file overhead,” said Meiser. “The files we write are very small, and the block size of our old storage system was very large.” Further, Meiser explained, “We found that we couldn’t do analysis in-place because the access times were super high.”
Legacy file systems, built on designs that are 15 or 20 years old, cannot meet the demands of modern NGS workflows.
Too often, IT organizations are forced to use different solutions for different parts of their NGS workflows to compensate for the inefficiencies of their legacy systems. This is problematic for several reasons:
- Multiple systems add complexity, which translates into higher overall operational costs.
- Multiple systems can also cause data silos, so that one group of researchers may not be able to access data another team is using.
- Lack of collaboration slows time to results, which can delay project completion or getting a product to market.
With its rapid growth and data-intensive workflows, Progenity knew that its legacy system vendor would be unable to meet its future needs. “After a few years with our original storage system, we realized that the way the company worked wasn’t a good model for us,” said Meiser, referring to both high costs and storage inefficiencies.
On-Prem and Cloud-Based NGS Workflow Configurations
Qumulo’s file data platform meets the performance and capacity demands for storing, managing and accessing genomic sequencing data, on-prem or in the cloud. It manages billions of small and large files, and supports a variety of protocols including SMB, NFS, FTP and REST, which means that all phases of the genomic analysis workflow can use the same Qumulo cluster.
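To make the multi-protocol point concrete, here is a minimal Python sketch of touching the same dataset two ways on one cluster: through a POSIX path on an NFS mount, and through the REST API. The cluster hostname, credentials, directory names, and exact REST routes below are illustrative assumptions, not a prescribed configuration; consult the Qumulo REST API documentation for the authoritative endpoints.

```python
# A minimal sketch: access the same directory over an NFS mount (POSIX
# path) and over the REST API. Hostname, credentials, and endpoint paths
# are illustrative assumptions -- check the Qumulo REST API docs.
import os
import requests

CLUSTER = "qumulo.example.com"   # hypothetical cluster hostname
NFS_MOUNT = "/mnt/genomics"      # same share, mounted via NFS

# 1) POSIX access through the NFS mount: list a sequencing run directory.
run_files = os.listdir(os.path.join(NFS_MOUNT, "run_001"))

# 2) REST access to the same directory (routes are assumptions).
session = requests.post(
    f"https://{CLUSTER}:8000/v1/session/login",
    json={"username": "researcher", "password": "..."},
    verify=False,  # lab clusters often use self-signed certificates
).json()
headers = {"Authorization": f"Bearer {session['bearer_token']}"}
listing = requests.get(
    f"https://{CLUSTER}:8000/v1/files/%2Fgenomics%2Frun_001%2F/entries/",
    headers=headers,
    verify=False,
).json()

print(run_files)
print([entry["name"] for entry in listing.get("files", [])])
```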
Below is an example of an on-prem NGS workflow configuration.
This example shows the DNA sequencers generating many small BCL files, which hold base calls: unordered DNA sequence fragments. A demultiplexing step converts the BCL files into FASTQ files, text files that store the sequence reads from the BCL output along with their corresponding per-base quality scores.
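For readers unfamiliar with the format, here is a minimal Python sketch of reading FASTQ records. The four-line record layout and Phred+33 quality encoding are standard for modern pipelines; the sample file name is a placeholder.

```python
# A minimal FASTQ reader -- each record is four lines: an "@" header,
# the base calls, a "+" separator, and per-base quality scores encoded
# as ASCII characters (Phred+33 in most modern pipelines).
from itertools import islice

def read_fastq(path):
    """Yield (read_id, sequence, quality_scores) tuples from a FASTQ file."""
    with open(path) as fh:
        while True:
            record = list(islice(fh, 4))
            if len(record) < 4:
                return
            header, seq, _plus, qual = (line.rstrip("\n") for line in record)
            # Convert ASCII quality characters to numeric Phred scores.
            scores = [ord(c) - 33 for c in qual]
            yield header[1:], seq, scores

# Example: mean base quality of the first read in a (hypothetical) file.
for read_id, seq, scores in read_fastq("sample_R1.fastq"):
    print(read_id, len(seq), sum(scores) / len(scores))
    break
```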
The compute farm performs alignment and variant calling. In alignment, sequence fragments are quality checked, preprocessed, and aligned to a reference genome. A BAM file is a binary file that stores this alignment data. Variant calling then looks for differences between the aligned reads and the reference genome; the results are stored in a VCF file.
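To sketch what this stage can look like in practice, the snippet below drives the widely used open-source tools bwa, samtools, and bcftools from Python. This is one common way to produce BAM and VCF files, not necessarily the pipeline any particular lab runs; the file and reference paths are placeholders, and real pipelines add QC, read groups, and duplicate marking.

```python
# A minimal alignment + variant-calling sketch using bwa, samtools, and
# bcftools. Paths and sample names are illustrative placeholders.
import subprocess

REF = "/genomics/ref/GRCh38.fa"                 # hypothetical reference genome
FASTQS = ["sample_R1.fastq", "sample_R2.fastq"]  # paired-end reads

def run(cmd):
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True)

# Align reads to the reference and sort the output into a BAM file.
run(f"bwa mem {REF} {' '.join(FASTQS)} | samtools sort -o sample.bam -")
run("samtools index sample.bam")

# Call variants against the reference; results land in a VCF file.
run(f"bcftools mpileup -f {REF} sample.bam | bcftools call -mv -Ov -o sample.vcf")
```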
Once these files are ready, they can be used for application-specific analysis, which researchers perform for their own projects. For example, a researcher might be working on a targeted therapy for patients whose tumors carry a specific gene mutation. Researchers may use all of the generated data contained in the BAM and VCF files.
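Here is a minimal sketch of one such analysis step: scanning a VCF file for variants that fall inside a gene of interest. The gene interval below is a made-up placeholder, and real pipelines would typically use annotated VCFs and dedicated libraries rather than hand parsing.

```python
# A minimal application-specific analysis sketch: report VCF variants
# that fall inside a gene of interest. The interval is a placeholder.
GENE = ("chr7", 55_019_017, 55_211_628)  # hypothetical target-gene interval

def variants_in_gene(vcf_path, chrom, start, end):
    """Yield (pos, ref, alt) for variant records inside [start, end] on chrom."""
    with open(vcf_path) as fh:
        for line in fh:
            if line.startswith("#"):   # skip VCF header and column lines
                continue
            fields = line.rstrip("\n").split("\t")
            c, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
            if c == chrom and start <= pos <= end:
                yield pos, ref, alt

for pos, ref, alt in variants_in_gene("sample.vcf", *GENE):
    print(f"{GENE[0]}:{pos} {ref}>{alt}")
```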
Here is a workflow example that shows how to perform analysis in the cloud with Qumulo on AWS, using EC2 spot instances.
In this example, through continuous replication, the Qumulo cloud cluster on AWS and the local Qumulo cluster are always in sync. An organization can take advantage of EC2 spot instances to keep costs down.
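For illustration, here is a minimal boto3 sketch of launching analysis nodes as spot instances. The AMI ID, instance type, region, and the user-data script that mounts the (hypothetical) Qumulo cloud cluster are all assumptions, not a prescribed configuration.

```python
# A minimal sketch of launching spot-priced analysis nodes with boto3.
# AMI, instance type, and mount details are illustrative placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

user_data = """#!/bin/bash
# Mount the (hypothetical) Qumulo cloud cluster over NFS, then run analysis.
mkdir -p /mnt/genomics
mount -t nfs qumulo-cloud.internal:/genomics /mnt/genomics
"""

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder analysis AMI
    InstanceType="c5.4xlarge",
    MinCount=1,
    MaxCount=8,
    UserData=user_data,
    # Request spot capacity instead of on-demand to keep compute costs down.
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {"SpotInstanceType": "one-time"},
    },
)
print([instance["InstanceId"] for instance in response["Instances"]])
```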
Learn more
Qumulo has several helpful resources for learning more about genomic data and sequencing and how our file data platform helps organizations to store, manage and access genomic sequencing data on-prem and in the cloud. Read our solution brief here, and check out our on-demand webinar, “Accelerating Genomic Research with Hybrid Cloud Solutions.”
Contact us here if you’d like to set up a meeting or request a demo.