Real-Time Analytics: A Game-Changer for Managing Billions of Files

NPR recently reported on a fascinating new method by California scientists to count the marine population in the state’s protected waters — by performing DNA testing on just a liter of seawater. The old way of conducting a marine census was for a diver to record the number of fish and other wildlife on a waterproof clipboard. The new way uses genomic sequencing to detect the DNA left by marine life and determine what species have been in the area.

The story is a vivid example of a traditional, “dumb” approach to a massive data challenge being replaced, thanks to technology, by a new method that unlocks the value of data in radically more insightful, efficient and cost-effective ways.

A similarly dramatic transformation is playing out in big data storage.

Evolution of Big Data storage: a brief history

For decades, storage has served as little more than a dumping ground for data. In the 1990s, there was block storage for highly transactional data and file storage for unstructured and departmental data. Network-attached storage (NAS) emerged and improved performance and scalability. When scale-out NAS file storage couldn’t keep up with the capacity needed for web-scale requirements, object storage and flash became popular.

But despite all the advances, enterprise storage has struggled to provide a performance level that can meet the needs of Big Data and AI workloads. And it hasn’t been able to answer basic questions for organizations: What do I actually have? Where is my performance going right now? What has driven growth over the last six months? What is going to drive growth in the next six months?

In interviews Qumulo conducted with more than 600 storage administrators, buyers and users, we found that two things keep them up at night most: how to manage data growth, and a lack of understanding about all this data.

This is why real-time analytics is one of the primary benefits customers derive from Qumulo Core. As the world’s first and only solution that builds real-time file-system insight directly into a software-only scale-out NAS, Qumulo Core enables the management of billions of files and petabytes of data by making data visible through real-time capacity and performance analytics.

Managing billions of files without affecting file system performance

By offering real-time analytics that aggregate metadata at massive scale (tens of billions of files and many petabytes of storage), Qumulo Core answers questions that were previously mysteries: what is growing, where performance is going and what the storage footprint looks like over time.

A problem with traditional file systems is that manual or even automated processes for understanding details about stored data, such as tree walks, metadata scans and file system lookups, can be time-consuming and can significantly degrade performance. Qumulo leverages a flash tier as part of its flash-first hybrid design and updates file metadata analytics in real time without affecting file system performance.

Evolution of real-time analytics in file storage

Real-time metrics are surprisingly difficult to obtain from traditional storage systems. When file systems were designed decades ago, they only had to walk the directory and “stat” a few thousand files to obtain disk usage and other analytic data. This could be accomplished relatively quickly. Eventually, scale-out file systems came along and we had hundreds of millions of files to stat, which led to problems.
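To make the "walk the directory and stat every file" approach concrete, here is a minimal sketch in Python. The function name `walk_usage` is hypothetical and chosen only for illustration; it is not part of any Qumulo tooling. The point is that the work grows with the number of files: every file costs one metadata lookup.

```python
import os

def walk_usage(root):
    """Return (file_count, total_bytes) by walking root and stat-ing every file.

    Illustrative only: this is the traditional approach, where the cost
    is one metadata lookup per file under the root.
    """
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat: one metadata lookup per file, without following symlinks
                st = os.lstat(path)
            except OSError:
                # The file may have vanished or be unreadable; skip it
                continue
            file_count += 1
            total_bytes += st.st_size
    return file_count, total_bytes
```

Calling `walk_usage("/some/directory")` on a few thousand files returns quickly. On hundreds of millions of files, the same loop simply runs hundreds of millions of times.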

Assume it takes 5 ms to stat a file (which is common for files on HDD-based storage) to get analytic data. With a million files it takes about 1.4 hours to walk the directory tree; with a billion files it takes about 57.9 days. Various techniques have been devised to speed up the process, but these have issues. The basic problem is that traditional file systems and POSIX commands were not designed to deal with the sheer number of files that are stored on today’s file systems.
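The arithmetic behind those figures is easy to check. The short sketch below assumes a fully serial walk at a fixed 5 ms per stat, with no caching or parallelism; the function name is hypothetical.

```python
STAT_LATENCY_MS = 5  # assumed cost of one stat call, per the figure above

def walk_time_seconds(num_files, stat_latency_ms=STAT_LATENCY_MS):
    """Time for a serial walk that stats every file once, in seconds."""
    return num_files * stat_latency_ms / 1000

for n in (1_000_000, 1_000_000_000):
    secs = walk_time_seconds(n)
    print(f"{n:,} files: {secs / 3600:,.1f} hours ({secs / 86400:.1f} days)")
```

A million files works out to 5,000 seconds, or roughly 1.4 hours. A billion files works out to 5,000,000 seconds, or roughly 57.9 days.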

“Even one of the most trivial tasks — determining how much space the files on a file system are consuming — is very complicated to answer on first-generation file systems,” analyst firm the Taneja Group says. “Second-generation file systems need to be designed to be data-aware, not just storage-aware.”

Qumulo modernizes file system storage with real-time analytics

Qumulo Core’s real-time analytics help businesses obtain instant answers about their data footprint by explaining usage patterns and which users or workloads are impacting their performance and capacity.

Qumulo Core is powered by QSFS, the Qumulo Scalable File System, which integrates scalable analytics directly into the file system itself. Qumulo Core can report analytics for millions or billions of files stored on its file system in real time, not hours or days.
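One general way to get answers without a tree walk is to keep running totals for each directory and update them whenever a file changes. The sketch below is a toy, in-memory illustration of that idea only. It is not QSFS code and makes no claim about how Qumulo Core is implemented; the class `AggregatingTree` and its methods are hypothetical names.

```python
import posixpath
from collections import defaultdict

class AggregatingTree:
    """Toy namespace that updates per-directory totals on every write.

    Hypothetical illustration of built-in aggregation; not Qumulo's design.
    """

    def __init__(self):
        self.file_sizes = {}               # absolute file path -> size in bytes
        self.dir_bytes = defaultdict(int)  # directory -> bytes stored beneath it
        self.dir_files = defaultdict(int)  # directory -> files stored beneath it

    @staticmethod
    def _ancestors(path):
        """Yield each ancestor directory of an absolute path, ending at '/'."""
        current = posixpath.dirname(path)
        while True:
            yield current
            if current == "/":
                break
            current = posixpath.dirname(current)

    def write(self, path, size):
        """Create or resize a file and push the change up to every ancestor."""
        if not path.startswith("/"):
            raise ValueError("path must be absolute")
        old_size = self.file_sizes.get(path)
        delta_bytes = size - (old_size or 0)
        delta_files = 1 if old_size is None else 0
        self.file_sizes[path] = size
        for directory in self._ancestors(path):
            self.dir_bytes[directory] += delta_bytes
            self.dir_files[directory] += delta_files

    def delete(self, path):
        """Remove a file and subtract it from every ancestor's totals."""
        size = self.file_sizes.pop(path)
        for directory in self._ancestors(path):
            self.dir_bytes[directory] -= size
            self.dir_files[directory] -= 1

    def usage(self, directory):
        """Return (file_count, total_bytes) with a lookup instead of a walk."""
        return self.dir_files.get(directory, 0), self.dir_bytes.get(directory, 0)
```

The trade-off in this sketch is that each write does a little extra work, proportional to the depth of the path, so that a usage query on any directory becomes a single lookup no matter how many files sit beneath it.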

That’s not mere theory: Qumulo customers are enjoying these benefits today. Like the scientists in California, they’re seeing the benefits of using a data-aware approach to better understand their environment.
