A Storage Nightmare: The “Small File” Taxman Comes Knocking!

It was a dark and stormy night.

The phone rings. It’s the office.

Your heart drops with surprise and horror.

Your systems are entirely out of storage…

This is a nightmare scenario for many storage system administrators. You do everything you can to understand your storage requirements, you buy the storage you think you need – even a bit more (or a lot more) so you can have some headroom for unforeseen scenarios – yet you still get surprised and run out of storage.

Managing small files

One thing that often catches system administrators by surprise is what we call the “small file tax.” It turns out that legacy scale-out storage systems don’t do a very good job when it comes to managing small files. By small, we mean anything under 128KB. Small files consume two to three times the storage you would expect – that’s a pretty hefty tax if there are a lot of them.

This is because these systems are based on a decades-old design that forces them to mirror (or double mirror, sometimes even triple mirror) files under a 128KB threshold. Not only does small file mirroring use extremely inefficient encoding, the space needed for it is deducted from what the vendor often reports as usable capacity. A previous blog post provides more detail on this (“Can I Really Use 100% of My Capacity? With Qumulo the Answer is Yes!”).

Mirroring is grossly inefficient because it simply creates two or three full copies of the data being protected that reside on different disks. While this is effective in terms of ensuring data protection, it reduces the available storage by half in the case of double mirroring, and by two-thirds in the case of triple mirroring. At terabyte scale, this is incredibly inefficient; at petabyte scale it is mind-boggling that a vendor would require you to use one-half to two-thirds of your storage for data protection.
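To make the arithmetic concrete, here is a minimal Python sketch of how much raw capacity plain mirroring spends on protection. The function name `protection_share` is ours, for illustration only. It assumes every block is simply stored in full a fixed number of times.

```python
from fractions import Fraction


def protection_share(copies: int) -> Fraction:
    """Share of raw capacity spent on redundant copies.

    Assumes plain mirroring: every block is stored `copies` times in full,
    so only 1/copies of the raw capacity holds unique data.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - Fraction(1, copies)


for copies in (2, 3):
    share = protection_share(copies)
    print(f"{copies} copies: {share} of raw capacity goes to protection")
```

With two copies the share is one-half, and with three copies it is two-thirds, which matches the figures above.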

There is a way to end this nightmare.

Data protection at the block level vs file level

At scale, it’s inefficient to protect small files simply by creating copies. Qumulo understood this early on. We developed a fundamentally different approach to data protection: protecting at the block level rather than the file level. Our custom erasure coding makes it possible to protect data effectively without having to create a one-to-one copy of the entire data volume.

Operating at the block rather than the file level means you don’t have to protect each file individually. Instead, block data is encoded into partially redundant segments that are stored across separate physical media.

When managing small files, block-level protection delivers storage efficiency up to 40% beyond file-based protection. You even get a 20% increase in efficiency on large files. In fact, you can store billions of small files just as efficiently as large ones.

Small file tax (before and after migration example)

Here’s an example of the small file tax, taken from a real Qumulo customer site.

Figure: Cloud storage migration, before and after. Left: Legacy Competitor System, Before Migration. Right: Qumulo Hybrid Cloud File Storage, After Migration.

This customer migrated about 30 million small files to a Qumulo cluster from a legacy storage cluster. The dialog box on the left (Legacy Competitor System – Before Migration) shows the amount of space those files took up on the legacy vendor’s system, which mirrors small files. The dialog box on the right (Qumulo Hybrid Cloud File Storage – After Migration) shows the amount of space the files take up on the Qumulo cluster.

In this real-world example, you can see the result of the legacy vendor’s small file tax: storing these files consumed more than three times as much usable space as the file data itself! It took 33.2TB of usable capacity to store 9.33TB of file data. On the Qumulo cluster, it took only 9.49TB. Qumulo eliminates the small file tax and stores small files as efficiently as large files.
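The ratio behind that claim is easy to check. Here is a short Python sketch that uses only the three figures quoted above. The helper name `overhead_ratio` is ours, for illustration.

```python
def overhead_ratio(used_tb: float, file_data_tb: float) -> float:
    """Usable capacity consumed per unit of file data stored."""
    return used_tb / file_data_tb


# Figures quoted in the customer example above.
FILE_DATA_TB = 9.33
LEGACY_USED_TB = 33.2
QUMULO_USED_TB = 9.49

legacy = overhead_ratio(LEGACY_USED_TB, FILE_DATA_TB)
qumulo = overhead_ratio(QUMULO_USED_TB, FILE_DATA_TB)
print(f"legacy: {legacy:.2f}x, Qumulo: {qumulo:.2f}x")
```

The legacy system works out to roughly 3.56 times the file data, while the Qumulo cluster comes in at roughly 1.02 times.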

What impact do small files have on data storage?

You might be asking yourself, “what impact does this small file tax have on my storage?”

With legacy systems, it’s impossible to say how much storage you’ll use unless you know in advance the exact size of each file that you plan to write, see how many fall below the 128KB threshold, and then do the math on each file. Talk about a nightmare, especially when you are dealing with billions of files!
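To give a feel for that per-file math, here is a rough Python sketch. It is not any vendor's actual sizing formula. It assumes "128KB" means 128 × 1024 bytes, that files under the threshold are stored as `mirror_copies` full copies, and that larger files carry a flat `large_file_overhead` multiplier. Both parameter names and their default values are hypothetical, chosen purely for illustration.

```python
from typing import Iterable

# Threshold named in the post, interpreted here as 128 * 1024 bytes (an assumption).
SMALL_FILE_THRESHOLD = 128 * 1024


def estimated_usage(
    file_sizes: Iterable[int],
    mirror_copies: int = 3,             # hypothetical: copies kept of each small file
    large_file_overhead: float = 1.25,  # hypothetical: multiplier for larger files
) -> float:
    """Estimate bytes consumed on a system that mirrors small files."""
    total = 0.0
    for size in file_sizes:
        if size < SMALL_FILE_THRESHOLD:
            total += size * mirror_copies        # mirrored in full
        else:
            total += size * large_file_overhead  # assumed flat overhead
    return total


# Usage: a made-up mix of one thousand 4 KiB files and ten 1 MiB files.
sizes = [4 * 1024] * 1000 + [1024 * 1024] * 10
print(f"logical bytes:   {sum(sizes)}")
print(f"estimated bytes: {estimated_usage(sizes):.0f}")
```

Even this simplified version needs the size of every file up front, which is exactly the information you rarely have before the data is written.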

As a result, it’s impossible to know how much usable capacity you actually have—or when you’ll run out. Instead, you’ll have to over-provision to make sure you’re covered. That means you’re actually wasting money in two ways: one, for the “usable” capacity you’re losing to the small file tax, and two, for the additional capacity you’re buying.

Qumulo makes it much simpler to estimate how much storage you’ll need. Instead of hoping for the best, or wrestling with complex estimations of the mix of large and small files in your workloads and hoping they’re not too far off the mark, you can just look at the web UI to see how much space is available. Your stored files will take the same amount of space regardless of how many are large or small. No “small file tax.” No surprises. No over buying. No over-provisioning.

Qumulo also provides the ability to monitor real-time performance, capacity, and usage, even for file counts numbering in the billions. With our real-time analytics, you can gain insights and prevent issues before they occur. Further, you can efficiently plan for future growth. Up-to-the-minute analytics allow administrators to rapidly pinpoint problems and effectively control how storage is being used.

Evaluating storage solutions

When evaluating data storage solutions, make sure you understand the data protection implications for small files (AKA the small file tax). Ask each vendor whether they mirror small files and, if so, how many times. Find out whether you will be buying twice as much, or even three times as much, storage as you actually need.

End your storage nightmare now. Seek a file storage solution that efficiently manages your data no matter what size of files you have. Qumulo delivers the transparency, predictability, and performance you need for modern digital-era data storage.
