Nobody likes throwing things away, especially when that “thing” is data, which is how file systems get full. Sometimes file systems run out of capacity because of an engineering or user mistake, but often it’s just something that happens during a normal day. Admins typically don’t know the fine-grained value of the data the way their users do, so they can’t safely clean things up on the user’s behalf. But, at some point, something has to go.
The first challenge to regaining capacity is determining what to delete, and before you can do that, you need to know where to look. If you’re not familiar with recent activity in each directory structure (and who is?), you might try analyzing the file system with standard tools. This works great if the system only has ten thousand files in it. But what if it has ten million, a hundred million, or even a billion files in it? Assuming a single-threaded process, if each stat call takes a millisecond, a hundred million files take about a day (roughly 28 hours) to visit and generate a steady load of 1000 IOPS. So, not only is your info a little old, but it takes a long time to receive, and that’s just at the top level of your search! You will have to rinse and repeat as you descend into the file system.
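To make the cost of that kind of analysis concrete, here is a minimal sketch of a single-threaded tree walk in Python. It is only an illustration, not any particular tool: the function name `usage_by_top_level` is made up for this example, and it counts only the explicit stat calls it issues itself.

```python
import os

def usage_by_top_level(root):
    """Walk `root`; return ({top-level entry: bytes}, explicit stat calls made)."""
    totals = {}
    stat_calls = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # Files sitting directly in `root` are grouped under ".".
        top = "." if rel == "." else rel.split(os.sep)[0]
        for name in filenames:
            try:
                # One metadata operation per file: this is the expensive part.
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            stat_calls += 1
            totals[top] = totals.get(top, 0) + st.st_size
    return totals, stat_calls
```

Calling `usage_by_top_level("/some/share")` returns a per-directory byte count plus the number of stat calls it took to get there. The second number is the one that grows painfully with file count.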
Obviously, you need a solution with better performance. For instance, you might multi-thread the process. With twenty workers performing stat calls, all acting in parallel, you can reduce your day-long operation down to a little over an hour. The problem with this approach is that now you have a steady-state load on your system of 1000 times 20 workers, which equals 20000 IOPS! That’s a significant workload, and the important takeaway here is that’s 20000 IOPS that the production systems can’t use. All in the name of knowing where your capacity is.
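The arithmetic behind those numbers is easy to check. The sketch below assumes, as the text does, that every stat call takes one millisecond and that workers scale perfectly; the function name `scan_estimate` and its parameters are just for illustration.

```python
def scan_estimate(num_files, stat_latency_ms=1, workers=1):
    """Return (wall-clock seconds, steady-state IOPS) for one full scan.

    Assumes one stat call per file and perfectly parallel workers.
    """
    iops = workers * 1000 / stat_latency_ms   # stat calls per second overall
    seconds = num_files / iops
    return seconds, iops

single = scan_estimate(100_000_000)
parallel = scan_estimate(100_000_000, workers=20)
print(f"1 worker:   {single[0] / 3600:.1f} h at {single[1]:.0f} IOPS")
print(f"20 workers: {parallel[0] / 60:.0f} min at {parallel[1]:.0f} IOPS")
```

With these assumptions the single worker needs 27.8 hours at 1000 IOPS, and twenty workers need about 83 minutes at 20000 IOPS. The scan gets faster, but the total number of metadata operations stays the same.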
We discussed capacity and other common storage pains during a recent webinar.
Some ways to solve your pain of storage capacity management
When it comes to analyzing your capacity, there are a few standard techniques.
One technique is to make a full copy of the data in question as a backup and run stat calls against that metadata. This is not a terrible approach, because it uses the backup rather than the production system. While baselining the backup is expensive in terms of throughput, pulling just the changes from the production file system afterward is a reasonable compromise. Keep in mind that this technique does raise the cost of your backup tier, because the software that does the analysis has value and is priced accordingly. If you roll your own, then you can keep the cost of this option down.
A different option is to get more aggressive about scanning and build that functionality into your storage system, which means you allow external systems to query that data or issue requests to gather that data. This approach is not bad either. Running a local job to gather metadata cuts down on the round-trip time for all those stat calls. You’ll use up some IOPS because a tree walk and stat calls are still necessary, but the interface is more efficient than something like SMB or NFS.
Another approach is to use an external third-party system that scans everything you’ve got and gives you answers across the whole storage environment, including multiple storage vendors. If you have a lot of storage sprawl, a tool like this could help you get a complete picture, and that is very valuable. A lot of tools that do this also have some kind of data management or data movement capability. You could use what you learn about your storage environment to set up policy-based movement of data between tiers or workflow steps. The downside of this approach is that those tools still have to scan to find changes, so you haven’t really removed the metadata IOPS load from the storage systems and you’ll still be a little behind in terms of updates.
Finally, you can do away with scanning and stat calls altogether by having files and directories regularly report their changes to their parent directories, and by storing that data in the already-existing metadata database. This approach is a significant improvement because updates happen in near real time. If every object with fresh changes reports to its parent every 15 seconds, and a directory tree is eight levels deep, the root finds out about an add or delete at the deepest level within two minutes (8 × 15 seconds). That’s a lot better than an hour or a day! This is the approach Qumulo uses for its real-time analytics.
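The idea of changes climbing one level per reporting interval can be shown with a toy model. This is a rough sketch of the general technique only, not Qumulo’s implementation: the `Dir` class, `reporting_round`, and the 15-second constant are all assumptions made for this example.

```python
REPORT_INTERVAL_S = 15  # assumed reporting interval, taken from the example above

class Dir:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.aggregate = 0    # bytes this directory believes its subtree holds
        self.unreported = 0   # change not yet passed up to the parent

    def record_change(self, delta_bytes):
        self.aggregate += delta_bytes
        self.unreported += delta_bytes

def reporting_round(dirs):
    """Every directory with fresh changes reports them to its parent once."""
    # Snapshot first, so a change climbs exactly one level per round.
    pending = [(d, d.unreported) for d in dirs
               if d.unreported and d.parent is not None]
    for d, delta in pending:
        d.unreported -= delta
        d.parent.record_change(delta)

def rounds_until_root_sees(depth, delta_bytes=4096):
    """Build a chain `depth` levels deep, change the deepest level, count rounds."""
    root = Dir("root")
    dirs = [root]
    for level in range(1, depth + 1):
        dirs.append(Dir(f"level{level}", parent=dirs[-1]))
    dirs[-1].record_change(delta_bytes)
    rounds = 0
    while root.aggregate != delta_bytes:
        reporting_round(dirs)
        rounds += 1
    return rounds

rounds = rounds_until_root_sees(depth=8)
print(rounds * REPORT_INTERVAL_S, "seconds until the root sees the change")
```

For a tree eight levels deep, the model needs eight rounds, or 120 seconds, which matches the two-minute figure. Note that no tree walk happens anywhere: only directories with fresh changes do any work.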
Scanning has another limitation: no matter how much of it you do and no matter how many stat calls you make, you still can’t easily answer the most important question, “Which data matters?” Everyone thinks their data is critical. With Qumulo, if someone disputes the importance of a project that’s due to be archived, you can use analytics data over time to show that it hasn’t been touched in months or years. That adds clarity to an otherwise murky storage decision. Conversely, the same analytics data can show that a file, even though it is old, belongs to a data set that still gets used regularly.
Takeaways about storage capacity management
As with any engineering task, it is up to you and your team to determine which approach works best for your environment. If you are experiencing pain around your storage capacity, here are a few top-level things to think about:
- Don’t be afraid of new-ish vendors. Newer entrants to the market will probably have more modern ways of dealing with capacity analysis than older, more established vendors.
- Look for storage optimizations. Everyone scans, so look for a storage system with optimizations such as metadata caching, clever methods of pruning the search, and local scanning.
- Look for an API. If you value tight workflow integration, be sure you have programmatic access to the scanned data, somehow. An API is best, even if it can only query a database hosted on the storage system. You might want to integrate capacity data into your production management system or your media asset manager, and you want that analytics data to be easy to consume and manipulate.
- Use quotas or volumes. Use quotas or volumes to manage user behavior and to keep users from filling up your storage with their data. For example, Qumulo has directory-based quotas that can be applied in real time.
Mike is a Systems Engineer with over 15 years of experience in shared high performance mass storage systems primarily for TV/film, internet media delivery, and supercomputing applications. His specializations include shared filesystems, clustered filesystems, NAS, SAN, and RAID storage.