The Most Common Storage Pains of Large-Scale Environments (And How to Resolve Them)

In this three-part blog series, I discuss the ten most common file data management pains of large-scale storage environments. But first, a little context.

The universal pain scale for very large file storage environments

Many of you have seen a universal pain scale in a hospital or an emergency room, where your doctor asks you, “On a scale of 1 to 10, how badly does it hurt?” How you answer helps medical professionals assess the severity of your pain, so they can prescribe treatment and allocate resources appropriately.

I’ve found a scale like this also helps triage the problems identified when evaluating very large file storage environments. When looking at storage architecture, engineering, and administration, pain is the qualitative measure of the impact on users’ productivity.

For example, there might be a simple problem with an easy workaround that rates a 2 (“can be ignored”). But there also might be a performance problem that rears its ugly head whenever the compute farm is running high-performance workloads like physics simulation, legal discovery, or genomic analysis. In that case, the pain interferes with a user’s ability to focus and finish a project, putting it at a 6 (“interferes with concentration”).

Or, you might have a lingering bug in your storage system that causes it to go down randomly for an hour at a time. That stops access to storage altogether and gets pretty painful – almost an 8 (“interferes with basic needs”). Then there’s complete data loss, which is a 10 (“bed rest required!”).

The point is, there can be a lot of pain in your storage environment whether you’re an editor, scientist, cloud architect, artist, storage administrator, or engineer. Your workloads are growing as the organization tries to squeeze more productivity and dollars from less gear, less talent, and less time. And the business demands are constantly increasing, whether they’re about project size, total performance, or even turnaround time.

Exploring the 10 most common storage pains, a few at a time

At Qumulo, we talk to a lot of enterprises, organizations, and users about their data storage pains. We’re very data-driven and interviewing business leaders and users helps us discover what works, what doesn’t, and what needs to change. Our conversations have shown us that the most common sources of storage pain include the following ten.

  1. Capacity pain (storage isn’t big enough)
  2. Performance pain (storage isn’t fast enough)
  3. Scaling pain (performance or capacity can’t grow effectively, both on-prem and in the cloud!)
  4. Legacy software pain (outdated systems impact user performance)
  5. Availability pain (storage lacks resiliency and goes down occasionally, impacting productivity)
  6. Budget pain (storage is always too expensive)
  7. Data blindness (not knowing how your data is being used or what’s going on in your storage repositories)
  8. Data loss pain (the worst case scenario)
  9. Data locality pain
  10. Data migration pain

Dealing with storage capacity, performance, and scaling pains

For each source of pain above, I’ll discuss why it manifests, the kind of pain it causes, and how storage admins can resolve it.

1. Storage capacity pain–storage isn’t big enough

The oldest storage complaint is “we need more space!” This goes all the way back to ancient times when we stored food in granaries – you always need more space for the important things you want to store.

As an admin, have you ever had to deal with a completely full file system? Or as a user, have you ever had to stop what you’re doing and clean up your files, or wait for administrators to give you more space? Full file systems are a reality: sometimes it’s a user mistake, sometimes it’s an engineering mistake, and sometimes it just happens over the course of normal work.

No one likes throwing things away, and no one knows the fine-grained value of their data like the users themselves. As a result, admins often can’t safely clean things up on users’ behalf. Unfortunately, that cleanup is usually the first step needed to resume production.

The first thing to figure out is where in the tree the capacity is going, by analyzing the directory structure. There are some common tools that do this: du on a Linux box, Get Info on a Mac, or right-click > Properties on a Windows box. All of these tools walk the tree, stat everything they find, add up the capacity, and finally present you with an answer. That works great if your file system only has 10,000 files in it, but if you have hundreds of millions, or even billions, of files, it’s going to be a pain. One hundred million files can take up to a day to visit before you get an answer, and you might have to rinse and repeat that process as you descend into the file system on your hunt.
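
To see why those tools struggle at scale, here’s a minimal sketch (in Python, with a hypothetical mount point) of what a du-style walk does under the hood – it has to stat every single file before it can give you an answer:

```python
# Minimal sketch of what du-style tools do: walk the tree, stat every entry,
# and sum the sizes. The cost grows with the number of files, which is why
# hundreds of millions of files can take the better part of a day.
import os

def tree_size(path: str) -> int:
    """Return total bytes under `path` by stat-ing every file (slow at scale)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # file vanished or permission denied; skip it
    return total

if __name__ == "__main__":
    root = "/mnt/storage/projects"  # hypothetical mount point
    for entry in sorted(os.listdir(root)):
        full = os.path.join(root, entry)
        if os.path.isdir(full):
            print(f"{entry}: {tree_size(full) / 1024**3:.1f} GiB")
```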

Some thoughts on addressing storage capacity pain:

  • Look for modern ways of analyzing capacity. Traditional tools have to scan, leading to unnecessary IO operations and long wait times for answers.
  • Make sure the storage system you’re considering has programmatic access to capacity metadata, preferably via an API. You might integrate that capacity data with your production management system, your media asset manager, your network monitoring system, and so on. You want that capacity data to be easy to consume and manipulate (see the sketch after this list).
  • Use quotas or volumes to help control user behavior such as filling up your storage with endless copies of their working data or their giant personal movie collections.
  • Look for systems that can scale capacity transparently and easily (more on that in a bit!).
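
As an illustration of the API point above, here’s a hedged sketch of pulling capacity metadata programmatically. The endpoint, fields, and token are placeholders rather than any particular vendor’s API – check your storage system’s documentation for the real calls:

```python
# Hypothetical sketch: ask the storage system for aggregated capacity under a
# path instead of walking the tree yourself. The URL, parameters, and response
# fields below are assumptions for illustration only.
import requests

API_BASE = "https://storage.example.com/api/v1"  # hypothetical endpoint
TOKEN = "REPLACE_WITH_API_TOKEN"

def capacity_by_path(path: str) -> dict:
    """Fetch capacity metadata for `path` from a (hypothetical) REST API."""
    resp = requests.get(
        f"{API_BASE}/capacity",
        params={"path": path},
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # e.g. {"path": "/projects", "bytes_used": 123456789}

# Feed the result into monitoring, a media asset manager, or an alerting job:
usage = capacity_by_path("/projects")
print(f"{usage['path']}: {usage['bytes_used'] / 1024**4:.2f} TiB used")
```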

“Storage is critical to our business, which is basically a fire hose of data. We could not do our work without a high-performance, high-density, scalable solution of some kind.”
Nathan Conwell, Senior Platform Engineer, Vexcel Imaging

2. Storage performance pain–storage isn’t fast enough

Performance can be a pretty nebulous term, but when storage people talk about it, it’s generally in terms of throughput, IOPS, or latency from a single system or a population of systems. You have to strike a balance between users, who tend to be very sensitive to latency, and compute farms, which are usually focused on throughput so they can keep filling memory with things to compute against.
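
To make those terms concrete, here’s a back-of-the-envelope sketch – the numbers are illustrative assumptions, not benchmarks – showing how throughput, IOPS, I/O size, and latency relate:

```python
# Rough relationships between the three numbers storage people quote:
# throughput ~= IOPS x I/O size, and latency caps the IOPS a single
# synchronous client (one outstanding I/O at a time) can achieve.
iops = 50_000                 # assumed aggregate I/O operations per second
io_size_bytes = 128 * 1024    # assumed 128 KiB average I/O size
latency_s = 0.002             # assumed 2 ms average latency

throughput_mib_s = iops * io_size_bytes / 1024**2
single_client_iops = 1 / latency_s

print(f"Aggregate throughput: {throughput_mib_s:.0f} MiB/s")
print(f"One synchronous client tops out near {single_client_iops:.0f} IOPS")
```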

Ultra-high performance used to be the sole domain of shared SAN and parallel high-speed file systems. Today, we find that NAS has really started catching up. We have faster hardware, flash storage, better data layout techniques, and better protocol approaches – all of those things are helping NAS chip away at workloads that once required a SAN for bandwidth. I think you’re going to see more and more stakeholders preferring the simplicity of NAS to the complexity of SAN.

Other remedies for storage performance pain:

  • Try to tackle potential performance issues in advance. When possible, have a good understanding of your expected workflows before chopping up the infrastructure.
  • Make sure you’ve chosen a system appropriately sized for your workloads – you can save money with a scalable system by only purchasing the storage you need right now. When possible, try to estimate whether you’ll need additional headroom and when you’re going to need it (see the sizing sketch after this list).
  • Beyond the above, you may need a system you can spin up and spin down. If you plan to go days or weeks between projects, and you don’t need the storage system for anything else, it might make sense to consider an on-demand, public cloud working model leveraging remote access, rather than an on-premises installation.
  • The need to support a remote workforce, driven by the global pandemic, is another factor. The media and entertainment industry has been experiencing extreme demand; to meet production deadlines and enable creative teams to collaborate virtually, many studios looked to the cloud for remote video editing on virtual workstations in a post-production environment.
  • If you have a very heavy workload outlier, like a single high-speed workstation, see if you can solve it with a point solution. One workstation shouldn’t be the driver for you to go and purchase a massive amount of high-speed storage. It’s simply going to be wasted on the majority of your workloads.
  • On the other hand, if you have a lot of workloads to consolidate, consider the storage efficiency gain of combining low performance and high-performance workloads in the same system. You get the benefits of storage efficiency from a larger system without negatively impacting either workload.
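
On the sizing and headroom question above, here’s a rough sketch for estimating how long you have before an expansion is needed. The current usage, growth rate, and utilization threshold are assumptions you’d replace with measured data:

```python
# Estimate months of headroom before hitting a utilization threshold,
# assuming compound monthly growth. All inputs are illustrative assumptions.
import math

current_tib = 400           # assumed current usage
usable_tib = 600            # assumed usable capacity of the system
monthly_growth_rate = 0.04  # assumed 4% growth per month
threshold = 0.85            # plan the expansion before 85% full

months = math.log((usable_tib * threshold) / current_tib) / math.log(1 + monthly_growth_rate)
print(f"~{months:.1f} months of headroom before reaching {threshold:.0%} utilization")
```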

3. Storage scaling pain–performance or capacity can’t grow effectively, both on premises and in the public cloud

Scaling data storage is interesting. If your business is growing, your workloads are probably growing too, and that means balancing a lot of considerations when thinking about adding capacity or performance.

While the last decade has seen improvements in scaling storage file systems, most aren’t very easy to expand and many are really hard to make big in the first place. Let’s look at some specific recommendations:

  • If you have unpredictable workloads, look for a file system that is easy to scale so you can take on expanding workloads or new workloads with confidence.
  • Understand your workloads. Know the true infrastructure cost of your workflows and processes (i.e., capacity, performance, and connectivity requirements). When the business comes to you with an expansion requirement, you’ll be able to confidently size your infrastructure expansion to accommodate it.
  • Consider whether it makes sense for some of your workloads to run in the cloud. If your file system supports a hybrid cloud strategy, you can take advantage of the performance and capacity of the cloud to burst workloads when needed.

“Our team has been able to sustain burst scaling at a rate of 1.3 million IOPS for upwards of 5 hours at a time, with peaks as high as 2 million IOPS. This is a level unheard of in the past, and it highlights how much Qumulo has helped us to condense our production timelines when required and allow artists to have more iterations in less time, overall resulting in higher-quality final work.”
Jeremy Brousseau, Head of IT, Cinesite Vancouver

Coming next: The Legacy Software, Availability, and Budget Pains

In the next article, we’ll explore three more of the 10 common storage pains of very large file storage environments: the impact of outdated systems on users’ performance, the impact of poor availability on their productivity, and the cost of expanding storage.

Qumulo’s modern file data management and storage software was purpose-built to support hybrid cloud strategies for high-performance workloads at massive scale.
