Many of you have seen a universal pain scale chart in a hospital or an emergency room. This graphic helps medical professionals assess the severity of their patients’ pain so they can guide their treatment and allocate resources appropriately.
In a couple of my previous jobs in professional services for some very large storage environments here in Southern California, I took to using a chart like this to triage the problems I identified when I was evaluating an environment. In terms of storage architecture, engineering and administration, pain is the qualitative measurement of impact to productivity for users.
For example, there might be a simple problem with an easy workaround that could be rated as a two, or something that “can be ignored.” But there might be a performance problem that rears its ugly head whenever the compute farm is running at full capacity, say when users are doing high performance workloads like legal discovery or genomic analysis. In this case, the pain is interfering with our ability to focus and get a project finished, putting the pain scale at a six (“interferes with concentration”).
Or, you might have a lingering bug in your storage system, causing it to go down randomly for an hour at a time. That stops data storage altogether and that gets pretty painful – that’s almost an eight (“interferes with basic needs”). Then there’s complete data loss, which is a 10 (complete bedrest required!).
The point is that there’s a lot of pain in your storage environment whether you’re an editor, a colorist, architect, artist, administrator, or engineer. Your workloads are growing as businesses try to squeeze more productivity and dollars from less gear, less talent, and less time. And your client demands are constantly increasing, whether they’re about project resolution, color depth, frame rate, shot complexity, or even turnaround time.
Here we’re going to be talking about how to stay pain free in the storage landscape. We’re going to explore some of the common storage pain points and discuss solutions for those. We’ll also discuss different tools and solutions that will address specific issues. The goal is to give you a new way of thinking so that you can resolve a particular source of pain.
At Qumulo, we’ve talked to a lot of businesses and users about their storage pains. We’re very data-driven, and interviewing users has allowed us to discover what works, what doesn’t, and what needs to change. Our conversations have shown us that the most common sources of storage pain include the following:
1. Capacity pain (storage isn’t big enough)
2. Performance pain (storage isn’t fast enough)
3. Budget pain (storage is always too expensive)
4. Scaling pain (performance or capacity can’t grow effectively)
5. Legacy software pain (outdated systems impact user performance)
6. Data blindness (not knowing how your data is being used or what’s going on in your storage repositories)
7. Availability pain (storage lacks resiliency and goes down occasionally, impacting productivity)
8. Data loss pain (the worst case scenario)
For each of these sources, we’ll talk about how and why they manifest, what kind of pain they cause and how storage admins might deal with these pain points.
Capacity pain
The oldest complaint of all time is, “we need more space!” This goes back to when we stored things in granaries – you always need more space for the things you want to store.
As an admin, have you ever had to deal with a completely full file system? Or as a user, have you ever had to stop what you’re doing and clean up your files, or wait for administrators to give you more space? Full file systems are a reality: sometimes it’s a user mistake, sometimes it’s an engineering mistake, and sometimes it just happens over the course of normal work.
No one likes throwing things away, and no one knows the fine-grained value of their data like users do. As a result, admins often can’t safely clean things up on the users’ behalf. Unfortunately, a cleanup is usually the first step necessary to resume production.
The first thing to figure out is where the issue is, and that means analyzing the directory structure. The usual tools walk the tree, stat everything they find, add up the capacity, and present you with an answer. This works great if your file system only has 10,000 files in it, but if you have 100 million files, or a billion, it’s going to be a pain. One hundred million files can take up to a day to visit and come back with an answer, and you might have to rinse and repeat that process as you descend into the file system on your hunt.
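To make the tree walk concrete, here is a minimal Python sketch of that brute-force approach. The function names are mine, chosen for illustration, and the sketch sums apparent file sizes (st_size) rather than allocated blocks. It has to touch every file, which is exactly why it gets slow at large file counts.

```python
import os
from collections import defaultdict


def usage_by_top_level_dir(root):
    """Walk root, stat every file, and total the bytes per top-level entry."""
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        # Attribute each file to the first path component below root.
        rel = os.path.relpath(dirpath, root)
        top = "." if rel == "." else rel.split(os.sep)[0]
        for name in filenames:
            try:
                # lstat does not follow symlinks, so linked data is not counted twice.
                totals[top] += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # Files can vanish or be unreadable mid-scan; skip them.
                continue
    return dict(totals)


def largest_first(totals):
    """Sort (directory, bytes) pairs so the biggest consumer comes first."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Calling largest_first(usage_by_top_level_dir("/some/path")) tells you which directory to descend into next, at the cost of another full walk each time you go a level deeper.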
Some quick tips to address capacity pain:
- New entrants to the market are going to have more modern ways of dealing with capacity analysis than older ones – so don’t fear new vendors.
- All vendors provide scans at some level – so look for optimization.
- Look for API access to the metadata – if you value glue, or tight workflow integration, make sure you have programmatic access to that scan data somehow. That lets you integrate capacity data with your production management system, your media asset manager, your network monitoring system, and so on. You want that analytics data to be easy to consume and manipulate.
- Use quotas or volumes to keep user behavior in check, for example users who might be filling up your storage with their giant personal movie collections.
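As a rough illustration of what programmatic access to capacity data buys you, the sketch below checks per-user usage against quotas. It assumes you have already pulled usage into a plain dictionary; the function name, the dictionary layout, and the 90 percent warning threshold are all hypothetical, not any vendor’s API.

```python
def quota_report(usage_bytes, quota_bytes, warn_fraction=0.9):
    """Classify each user as 'ok', 'warn', or 'over' against a byte quota.

    Both arguments are dicts keyed by user name (a hypothetical layout).
    Users without a quota entry are left out of the report.
    """
    report = {}
    for user, used in usage_bytes.items():
        limit = quota_bytes.get(user)
        if limit is None:
            continue
        if used > limit:
            report[user] = "over"
        elif used >= warn_fraction * limit:
            report[user] = "warn"
        else:
            report[user] = "ok"
    return report
```

A report like this is easy to hand to a monitoring or production management system, which is the kind of glue the tips above are getting at.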
Performance pain
Performance can be a pretty nebulous term, but when storage people talk about it, it’s generally in terms of throughput, IOPS, or latency. You have to strike a balance between users, who will be very sensitive to latency, and render farms, which are focused on throughput. Here, we find that NAS has really started to catch up: faster hardware, flash, better data layout techniques, and better protocol approaches are all helping NAS chip away at the bandwidth requirements that used to call for a SAN. I think you’re going to see more and more businesses looking to go with the simplicity of NAS versus the complexity of SAN.
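The three terms are related, and a little arithmetic shows why one number never tells the whole story. The sketch below uses two standard relationships: throughput is IOPS multiplied by I/O size, and (by Little’s law) the IOPS a client can reach is bounded by its outstanding I/Os divided by latency. The function names and units are my own choices for illustration.

```python
def throughput_mb_per_s(iops, io_size_kib):
    """Throughput in MB/s (10^6 bytes) for a given IOPS rate and I/O size in KiB."""
    return iops * io_size_kib * 1024 / 1_000_000


def max_iops(queue_depth, latency_ms):
    """Little's law bound: I/Os per second with queue_depth outstanding I/Os."""
    return queue_depth * 1000.0 / latency_ms
```

For example, 10,000 IOPS of 4 KiB requests is only about 41 MB/s, while 1,000 IOPS of 1 MiB requests is about 1,049 MB/s. A single-threaded client waiting 0.5 ms per request tops out at 2,000 IOPS no matter how fast the storage behind it is.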
Other remedies for performance pain:
- Try to tackle potential performance issues in advance. When possible, have a good understanding of your expected workflows before chopping up the infrastructure.
- Make sure that you’ve chosen a system appropriately sized for your workload – you can save a little money with a scalable system by only purchasing the storage you need right now. When possible, try to calculate whether you’re likely to need additional headroom and when you’re going to need it.
- If you have a very heavy workload outlier, like a single high-speed workstation, see if you can solve it with a point solution. One workstation, for example, shouldn’t be the driver for you to go and purchase a massive amount of storage. It’s simply going to be wasted for the majority of your workloads.
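The headroom calculation mentioned above can be sketched in a few lines. This assumes steady compound monthly growth, which real workloads rarely follow exactly, so treat the result as a planning estimate. The function name and parameters are illustrative.

```python
import math


def months_until_full(used_tb, capacity_tb, monthly_growth_rate):
    """Months until usage reaches capacity under compound monthly growth.

    Returns 0.0 if already full and math.inf if usage is not growing.
    """
    if used_tb >= capacity_tb:
        return 0.0
    if monthly_growth_rate <= 0 or used_tb <= 0:
        return math.inf
    # used * (1 + r)^n = capacity  =>  n = log(capacity / used) / log(1 + r)
    return math.log(capacity_tb / used_tb) / math.log(1.0 + monthly_growth_rate)
```

A system that is half full and growing 5 percent a month has roughly 14 months of headroom, which gives you a date to plan the next purchase around.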
Budget pain
As we all know, money isn’t infinite and storage isn’t free, and even free software needs (not-free) hardware and engineers to run it. Storage capacity has a clear cost to it, and that cost is always going to be perceived as being too expensive, always. Too often I find that folks get hung up on dollars for capacity, and ignore other things like dollars for throughput, or dollars for IOPS.
- Use the right storage technology for your workflow. Using flash when you don’t need it is wasting money. Using disk when you need flash simply won’t work.
- You will also want to engage integrators or VARs – they spend a lot of their time talking to vendors and understanding the marketplace, and can add value when evaluating systems. Don’t stand for VARs that don’t add value!
- You’re going to enter a buying process based on your previous storage experience, but things are changing ridiculously quickly – what you thought six to 12 months ago is likely not true today. Do the research when planning a new storage buy or a facilities buildout.
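To avoid getting hung up on dollars for capacity alone, it can help to put all three cost ratios side by side. The sketch below is one way to do that; the class, its fields, and the idea of picking a winner per metric are my own illustration, and the figures you feed it should come from real quotes and measured performance.

```python
from dataclasses import dataclass


@dataclass
class StorageOption:
    """One candidate system; all fields are inputs you supply."""
    name: str
    price_usd: float
    usable_tb: float
    throughput_gb_s: float
    iops: float

    def cost_metrics(self):
        """Dollars per unit of capacity, throughput, and IOPS."""
        return {
            "usd_per_tb": self.price_usd / self.usable_tb,
            "usd_per_gb_s": self.price_usd / self.throughput_gb_s,
            "usd_per_iops": self.price_usd / self.iops,
        }


def cheapest_by(options, metric):
    """Name of the option with the lowest value for the given cost metric."""
    return min(options, key=lambda o: o.cost_metrics()[metric]).name
```

The option that wins on dollars per terabyte is often not the one that wins on dollars per IOPS, which is the whole point of matching the technology to the workflow.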
Scaling pain
If your business is growing, your workloads are probably also growing. Scaling storage is interesting: you have to balance a lot of things against your workloads, and getting that balance wrong can make the system pretty unusable. Most systems aren’t very easy to expand and a lot are really hard to make big in the first place. Over the past decade we’ve made a lot of progress with storage systems.
If you have unpredictable workloads, look for ease of scale as a key value.
Legacy software pain
Contrary to what they might tell you, large established storage vendors are no longer risk-free. You want to value customer support very highly – if you’ve got tight deadlines or large data sets, look at the vendor’s track record of helping customers address problems.
- Don’t be afraid to investigate software development. Talk to existing customers about how accurate the engineering roadmap has been.
- Measure your predicted needs against the roadmap. You’re going to be putting your crown jewels on the storage system you purchase and the bigger the system is, the longer it’s going to be there. Your chosen vendor(s) need to be moving in the same direction you are.
Data blindness
A lot of storage systems aren’t great at managing systems. Storage is kind of dumb, or mute, for want of a better word; most storage doesn’t tell you about the data inside it or what clients are doing to it right now. It’s possible to get answers to those questions by other means – but all of those introduce complexity.
- Storage should be able to answer questions for you: What is consuming all of this throughput? Where the heck did my capacity go on Sunday? What is eating up my capacity right now? What do I need to back up? What can I safely archive? When am I going to need more storage?
- Most higher-end storage providers offer visibility tools. In any case, storage is the thing best positioned in your data center to tell you things about itself and the things that access it.
- Research visualization tools – do they answer the questions you’ve had recently when evaluating your incumbent solution?
- If you value integration with your management system, you should demand API access. If your storage vendor isn’t providing access to it, they should.
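Once you do have API access, answering a question like “what is consuming all of this throughput?” is mostly aggregation. The sketch below assumes you can get samples as (client, bytes transferred) pairs; that sample format and the function name are hypothetical, and a real system will expose its own schema.

```python
from collections import Counter


def top_consumers(samples, n=3):
    """Return the n clients that moved the most bytes, heaviest first.

    samples is an iterable of (client, bytes_transferred) pairs.
    """
    totals = Counter()
    for client, nbytes in samples:
        totals[client] += nbytes
    return totals.most_common(n)
```

The same pattern works for capacity by directory or by owner; the hard part is getting the samples out of the storage in the first place.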
Availability pain
When data isn’t available, work is stopped. A work stoppage has a cost, and that cost is higher if you have a large team of creative or technical people who get blocked by unavailable storage.
With a monolithic system, availability can be dicey. Often you’re going to need to buy two systems and then add a software layer that can move a workload between them in the event of a failure. You’re going to want to look carefully at the expense of adding redundancy to a monolithic system and make sure that insurance policy is worth it.
If your cost of downtime is very high, it might be worth buying two systems just for the sake of redundancy. But if your cost of downtime is flexible or low, it might not be worth that added system – so think it through when you’re considering a backup system.
Another option, a compromise of sorts, is to look for a solution with a higher-end service contract, with bigger SLAs. For a scale-out system, a single node failure doesn’t take the whole system down, so you have some protection inherent in the architecture – but you could still have a cluster go down due to a network problem. In either case, you’re going to have a backup for recovery, and business continuity. There’s always inherently a high value in buying two.
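One way to think through whether that insurance policy is worth it is to compare the expected yearly cost of downtime with the yearly cost of the second system. The sketch below is a deliberately simple model with made-up parameter names. It assumes outages are independent and that you can put a dollar figure on an hour of blocked work.

```python
def expected_downtime_cost(outages_per_year, hours_per_outage, cost_per_hour):
    """Expected yearly cost of work stoppage from storage outages."""
    return outages_per_year * hours_per_outage * cost_per_hour


def redundancy_pays_off(outages_per_year, hours_per_outage, cost_per_hour,
                        second_system_annual_cost, residual_fraction=0.0):
    """True if the downtime cost avoided exceeds the yearly cost of redundancy.

    residual_fraction is the share of downtime that remains even with a
    second system (for example, failover time), an assumption you supply.
    """
    baseline = expected_downtime_cost(outages_per_year, hours_per_outage, cost_per_hour)
    avoided = baseline * (1.0 - residual_fraction)
    return avoided > second_system_annual_cost
```

With four one-hour outages a year at $50,000 per hour, a second system costing $100,000 a year pays for itself. At $5,000 per hour it does not.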
Data loss pain
It hurts me to say it. I cringe even to say the words ‘data loss.’ To state the obvious, data protection is very important. In this industry, the data is the actual thing we work on and modify – so lost data is lost time, lost money, lost jobs. You’re going to want to look at systems that protect your data really well.
- Rebuild performance should go up as drive population goes up. If it goes down, it’s moving in the wrong direction. There needs to be some sort of parallel rebuild system.
- Stay on the minimum protection level possible – don’t jump in at a high level with the goal of wanting to protect everything; it’s going to cost you and increase the cost of small random writes.
- There are some object and scale-out systems out there that do per-file data protection. Avoid that if you can. If you have a small file count, it might not be a big deal, but as file counts grow, that strategy won’t pan out.
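Two of those points lend themselves to back-of-the-envelope math. The sketch below uses an idealized model in which rebuild work divides evenly across the participating drives, plus the classic read-modify-write count for small writes under parity protection (four I/Os for single parity, six for double parity). The function names and the even-split assumption are mine; real systems will differ.

```python
def rebuild_hours(drive_tb, per_drive_rebuild_mb_s, participating_drives):
    """Idealized hours to re-protect one failed drive's worth of data.

    Assumes the work is spread evenly over participating_drives, each
    contributing per_drive_rebuild_mb_s of rebuild bandwidth.
    """
    total_mb = drive_tb * 1_000_000  # decimal TB to MB
    aggregate_mb_s = per_drive_rebuild_mb_s * participating_drives
    return total_mb / aggregate_mb_s / 3600.0


def small_write_ios(parity_count):
    """Back-end I/Os for one small in-place write using read-modify-write.

    Read old data and old parity blocks, then write new data and new parity.
    """
    return 2 * (1 + parity_count)
```

Under this model a 10 TB drive rebuilt at 50 MB/s by a single drive takes about 55.6 hours, while spreading the same work across 20 drives takes about 2.8 hours. Each extra parity block adds two more I/Os to every small random write.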
Learn more about reducing storage pains
If you’d like more info about these pain points – check out this one-hour webinar from Qumulo that provides a deeper dive into the common issues and ways to mitigate data headaches.
We’re here if you have questions, so let us know if you’d like to meet up and learn more.
Mike is a Systems Engineer with over 15 years of experience in shared high performance mass storage systems primarily for TV/film, internet media delivery, and supercomputing applications. His specializations include shared filesystems, clustered filesystems, NAS, SAN, and RAID storage.