A Deep Dive into the Design of Flexible Clusters

This is part two of a five-part blog series that will take a closer look at our new suite of data services that will help our customers to radically simplify file data management at scale. We talked about NVMe Cached Performance in our previous post. Here, we provide an overview of Dynamic Scale. Future blogs in this series will go more deeply into the other new data services included in this announcement.

When designing Qumulo’s file data platform, we continuously take into consideration the problems our customers are trying to solve. Our customers are our magnetic field, and they drive our product development and innovation.

To help them in their missions, we rapidly release new software features and platforms that improve both costs and performance for our customers. In fact, we recently announced a new suite of data services to help radically simplify the way our customers manage massive amounts of file data.

Our goal is to make it simple and cost-effective for enterprise and government organizations to scale, secure, and achieve high performance in their complex unstructured data environments.

Innovation without limitation

To ensure our customers are able to take advantage of improvements in hardware technology without data migration, Qumulo is continually improving our file data platform so that customers can combine nodes of different types in a single cluster. We believe you should not be held back from having access to the latest technology.

We recently introduced Qumulo Dynamic Scale, which enables admins to leverage newly qualified hardware platforms with the latest processors, memory and storage devices without the need for forklift upgrades, data migrations, or complex storage pool management. Qumulo users now can add new qualified platforms into existing environments with no need to manage different storage pools or perform a data migration. The new platforms are simply added to the existing environment, data is automatically redistributed and the increased performance and capacity made available to users and applications automatically.

So what did Qumulo’s engineering team do to enable nodes of different types in a single cluster? It’s something we call “node compatibility.”

Our first step in this work was to support clustering nodes with similar amounts of storage, so customers using HPE Apollo servers could add HPE Gen10 nodes to their HPE Gen9 clusters.

Node Compatibility: Understanding the data protection architecture

From day one, Qumulo was focused on architecting and developing a file data platform that would provide our customers with the latest technology, without creating artificial limitations on the file data platform’s functionality. Qumulo’s file data platform is designed so the file system layer doesn’t need to know anything about the underlying dimensions of the hardware – all of that action takes place lower down the stack.

To make the different node types work together, the Qumulo engineering team needed to work on the components responsible for laying out internal system objects (not to be confused with end-user object storage) across the storage resources in the cluster. The layouts are optimized to protect against data loss, and ensure data availability. This is important so that in the event a disk fails, or a node goes down, the customer can continue using the cluster and continue to access their data.

The work focused on two key components: the restriper, which lays out the protected stores (or “PStores”) that store the file system data and metadata (part of the protection system on the diagram above); and the DKV, which provides a fast Distributed Key-Value store for use by the protection system and the transaction system.

The restriper

The restriper is responsible for making sure that data is safe after a component failure. It makes decisions about which disks the PStores live on in the system. A PStore consists of a number of Block Stores (“BStores”), each of which live on a specific disk. Data is encoded on the BStores using a scheme called Erasure Coding, so that the system can tolerate a certain number of drive and/or node failures, while providing as much usable space as possible.

In the event of a lost drive, the restriper is the component that deals with rebuilding the data that was on the lost drive. To that end, Qumulo keeps some sparing space on all of the other drives in the system for this purpose.


In the image above, we see a cluster with 8 x 8TB HDDs, each of which has a spare 1TB of space kept to one side. Should one drive fail, we can reconstruct the now-missing 7TB of data using the sparing space; reprotect times are minimized by having the sparing space evenly distributed in the file system.


The DKV stores key-value pair data that must be globally available, as well as writable and readable from any node. The values it stores are essentially a series of bytes, with the maximum size being just under 4 KiB. The main use of the DKV is to store PStore metadata, including which of the disks the BStores of each PStore live on.

The DKV key space is divided into a number of equally sized partitions that we call shards. Each shard is mirrored onto multiple KV volumes. When the Qumulo file system writes to a DKV key, it writes the data to all mirrors before the API call returns. That way when the system reads from a DKV key, it can read from an arbitrary mirror.

The DKV key space is divided into a number of equally sized partitions that we call shards. Each shard is mirrored onto multiple KV volumes, which are the physical artifacts that live on the SSDs in the cluster.


A quorum in Qumulo is the set of physical resources that the cluster is currently able to use – it’s the set of online nodes and disks. When that set of resources changes – say a disk needs to be replaced or a node goes offline – the current quorum is quickly abandoned, a number of tasks that are running in the system are stopped, and we then bring the system back online into a new quorum with the new set of resources. The Qumulo file data platform recognizes the changes in resources, and runs various recovery operations to make sure that the data on the cluster remains available, while still being protected against future resource failures.

Recovering the DKV after a quorum change is really simple. For each shard, we choose an arbitrary available mirror as our source, and make new, identical copies of that mirror for use in the new quorum. So, even if an in-flight write had only written to a subset of the mirrors when the previous quorum ended, the new set of mirrors will be consistent because they were copied from the same source mirror.

Qumulo only uses SSDs that support atomic 4 KiB writes (even in the face of unexpected power loss), so we can also guarantee that for any in-flight writes, we will strictly see either the old value or the new value – there won’t be any partially written blocks. The SSDs we use also don’t ever drop acknowledged writes (again, even in the face of power loss), so we can guarantee that we will see the new value for any writes that got recorded as completed.

Achieving node compatibility for the restriper

Now that we understand the low-lying data protection architecture, the challenge for the Qumulo project team was to get this architecture to work across different systems that a customer may include in a single cluster. Each node needs to find out the “geometry” of the cluster by having one node ask every other node to tell it how many disks it has, how big they are, etc.

Remember that spare space that we keep around on every drive? We need that space to be evenly distributed, so that the rebuilding of data from a lost drive can go as quickly as possible, without overloading individual drives in the system. The greedy algorithm in the restriper was written to be greedy in terms of the used space on the disks. So if the disks were different sizes, it would aim to place the same number of BStores on each disk.

But that’s clearly not what we want. This would cause reprotect to have to write more of the reconstructed data to the bigger disks in the system, because they have more of the sparing space. That’s either going to slow reprotect down, or we’d have to make a larger impact on the customer workload to keep the time the same.

node compatibility

Fixing this was quite straightforward – we just reworked the algorithm to be greedy in terms of free space. This approach has limitations, and only works for similar-sized nodes, but it was a quick incremental change that allowed us to deliver benefit to some customers quickly.

Achieving node compatibility for the DKV

As discussed, the main use of the DKV is to store data about where the PStores are located.

Historically, Qumulo’s engineering team chose to reserve a small number of keys at the beginning of the key space for other future use cases, but everything after a certain key is a PStore. When we first designed the DKV, we wanted to make the number of DKV shards a trivial function of cluster size, so we decided to make each shard big enough to store a set number of nodes’ worth of PStores. New shards get added to the DKV as nodes get added to the cluster, based on the number of nodes.

If we add different sized nodes to a cluster, this calculation won’t work the same way. When the new nodes are larger than the existing nodes, we’ll need to add DKV shards more quickly. But with only a single KV volume per SSD, this is not always possible, while retaining our ability to recover from disk and node losses.

In order for the DKV to be flexible about when we add more shards to account for different size storage nodes, we also needed to provision additional KV volumes on the SSDs in the cluster. Because the volumes themselves are small (less than 100 MB), we chose to use a constant number of volumes per SSD, even though this would now overprovision the space by a relatively larger margin in some cases. This simplifies the layout code, at the expense of consuming a small amount of SSD space.

Adding new KV volumes at DKV recovery time is not an option, so instead we provided an API which the restriper can call to make more space, after the system is up and running. This size negotiation happens when the restriper is getting ready to create PStores – either just after initial cluster creation or the addition of new nodes. Once the protection system has figured out how many PStores the system should have, the restriper calls the negotiation API to ask for the space. If new shards are needed, additional KV volumes will be added if required, and a layout for the new shards will be found and added to the current DKV layout. Once the new shard has been laid out the new shard is available for use cluster-wide. At the end of a successful negotiation, the restriper resumes control and adds the new PStores, which delivers the usable capacity to the customer.

It is interesting that although what happens under the covers is complex, the interaction between the restriper and the DKV is very simple. And the layers are no longer enmeshed. The DKV doesn’t know anything about how many PStores a particular cluster should have. New clusters just bootstrap a single shard which can store a set number of keys – all hardware platforms use the same initial volume size, and then when the protection system knows how many PStores it wants, it just asks for the space.

The design of the DKV provides a tremendous amount of flexibility – flexibility Qumulo uses to address our customers’ challenges. Combined with our simple improvement to the restriper, we were able to open a path forward so customers with HPE Apollo Gen9 clusters could continue to expand their clusters when they could no longer purchase Gen9 nodes.

Moving forward, we will be generalizing Node Compatibility so that all of our customers can take advantage of new technology, as it continues to come on to the market, and customers’ data volumes continue to explode, and we are confident that our flexibly designed systems can be adapted to handle whatever is next.


To see these new data services in action, sign up for our Q-Connect exclusive virtual event Dec. 8-16! At Q-Connect, attendees will have the opportunity to connect live with our technical experts and Qumulo customers to see how our file data platform radically simplifies enterprise data management.

Contact us here if you’d like to set up a meeting or request a demo. And subscribe to our blog for more helpful best practices and resources!

P.S.! Qumulo’s engineering team is hiring and we have several open job openings – check them out here and learn more about life at Qumulo.

Share this post