Today, I’d like to talk about resilience to failure, which is a major theme of a recent patent Qumulo has been awarded.
In Qumulo’s distributed system, we have two resources of critical importance for reliably storing our users’ files. First is storage media. This may be SSD, HDD, or even cloud blob storage; for my purposes here, we’ll call any independently addressable and independently failing unit of storage a disk. Next, we need a network-attached server (compute), which we use to run our software, serve users, and address the disks. This might be a physical machine in a data center, or a virtual machine in the public cloud. Together, we call all the servers and storage available to our software the cluster.
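To ground the terminology, here is a minimal sketch of that resource model in Python. It is purely illustrative; the Disk, Server, and Cluster names are my own, not Qumulo’s code.

```python
# Toy model of the resources described above (illustrative only).
from dataclasses import dataclass, field

@dataclass
class Disk:
    disk_id: str           # any independently addressable, independently failing unit
    media: str             # "ssd", "hdd", or "blob"
    online: bool = True

@dataclass
class Server:
    server_id: str         # physical machine or cloud VM running the file system
    disks: list = field(default_factory=list)
    online: bool = True

@dataclass
class Cluster:
    servers: list = field(default_factory=list)

    def all_disks(self):
        return [d for s in self.servers for d in s.disks]
```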
All hardware fails, and all services can lose data. It is Qumulo’s responsibility to maintain full operability through any reasonable combination of failures, and to never lose customer data. For each cluster, we define a number of concurrent server and disk failures that we must tolerate, based on the customer’s needs. We then use a technique known as erasure coding to protect customers from both server and disk failures. I highly recommend reading our post on erasure coding if you have not, but I will summarize the most important points here.
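To make erasure coding concrete, here is a toy sketch of a 3+1 encoding using a single XOR parity block. This is not Qumulo’s actual encoding scheme, just an illustration of the core property: any one missing block can be rebuilt from the survivors.

```python
# Toy erasure coding: 3 data blocks + 1 XOR parity block (75% efficient).

def encode(data_blocks):
    """Given equal-length data blocks, return them plus one XOR parity block."""
    parity = bytes(len(data_blocks[0]))
    for block in data_blocks:
        parity = bytes(a ^ b for a, b in zip(parity, block))
    return data_blocks + [parity]

def reconstruct(blocks, missing_index):
    """Recover the block at missing_index by XOR-ing all surviving blocks."""
    recovered = bytes(len(next(b for b in blocks if b is not None)))
    for i, block in enumerate(blocks):
        if i != missing_index:
            recovered = bytes(a ^ b for a, b in zip(recovered, block))
    return recovered

stripe = encode([b"AAAA", b"BBBB", b"CCCC"])   # 3 data + 1 parity
lost = 1
damaged = [b if i != lost else None for i, b in enumerate(stripe)]
assert reconstruct(damaged, lost) == b"BBBB"
```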
A key concept in fault-tolerant, redundant file systems like Qumulo is “efficiency”: the fraction of the cluster’s raw capacity a customer can actually use for data, after discounting the space the file system needs for redundancy. On-prem, disks are physically attached to servers, so if a server is unavailable, so is the data on its disks. That limits the maximum efficiency of any encoding-based protection scheme. For example, a four-server cluster can encode data with at most 75% efficiency if the customer wants to keep read-only access to all data while one server is offline. If we also want to keep writing while that server is offline, we are limited to encodings with no more than 66% efficiency.
At moderate scales the same restriction holds: an eight-server cluster that tolerates two concurrent server failures has the same encoding-efficiency limits as the four-server cluster that tolerates one. You can extrapolate this argument proportionally to larger and larger cluster sizes.
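To make the efficiency arithmetic concrete, here is a back-of-the-envelope sketch covering both the four-server and eight-server cases. The encodings shown are illustrative, not Qumulo’s actual configurations.

```python
# Back-of-the-envelope efficiency arithmetic (illustrative only). Each stripe
# places one block per server, so surviving server losses means every stripe
# must tolerate that many missing blocks.

def efficiency(data_blocks: int, parity_blocks: int) -> float:
    return data_blocks / (data_blocks + parity_blocks)

# Four servers, read-only survival of 1 failure: a 3+1 stripe across all four.
print(efficiency(3, 1))   # 0.75 -> 75%
# Four servers, continued writes with 1 server down: new stripes must fit on
# the remaining 3 servers and still carry parity, so 2+1.
print(efficiency(2, 1))   # ~0.67 -> 66%

# Eight servers tolerating 2 failures: 6+2 gives the same 75% ceiling, and
# writes with two servers down fall back to 4+2 (~67%).
print(efficiency(6, 2), efficiency(4, 2))
```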
Our solution to this problem was to carefully model the mean time to data loss (MTTDL) under various combinations of server and disk failures, and then to allow only configurations whose MTTDL exceeds a very high threshold. On large clusters, encoding efficiency can reach 90%.
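For intuition about what that modeling looks like, here is a sketch using the textbook Markov-style MTTDL approximation. Qumulo’s actual model is more detailed (it mixes server and disk failures and rebuild behavior); the function name and the example numbers below are assumptions for illustration.

```python
# Textbook MTTDL approximation for a group of n units that tolerates f
# concurrent failures (illustrative, not Qumulo's production model).
import math

def mttdl_hours(n: int, f: int, mttf_hours: float, mttr_hours: float) -> float:
    """Approximate mean time to data loss: data is lost only if f+1 failures
    overlap before repairs complete."""
    numerator = mttf_hours ** (f + 1)
    denominator = math.prod(n - i for i in range(f + 1)) * mttr_hours ** f
    return numerator / denominator

# Example: 20 disks, tolerate 2 concurrent losses, 1.2M-hour MTTF, 24-hour rebuild.
hours = mttdl_hours(n=20, f=2, mttf_hours=1_200_000, mttr_hours=24)
print(f"{hours / (24 * 365):.1e} years")   # only configs above a threshold are allowed
```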
That solution is as close to optimal as possible on-premises, but we can do better in the cloud. In the cloud, Qumulo uses remote storage: Qumulo on Azure as a Service uses Azure Blob Storage for its storage needs. Since Qumulo’s storage is not physically inside the server, our service can apply all of the logic above to disks instead of servers. Because a typical cluster has proportionally many more disks than servers, our service can reach much higher efficiency at small scales. We could theoretically hit that 90% efficiency mark on our smallest-scale Azure cluster, and on clouds with highly durable storage media, like Azure, we may even hit 95% efficiency.
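As a rough illustration of why disk-level protection helps, the same efficiency arithmetic applied per disk allows much wider stripes even on a small cluster. The stripe widths below are illustrative, not Qumulo’s actual encodings.

```python
# The same efficiency arithmetic, applied per disk rather than per server
# (stripe widths are illustrative, not Qumulo's actual encodings).

def efficiency(data_blocks: int, parity_blocks: int) -> float:
    return data_blocks / (data_blocks + parity_blocks)

# A small cloud cluster can still spread a stripe across many blob-backed disks.
print(efficiency(18, 2))   # 0.90 -> 90%, still tolerates 2 disk losses
print(efficiency(19, 1))   # 0.95 -> 95%, viable when the media is highly durable
```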
Our engineering team recently delivered a product improvement that leverages this patented approach. “Floating disks” enables Qumulo as a Service to reassign remote storage to a different server when the server previously hosting it goes offline. This lets us survive any simultaneous failure of fewer than 50% of the servers, while remaining far more efficient at small scales. This is what it looks like to build a truly cloud-native storage service.
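To show the shape of the idea, here is a minimal sketch of disk reassignment, reusing the toy Disk/Server/Cluster model from the sketch earlier in this post. It is my illustration of the concept, not Qumulo’s implementation; the function name and the quorum check are assumptions.

```python
# "Floating disks" sketch: because disks are remote blob storage, a disk owned
# by an offline server can simply be re-attached to a surviving server.

def reassign_floating_disks(cluster: Cluster) -> None:
    survivors = [s for s in cluster.servers if s.online]
    # Stay serviceable as long as fewer than half the servers are down.
    if len(survivors) <= len(cluster.servers) / 2:
        raise RuntimeError("quorum lost; cannot reassign disks")
    for server in cluster.servers:
        if server.online:
            continue
        while server.disks:
            disk = server.disks.pop()
            # Hand the disk to the surviving server that currently owns the fewest.
            target = min(survivors, key=lambda s: len(s.disks))
            target.disks.append(disk)

# Example: a four-server cloud cluster where one server goes offline.
cluster = Cluster(servers=[
    Server("s1", disks=[Disk("d1", "blob"), Disk("d2", "blob")]),
    Server("s2", disks=[Disk("d3", "blob"), Disk("d4", "blob")]),
    Server("s3", disks=[Disk("d5", "blob"), Disk("d6", "blob")]),
    Server("s4", disks=[Disk("d7", "blob"), Disk("d8", "blob")], online=False),
])
reassign_floating_disks(cluster)
```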