Qumulo’s Distributed File System
Data protection based on erasure coding
Protecting against data loss in the event of disk failure always includes some form of redundancy, or duplication of information across storage devices.
The simplest form of data protection is mirroring. Mirroring means that there are two or more full copies of the data being protected. Each copy resides on a different disk so that it is recoverable should one of the disks fail. Mirroring is simple to implement, but has disadvantages compared to more modern data protection techniques. Mirroring is wasteful in terms of the space required for data protection; further, it only handles a single disk failure, which generally isn’t a high enough level of safety as node density and cluster sizes increase.
Other strategies for data protection include RAID striping. However, RAID is notoriously complex to administer, and slow rebuild times force admins to choose between unacceptably long rebuilds and unacceptable storage efficiency.
SBS implements its block-based data protection with an efficient technique known as erasure coding (EC).
EC is faster, more configurable and more space-efficient than alternatives such as mirroring and RAID striping. EC encodes block data using redundant segments that are stored across different physical media. Because of EC’s efficiency, more of the disk is available for data as compared with RAID and mirroring schemes, and this lowers the cost per usable terabyte.
EC can be configured to trade off performance, recovery time after a media failure, and the number of simultaneous failures that can be tolerated. We'll use the notation (m, n) to indicate a specific EC configuration, where m is the total number of blocks of physical media used to safely encode n user blocks. The encoding has the property that up to m - n blocks can be destroyed without loss of user data. In other words, the survival of any collection of n disks is enough to recover all the user data, even if some of the failed disks contained user data rather than parity. The efficiency of the encoding is the ratio n / m, user blocks divided by total blocks.
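The two properties above can be captured in a few lines. The following sketch (the function name is illustrative, not part of any Qumulo API) computes the fault tolerance and space efficiency for the configurations discussed in this document:

```python
def ec_properties(m: int, n: int) -> dict:
    """For an (m, n) erasure code: m total blocks encode n user blocks."""
    return {
        "tolerated_failures": m - n,  # blocks that can be lost safely
        "efficiency": n / m,          # fraction of raw space holding user data
    }

print(ec_properties(3, 2))   # tolerates 1 failure, 2/3 efficient
print(ec_properties(6, 4))   # tolerates 2 failures, 2/3 efficient
print(ec_properties(10, 8))  # tolerates 2 failures, 4/5 efficient
```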
EC is easiest to understand with examples. Here is a simple example called (3,2) encoding.
A (3,2) encoding requires three blocks (m = 3), stored on three distinct physical devices, to safely encode two blocks (n = 2). Two of the blocks contain the user data we want to protect, and the third, called a parity block, holds values calculated by the erasure coding algorithm. Even this simple scheme is more efficient than mirroring: you write only one parity block for every two data blocks. In a (3,2) encoding, if the disk containing any one of the three blocks fails, the user data in data blocks 1 and 2 remains safe.
Here’s how it works. If data block 1 is available, then you simply read it. The same is true for data block 2. However, if data block 1 is down, the EC system reads data block 2 and the parity block and reconstructs the value of data block 1 using the Reed-Solomon formula (which in this particular example is just bit-wise XOR).
Similarly, if data block 2 resides on the failed disk, the system reads data block 1 and the parity block. SBS places the blocks on different spindles so that the system can read from both simultaneously. A (3,2) encoding has an efficiency of n / m = 2 / 3, or 67 percent. While that is better than the 50 percent efficiency of mirroring, a (3,2) encoding still protects against only a single disk failure.
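The reconstruction described above can be sketched directly: with a single parity block, the Reed-Solomon computation reduces to bitwise XOR. This is an illustration of the technique, not Qumulo's actual code:

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equal-sized blocks; with one parity block, EC is just XOR."""
    return bytes(x ^ y for x, y in zip(a, b))

data1 = b"\x01\x02\x03\x04"
data2 = b"\x10\x20\x30\x40"
parity = xor_blocks(data1, data2)   # written to a third disk

# The disk holding data1 fails: rebuild it from data2 and the parity block.
recovered = xor_blocks(data2, parity)
assert recovered == data1
```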
At a minimum, Qumulo uses (6,4) encoding, which stores a third more user data in the same amount of raw space as mirroring and tolerates two disk failures instead of mirroring's one. Even if two of the blocks containing user data are unavailable, the system needs to read only the two remaining data blocks and the two parity blocks to recover the missing data.
Distribution of protected virtual blocks across nodes
There are many practical considerations to take into account when implementing erasure coding in systems with massive scalability. To make the process of writing the required parity blocks easier (and to restore data when a disk fails), SBS divides its virtual address space of 4K blocks into logical segments called protected stores, or pstores.
SBS manages each pstore individually, which makes the mapping scheme that associates the protected address space to the disks more flexible. All pstores are the same size. Data protection is entirely implemented at the pstore level of the system. A pstore can be thought of as a table that maps ranges of protected virtual block addresses to contiguous regions of storage that reside on virtual disks of the nodes of the Qumulo cluster.
The contiguous regions are called bstores. The map of pstores to bstores is stored by each node of the cluster.
For reliability, nodes of the cluster use a distributed algorithm called Paxos to maintain consensus about globally shared knowledge such as the pstore-to-bstore map. The cluster forms a quorum of nodes to ensure the safety of the cluster’s critical data structures. Each bstore uses a segment of a specific virtual disk (that is, the bstore is associated with a particular SSD and HDD pair).
Each bstore is allocated contiguous space on its associated HDD, while space on the bstore’s associated SDD is dynamically allocated. Metadata about a bstore also exists on its associated SSD. Bstore metadata includes information such as the addresses in use, and a map that indicates which block addresses in the bstore reference SSD storage and which are on HDD.
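The pstore-to-bstore mapping described above can be sketched as a simple data structure. All of the field names and types below are hypothetical stand-ins chosen for illustration; they are not Qumulo's internal representation:

```python
from dataclasses import dataclass

@dataclass
class Bstore:
    node_id: int       # which cluster node holds this bstore
    virtual_disk: int  # the SSD/HDD pair the bstore is associated with
    hdd_offset: int    # start of its contiguous region on the HDD
    is_parity: bool    # a bstore holds data or parity for its lifetime, never both

# Each pstore maps a range of protected virtual addresses to a set of
# bstores spread across different nodes (shown here for a (3,2) layout).
pstore_map: dict[int, list[Bstore]] = {
    0: [
        Bstore(node_id=1, virtual_disk=3, hdd_offset=0, is_parity=False),
        Bstore(node_id=2, virtual_disk=7, hdd_offset=0, is_parity=False),
        Bstore(node_id=3, virtual_disk=2, hdd_offset=0, is_parity=True),
    ],
}
```

Because this map is compact, every node can hold a copy in memory, which is what makes the rebuild lookup described later so fast.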
During a read or write, the pstore decides which bstores need to be accessed. When a client of the file system initiates a write operation, it goes into SBS as a stream of raw data. The system figures out which bstores to write the data to, calculates the parity data, and writes both the raw data and parity data to the SSDs at the same time, even if the SSDs are on many different nodes. Once the data is written, the user gets an acknowledgement that the write has taken place.
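The write path above can be sketched as follows. This is a simplified single-parity illustration with invented names (`FakeBstore`, `handle_write`); the real system computes parity per its EC configuration and writes to SSDs on potentially many nodes before acknowledging:

```python
from functools import reduce

BLOCK = 4096  # the system uses 4K blocks

class FakeBstore:
    """Stand-in for an SSD-backed bstore (illustrative only)."""
    def __init__(self):
        self.ssd: list[bytes] = []
    def write_ssd(self, block: bytes) -> None:
        self.ssd.append(block)

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def handle_write(raw: bytes, data_stores: list, parity_store) -> str:
    # Split the incoming stream into fixed-size blocks, padding the last.
    blocks = [raw[i:i + BLOCK].ljust(BLOCK, b"\0") for i in range(0, len(raw), BLOCK)]
    parity = reduce(xor, blocks)          # single-parity EC for brevity
    for store, block in zip(data_stores, blocks):
        store.write_ssd(block)            # SSDs may be on different nodes
    parity_store.write_ssd(parity)
    return "ack"                          # only now does the client hear back

d1, d2, p = FakeBstore(), FakeBstore(), FakeBstore()
print(handle_write(b"A" * BLOCK + b"B" * BLOCK, [d1, d2], p))
```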
Data blocks containing user data and parity blocks are both written to bstores. For its lifetime, a particular bstore contains either parity blocks or data blocks, but never both. Because EC is implemented at the pstore layer of SBS, bstores that contain parity blocks and bstores that contain data blocks behave identically. The amount of storage allocated to a bstore depends on the EC configuration. To keep pstore size consistent, bstore size changes with the coding scheme. For example, if the pstore is 70GB, then with (6,4) encoding the pstore's user data is divided among four data bstores of about 17.5GB each. With (10,8) encoding, the bstores are about half that size.
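The sizing arithmetic above is simply the pstore size divided by the number of data blocks in the coding scheme. A minimal sketch (the 70GB pstore size is the example figure from the text; the function name is illustrative):

```python
def bstore_size_gb(pstore_gb: float, n_data_blocks: int) -> float:
    """A pstore's user data is divided across n data bstores."""
    return pstore_gb / n_data_blocks

print(bstore_size_gb(70, 4))  # (6,4) coding: 17.5 GB per data bstore
print(bstore_size_gb(70, 8))  # (10,8) coding: 8.75 GB, about half that size
```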
In the cloud, Qumulo uses the same data protection scheme as it does on-premises, with one exception. On-premises, the data protection scheme requires at least four nodes in a cluster. In the cloud, a single-node cluster is possible because Qumulo's file system can take advantage of the built-in mirroring of elastic block storage. Single-node Qumulo clusters in the cloud use (5,4) erasure coding.
Fast rebuild times
Qumulo rebuild times are measured in hours. In contrast, legacy storage systems, designed for workloads with far less data, have rebuild times measured in days.
Large numbers of files, mixed workloads, and increasing disk density have all contributed to the crisis in rebuild times for legacy storage appliances. Qumulo’s dramatic advantage is a direct result of SBS’s advanced block-based protection. Block-based protection is ideal for today’s modern workloads where there are petabytes of data and millions of files, many of which are small.
The SBS protection system doesn’t need to do time-consuming tree walks or file-by-file rebuild operations. Instead, the rebuild operations work on the pstores. The result is that rebuild times aren’t affected by file size. Small files are handled as efficiently as large files, and protecting millions of files is completely feasible. In addition, Qumulo’s file system is designed so that rebuild times aren’t adversely affected by cluster size. In fact, the opposite is true. With Qumulo, larger clusters have faster rebuild times than smaller clusters.
The reason for this is that SBS distributes the rebuild calculation effort across nodes of the cluster. The more nodes, the less work each node needs to do during a rebuild. Legacy storage appliances with slow rebuild times are vulnerable to additional failures that can occur during the prolonged rebuilding process.
In other words, slow rebuild times have a negative impact on reliability. Typically, admins compensate for this by overprovisioning (that is, decreasing efficiency by adding data redundancy). With Qumulo’s fast rebuild times, admins can maintain their Mean Time To Data Loss (MTTDL) targets without a great deal of redundancy, saving both money and time.
Rebuilding the pstores
When a disk fails, the protected store still exists. It can always be written to and read from. However, some pstores will have missing or damaged bstores; these are called degraded pstores. Because of EC, degraded pstores can still be read, but the data is no longer fully protected. In other words, after the first failure you still have data integrity, but you are one disk closer to data loss. To re-protect the data, the system works pstore by pstore (rather than file by file with RAID groups, as in legacy systems) to reconstruct the bstores that were located on the failed disk drive.
SBS allocates a small amount of extra disk space, so there is room to do this. This is called sparing. Since the global pstore-to-bstore map contains the ID of the bstore’s associated virtual disk, this information makes it easy to know which pstores need processing when a particular disk fails. Since the map that associates pstores with bstores is small enough to reside in every node’s memory, nodes can quickly translate virtual block addresses from pstore to bstore.
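Because the map is in memory, finding the pstores affected by a failed disk is a scan of that map rather than a tree walk of the file system. A hypothetical sketch, with the map reduced to pstore IDs and their bstores' virtual disk IDs:

```python
# pstore id -> virtual disk IDs of that pstore's bstores (illustrative data)
pstore_map: dict[int, list[int]] = {
    0: [3, 7, 2],
    1: [5, 3, 9],
    2: [7, 9, 4],
}

def degraded_pstores(failed_disk: int) -> list[int]:
    """Every pstore with a bstore on the failed disk needs re-protection."""
    return [pid for pid, disks in pstore_map.items() if failed_disk in disks]

print(degraded_pstores(3))  # pstores 0 and 1 need rebuilding
```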
During the rebuild process, SBS reads and writes bstores sequentially. Since bstores are laid out contiguously on disk, degraded pstores can be rebuilt very quickly. Sequential operations are much faster than many small I/O operations, which can be slow and cause disk contention. SBS's rebuild process is efficient: each disk is involved in exactly one read or write stream at a time during the rebuild, and no random I/O is required.
Also, bstores are small enough so that the re-protect work is efficiently distributed across the entire cluster.
Normal file operations unaffected by rebuilds
In legacy file systems, lock contention affects rebuild times and slows down standard file system operations during the rebuild. This is because these file operations compete with the rebuild/reprotect threads. Qumulo uses write layering with independent locking schemes so that rebuild operations don’t contend with normal use of the file system.
When there is a failure, it makes no sense to write to the incomplete set of bstores in the degraded pstores. The new writes would not be fully protected and it would complicate the bstore reconstruction work. However, the cluster must not experience downtime during the rebuild operation, and as a result user-initiated write operations cannot wait for a pstore to be re-protected. To perform those writes, the system adds a new layer of virtual bstores to the degraded pstore. This is called “pushing a layer.” Writes go to the new layer of bstores and reads combine the values from each layer. Here is an example:
New writes go into the top layer of bstores. A read combines the values from the top layer and the bottom layer by using EC. Once the bstore is reconstructed, the push layer goes away. The layers are constructed using components of SBS’s transaction system in a way that makes them non-blocking.
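The layering behavior can be sketched as follows. The class and method names are invented for illustration, and the bottom-layer read is shown as a plain lookup where the real system may reconstruct the value via EC:

```python
class LayeredBstore:
    """Illustrative sketch of a degraded pstore after 'pushing a layer'."""

    def __init__(self, bottom: dict[int, bytes]):
        self.bottom = bottom          # original (possibly degraded) blocks
        self.top: dict[int, bytes] = {}  # new writes made during the rebuild

    def write(self, addr: int, block: bytes) -> None:
        self.top[addr] = block        # never write to the degraded layer

    def read(self, addr: int) -> bytes:
        if addr in self.top:
            return self.top[addr]     # newest data wins
        return self.bottom[addr]      # may involve EC reconstruction

store = LayeredBstore({0: b"old"})
store.write(0, b"new")
assert store.read(0) == b"new"        # reads combine the layers
```

Once the degraded bstore is reconstructed, the top layer can be merged down and discarded, which matches the text's description of the push layer going away.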
Small files are as efficient as large files
Because the Qumulo file system uses block-based protection, small files are as efficient as large files. They can share stripes with other files, and they share the same protection. Each file consists of data blocks, at least one inode block, and any other blocks that are required. Very small files are inlined into the inode block. The system uses 4K blocks, and all blocks are protected at the system-wide protection ratio.
Qumulo’s efficiency with small files is a big advantage compared to legacy storage appliances, which use inefficient mirroring for small files and system metadata.
All provisioned capacity is available for user files
Qumulo user files can occupy 100 percent of provisioned capacity, while legacy scale-out systems recommend using only 80 to 85 percent. Qumulo's block-based protection requires no user-provisioned capacity for re-protection, other than a small amount of space for sparing, which is excluded from user-provisioned capacity. In contrast, legacy storage appliances implement protection either with fixed RAID groups or with erasure coding on a file-by-file basis, which means that re-protection also happens at the file level and consumes user-provisioned capacity. In addition, the Qumulo file system accurately reports the capacity available for user files.
Again, this predictability is a consequence of block-based protection. In legacy systems, storage use depends on file size, so these systems can only report on raw space—admins are left to guess how much space they actually have. When comparing Qumulo to legacy systems, you’ll want to consider how much user-provisioned capacity is actually available to use in each type of system.