This story comes to you as part 2 of 2 in a short series covering erasure coding and RAID, two of the best known methods for data protection in storage systems. Below we’ll explain how Qumulo’s distributed file system is modernizing erasure coding storage methods with block-based protection, and we will explore how to implement erasure coding in a file storage system.
Qumulo Core, the foundation of Qumulo’s distributed file system, provides extremely reliable data storage (measured in tens of thousands of years of reliability). We achieve this in part through efficient erasure coding, which is one of many data services that come standard with Qumulo Core—designed to keep your data safe.
Implementing erasure coding in file storage systems
There are many practical considerations to take into account when implementing erasure coding in file storage systems. To make the process of writing the required parity blocks easier (and to restore data when a disk fails), Scalable Block Store (SBS) divides its virtual address space of 4K parity blocks into logical segments called protected stores, or pstores.
SBS manages each pstore individually, which makes the mapping scheme that associates the protected address space to the disks more flexible. All pstores are the same size. Data protection is entirely implemented at the pstore level of the system. A pstore can be thought of as a table that maps ranges of protected virtual block addresses to contiguous regions of storage that reside on virtual disks of the nodes of the Qumulo cluster.
The contiguous regions are called bstores. The map of pstores to bstores is stored by each node of the cluster.
pstore vs bstore, what’s the difference?
- pstore: Virtual address space divided into 4K blocks.
- bstore: Contiguous regions of the virtual blocks. (pstores map to bstores.)
For reliability, nodes of the cluster use a distributed algorithm called Paxos to maintain consensus about globally shared knowledge such as the pstore-to-bstore map. The cluster forms a quorum of nodes to ensure the safety of the cluster’s critical data structures. Each bstore uses a segment of a specific virtual disk (that is, the bstore is associated with a particular SSD and HDD pair).
Each bstore is allocated contiguous space on its associated HDD, while space on the bstore’s associated SDD is dynamically allocated. Metadata about a bstore also exists on its associated SSD. Bstore metadata includes information such as the addresses in use, and a map that indicates which block addresses in the bstore reference SSD storage and which are on HDD.
During a read or write, the pstore is used to decide which bstores need to be accessed. When a client of the file system initiates a write operation, it goes into SBS as a stream of raw data. The system figures out which bstores to write the data to, calculates the parity data, and writes both the raw data and parity data to the SSDs at the same time, even if the SSDs are on many different nodes. Once the data is written, the client gets an acknowledgement that the write has taken place.
Data blocks that contain user data and parity blocks are both written to bstores. A particular bstore, for its lifetime, either contains parity blocks or data blocks, but not both. Because erasure coding is implemented at the pstore layer of SBS, bstores that contain parity blocks and bstores that contain data blocks behave identically. The amount of storage allocated to a bstore depends on the erasure coding scheme in place. To keep the pstore size consistent, the system’s bstore size changes according to the coding scheme. For example, if the pstore is 70GB, then, with 6,4 encoding, each bstore is about 17.5GB, which divides the pstore into 4 bstores. For 10,8 encoding, the bstores will be about half that size.
In the cloud, Qumulo uses the same data protection scheme as it does on-premises, with one exception. On-premises, the data protection scheme requires that there be at least four nodes in a cluster. In the cloud, it’s possible to have a single-node cluster, because Qumulo’s file system can use the built-in mirroring that is in every elastic storage block. Single-node Qumulo clusters in the cloud use 5,4 erasure coding.
Never down for the count
Qumulo rebuild times are measured in hours. In contrast, legacy storage systems designed for workloads with far less data, have rebuild times that are measured in days, and some longer than that.
Large numbers of files, mixed workloads, and increasing disk density have all contributed to the crisis in rebuild times for legacy storage appliances. Qumulo’s dramatic advantage is a direct result of SBS’s advanced block-based protection. Block-based protection is ideal for today’s modern workloads where there are petabytes of data and millions or billions of files, many of which are small.
The SBS protection system doesn’t need to do time-consuming tree walks or file-by-file rebuild operations. Instead, the rebuild operations work on the pstores. The result is that rebuild times aren’t affected by file size. Small files are handled as efficiently as large files, and protecting billions of files is completely feasible. In addition, Qumulo’s file system is designed so that rebuild times aren’t adversely affected by cluster size. In fact, the opposite is true. With Qumulo, larger clusters actually have faster rebuild times than smaller clusters.
The reason for this is because SBS distributes the rebuild calculation effort across nodes of the cluster. The more nodes, the less work each node needs to do during a rebuild. Legacy storage appliances with slow rebuild times are at risk of data loss if they incur additional failures during the time the system is already vulnerable during a prolonged rebuild process.
In other words, slow rebuild times have a negative impact on reliability. Typically, admins compensate for this by over-provisioning (that is, decreasing efficiency by adding data redundancy). With Qumulo’s fast rebuild times, admins can maintain their Mean Time To Data Loss (MTTDL) targets without a great deal of data redundancy, saving both money and time.
Rebuilding the pstores in the event of drive failure
When a disk fails, the protected store still exists. It can always be read to and written from. However, some pstores will have missing or damaged bstores. These are called degraded pstores. Because of erasure coding, you can continue to read the degraded pstores, but the data is no longer fully protected. In other words, at the first failure, you still have data integrity but are one drive failure closer to data loss. To re-protect the data, the system works pstore by pstore (rather than file by file in legacy RAID systems) to reconstruct the bstores that were located on the failed disk drive.
SBS allocates a small amount of extra disk space, so there is room to do this. This is called sparing. Since the global pstore-to-bstore map contains the ID of the bstore’s associated virtual disk, this information makes it easy to know which pstores need processing when a particular disk fails. Since the map that associates pstores with bstores is small enough to reside in every node’s memory, nodes can quickly translate virtual block addresses from pstore to bstore.
During the rebuild process, SBS reads and writes bstores sequentially. Since bstores are laid out contiguously on the disk, degraded pstores can be rebuilt very quickly. Sequential operations are much faster than many small I/O operations, which can be slow and cause disk contention. SBS’s rebuild process is efficient—disks are involved in exactly one read or write stream at a time during the rebuild process. No random I/O is required.
Also, bstores are small enough so that the re-protect work is efficiently distributed across the entire cluster.
Business as usual – even during rebuilds
In legacy file systems, lock contention affects rebuild times and slows down standard file system operations during the rebuild. This is because these file operations compete with the rebuild/re-protect threads. Qumulo uses write layering with independent locking schemes so that rebuild operations don’t contend with normal use of the file system.
When there is a failure, it makes no sense to write to the incomplete set of bstores in the degraded pstores. The new writes would not be fully protected and it would complicate the bstore reconstruction work. However, the cluster must not experience downtime during the rebuild operation, and as a result, user-initiated write operations cannot wait for a pstore to be re-protected. To perform those writes, the system adds a new layer of virtual bstores to the degraded pstore. This is called “pushing a layer.” Writes go to the new layer of bstores and reads combine the values from each layer.
Here is an example:
New writes go into the top layer of bstores. A read combines the values from the top layer and the bottom layer by using erasure coding. Once the bstore is reconstructed, the push layer goes away. The layers are constructed using components of SBS’s transaction system in a way that makes them non-blocking.
Small file storage efficiency
Because the Qumulo file system uses block-based protection, small files are as efficient as large files.
They can share stripes with other files, and they can share the protection. Each file consists of the data blocks, at least one inode block, and any other blocks that are required. Very small files are in-lined into the inode block. The system uses 4K blocks and all the blocks are protected at the system protection ratio.
Qumulo’s storage efficiency with small files is a big advantage compared to legacy storage appliances, which use inefficient mirroring for small files and system metadata. Legacy storage systems also use 8K block size, which, with small files, can create great inefficiencies.
It’s your storage – use 100% of it
Qumulo user files can occupy 100 percent of provisioned capacity, while legacy scale-out recommends only using 80 to 85 percent. Qumulo’s block-based protection requires no user-provisioned capacity for re-protection, other than a small amount of space for sparing, which is excluded from user-provisioned capacity.
In contrast, legacy storage appliances implement protection either with fixed RAID groups, or with erasure coding on a file-by-file basis, which means that re-protection also happens at the file level and requires user-provisioned capacity to recover. (See erasure coding vs RAID explained for further details.) In addition, the Qumulo file system accurately reports capacity available for user files.
Again, this predictability is a consequence of block-based protection. In legacy systems, storage use depends on file size, so these systems can only report on raw space—admins are left to guess how much space they actually have. When comparing Qumulo to legacy systems, you’ll want to consider how much user-provisioned capacity is actually available to use in each type of system.
Editor’s Note: Originally published November 11, 2021, this story has been updated for accuracy and comprehensiveness.
Explore a fully functional Qumulo environment, right in your browser.
Test drive Qumulo for free
Explore a fully functional Qumulo environment, right in your browser.Try Demo