Erasure coding (EC) is one of the best-known methods for data protection, due in part to its efficiency, as more of the disk is available for data compared with RAID and mirroring strategies.
One of the main advantages of erasure coding is the flexibility it offers. IT administrators can decide how to strike the right balance between performance and recovery time from physical media failure and the number of concurrent failures they need the system to be able to withstand.
Erasure coding is easiest to understand with examples, which we cover in detail below. But first, you may be wondering a few things. What the heck is erasure coding? How does it compare to RAID striping and mirroring? And what are its advantages and disadvantages relative to those methods? These are all important questions that we’ll clear up, putting your enterprise in the best position to keep your data safe.
What is erasure coding?
Erasure coding is a storage data protection method that uses advanced mathematics to let file system software regenerate missing data from the data that survives plus extra calculated blocks called parity blocks. As we will explain below, erasure coding offers superior data protection to a mirror copy mainly because it doesn’t require a full second copy of the data, yet it can restore missing data as long as the number of failures stays within the configured limit.
Erasure coding vs. RAID: pros and cons
To see why erasure coding is superior to other methods of data protection, it helps to understand the main forms of data protection in use, along with their advantages and disadvantages.
Redundant Array of Inexpensive Disks (RAID)
RAID has been around a long time. The most basic data protection configuration is RAID 1, also called Mirroring. As its name suggests, mirroring involves recording data simultaneously to two (or more) drives, thereby making identical copies—mirror images, so to speak.
In a RAID 1 mirroring configuration, because each copy resides on a separate disk, data is recoverable from the ‘mirror image’ should the primary disk in a set fail. Mirroring is simple to implement, but it has some disadvantages. Since mirroring requires at least one full copy of the data, it is wasteful in terms of the space required for data protection. Also, a two-way mirror can only handle a single drive failure at a time, which isn’t enough protection for many use cases, particularly as cluster sizes increase.
Beyond mirroring, the RAID standard offers other configurations to optimize for performance, protection, or both. A common option is RAID 5, or disk striping with parity, which improves on the storage efficiency and read performance of mirroring. However, these more advanced RAID configurations can become extremely complex and difficult to manage and maintain. And in the event of a component failure, rebuild times with RAID can be unacceptably slow, which significantly degrades performance for users.
RAID can’t do it all for storage data protection, and building a RAID configuration often forces a difficult choice: should IT admins prioritize strong data protection, performance, or storage efficiency? The answer is they want all three, but RAID can’t deliver.
Qumulo Core architecture is built around Qumulo Scalable Block Store (SBS), which is the foundation layer that enables efficient block-based data protection with erasure coding.
Erasure coding is entirely different from RAID and addresses RAID’s shortcomings. Unlike RAID striping or mirroring, erasure coding provides scalable protection for massive data storage: it is more performant, more configurable, and more space-efficient, allowing clusters to grow while maintaining full data protection and responsiveness.
Erasure coding uses advanced mathematics (in this case, Reed-Solomon codes) to regenerate missing data from the data that survives plus calculated parity blocks.
So, unlike RAID mirroring, which requires a complete second copy, erasure coding allows greater efficiency, requiring just one parity block for every two data blocks in the simplest case (called 3,2 encoding).
Erasure coding explained (examples)
Erasure coding is easiest to understand with examples. Here is our 3,2 encoding example:
In a 3,2 encoding, three blocks (m = 3) are spread across three distinct physical devices. Blocks 1 and 2 contain the user data we want to protect (n = 2), and the third is called a parity block. The contents of the parity block are calculated using the erasure coding algorithm.
Since each block is written to a separate drive, any one of the three drives could fail and the information stored in blocks 1 and 2 is still safe, because a missing data block can be recreated from the surviving data block and the parity block.
How erasure coding works
Here’s how it works. If data block 1 is available, the system simply reads it. The same is true for data block 2. However, if data block 1 is missing, the erasure coding system reads data block 2, plus the parity block, and reconstructs the value of data block 1.
Similarly, if data block 2 resides on the failed disk, the system reads data block 1 and the parity block. SBS always makes sure that the blocks are on different spindles so the system can read from blocks simultaneously.
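With only one parity block, the reconstruction described above can be illustrated with XOR parity, the simplest single-parity scheme. The Python sketch below rests on that assumption and is not the actual SBS implementation, which the article says is based on Reed-Solomon. The block contents and the helper name xor_blocks are hypothetical.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length blocks byte by byte."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))

# Two data blocks (n = 2), each destined for its own drive.
block1 = b"QUMULO__"
block2 = b"STORAGE!"

# The parity block is stored on a third drive (m = 3 blocks in total).
parity = xor_blocks(block1, block2)

# Drive holding block 1 fails: rebuild it from block 2 and the parity block.
rebuilt1 = xor_blocks(block2, parity)

# Drive holding block 2 fails: rebuild it from block 1 and the parity block.
rebuilt2 = xor_blocks(block1, parity)
```

The same XOR that produces the parity also undoes it, which is why any one of the three blocks can be lost without losing user data.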
A 3,2 encoding has an efficiency of 2 / 3 (n/m), or 67%. While that is better than the 50% efficiency of mirroring, 3,2 encoding can still only protect against a single disk failure.
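The same arithmetic applies to any m,n encoding. The short Python sketch below is illustrative only: the function names are made up for this example, and treating a two-way mirror as a 2,1 encoding is a simplifying assumption rather than Qumulo terminology. The failure count assumes a code such as Reed-Solomon, in which every parity block covers one lost block.

```python
def efficiency(m: int, n: int) -> float:
    """Fraction of raw capacity holding user data: n data blocks out of m total."""
    return n / m

def tolerated_failures(m: int, n: int) -> int:
    """Number of lost blocks a stripe survives, assuming one per parity block."""
    return m - n

# (m, n) pairs: total blocks per stripe, data blocks per stripe.
schemes = {
    "2-way mirror (modeled as 2,1)": (2, 1),
    "3,2 encoding": (3, 2),
    "6,4 encoding": (6, 4),
}

for name, (m, n) in schemes.items():
    print(f"{name}: {efficiency(m, n):.0%} efficient, "
          f"survives {tolerated_failures(m, n)} failure(s)")
```

A 6,4 encoding keeps the same 67% efficiency as 3,2 while surviving two failures instead of one.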
Erasure coding provides configurable data protection
Erasure coding can be configured to optimize for performance, for recovery time in the case of failed media, or for greater resilience, up to any four failed disks or any four failed nodes at once. Generally, increased protection comes at the cost of usable capacity.
At a minimum, Qumulo uses 6,4 encoding, which stores a third more user data than mirroring in the same amount of space and can tolerate two disk failures, where mirroring and 3,2 encoding tolerate only one. In a 6,4 configuration, even if two blocks containing user data are unavailable, the system only needs to read the two remaining data blocks and the two parity blocks to recover the missing data.
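To make the 6,4 recovery concrete, here is a minimal Reed-Solomon-style sketch in Python (3.8 or later, for the modular inverse via pow). It is a teaching example under stated assumptions, not Qumulo's implementation: it works on one small integer symbol per block, uses arithmetic modulo the prime 257 instead of the GF(2^8) arithmetic that production systems typically use, and the function names are hypothetical. The four data symbols are treated as points on a polynomial, and the two parity symbols are two more points on the same polynomial, so any four of the six points are enough to recover the data.

```python
P = 257  # small prime modulus; data symbols are byte values 0..255

def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial through points [(xi, yi), ...], mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse of den (Python 3.8+).
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def encode(data, total_blocks):
    """Return the data symbols followed by (total_blocks - len(data)) parity symbols."""
    points = list(enumerate(data))  # data symbol i sits at x = i
    parity = [lagrange_eval(points, x) for x in range(len(data), total_blocks)]
    return list(data) + parity      # note: a parity symbol may equal 256

def recover(stripe, n):
    """Rebuild the n data symbols from a stripe in which lost blocks are None."""
    known = [(x, y) for x, y in enumerate(stripe) if y is not None]
    if len(known) < n:
        raise ValueError("too many blocks lost to recover the data")
    return [lagrange_eval(known[:n], x) for x in range(n)]

data = [10, 200, 33, 7]      # four data symbols (n = 4)
stripe = encode(data, 6)     # six symbols in total (m = 6)

damaged = list(stripe)
damaged[0] = None            # two data blocks become unavailable
damaged[2] = None
restored = recover(damaged, 4)
```

Here restored equals the original data, and the same holds whichever two of the six blocks are lost. Losing a third block leaves only three known points, which is why recover raises an error in that case.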
What does this all mean?
Because Qumulo Core erasure coding works at the block level rather than at the file level, as other file platforms do, it protects data effectively without creating a 1:1 copy of the entire data volume, and the size of files has no impact on encoding and recovery times. Whether files are mammoth or mini, encoding and recovery performance is not just fast but also dependable.
Other systems can take hours, days, or longer to recover from an event, depending on the mix of file sizes stored on the cluster. Qumulo recovers quickly and reliably without impacting performance, regardless of the mix of file data stored. This also lets Qumulo customers use the largest, most economical drives on the market without risk.
Learn more in part 2!
In the next entry in this 2-part series on erasure coding, we explain how to implement erasure coding in storage systems for the modern digital era, with massive scalability.
Editor’s Note: Originally published November 3, 2021, this story has been updated for accuracy and comprehensiveness.