At Qumulo we sometimes get asked whether QF2 is a log-structured file system. The short answer is “No,” but there’s an interesting story. It turns out that some of the innovative aspects of QF2’s design draw from a number of groundbreaking techniques developed for file systems over the years, including log-structured file systems. To explain this, we’ll give a bit of history about how file systems have ensured consistency in the event of unclean shutdowns such as power failures.
Then, we’ll dive into how engineers at Qumulo solved some of the problems of legacy file systems by using a new approach called a scalable block store (SBS).
Early disk file systems required a full walk of all on-disk data structures, with utilities such as fsck, to bring the file system back to a consistent state after an interruption such as a power failure. Although utilities like fsck are time-consuming and keep the file system offline, their walk-and-repair approach was acceptable because disks were small.
As disk sizes grew, the outage time caused by walk-and-repair became too long. There was a desire for faster recovery to a consistent state after a crash.
The first attempt at solving the walk-and-repair bottleneck occurred in the 1980s with the introduction of journaling file systems. A journal is a history of recent modifications to a file system. Because the journal records all recent changes, journaling file systems guarantee consistency after a crash with far less work than a full walk-and-repair operation: the system only needs to read the most recent journal entries to repair its data structures. Journaling also improved the chances that recently written data could be recovered after a crash.
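The journaling idea can be illustrated with a minimal sketch. This is a toy in-memory model, not any real file system’s journal (which records block-level changes on disk), but it shows the two essential properties: every update is written twice, and recovery replays only recent journal entries instead of walking everything.

```python
class JournalingStore:
    """Toy model of a journaling file system (illustrative only)."""

    def __init__(self):
        self.journal = []   # append-only history of recent changes
        self.data = {}      # the "real" data store

    def write(self, key, value):
        self.journal.append((key, value))  # first write: the journal
        self.data[key] = value             # second write: the data store

    def recover(self):
        # After a crash, replay the recent journal entries instead of
        # walking every data structure the way fsck would.
        for key, value in self.journal:
            self.data[key] = value
        return self.data
```

Note the double-write in `write()`: that is exactly the overhead that motivated the approaches described next.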
One of the problems of journaling file systems was that every write operation required double the effort because the system had to write to the data store and to the journal. The desire to eliminate double writing motivated some new approaches. For example, the first log-structured file system, conceived in 1988, did away with the write-twice penalty of journaling with a journal that was the file system. In other words, a log-structured file system was laid out on disk as one big sequential journal, or history of file modifications. Writing to the journal was the same operation as writing to the file system. As with journaling, such a file system required only a small amount of computation to bring the system to a consistent state after a crash. Unlike journaling, the log-structured approach did not require dual write operations.
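To make the “journal that is the file system” idea concrete, here is a minimal sketch (the class and method names are invented for illustration). Each write is a single sequential append to the log, and crash recovery is just a single scan of the log to rebuild the index.

```python
class LogStructuredFS:
    """Toy log-structured layout: the log IS the file system."""

    def __init__(self):
        self.log = []    # one big sequential history of modifications
        self.index = {}  # name -> position of the latest version in the log

    def write(self, name, data):
        self.log.append((name, data))        # single append; no second write
        self.index[name] = len(self.log) - 1

    def read(self, name):
        return self.log[self.index[name]][1]

    def rebuild_index(self):
        # Crash recovery: one scan of the log; later entries win.
        self.index = {name: i for i, (name, _) in enumerate(self.log)}
```

The index here is the “additional complexity” the next paragraph mentions: without it, every read would have to scan the log.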
The log-structured file system also improved write performance on spinning magnetic disks. Since changes were always appended to the end of the log, a sequence of file system changes could be buffered and written in a single disk write operation, reducing the need for time-consuming seek operations. The improvement in write performance was especially pronounced when writing small files.
Of course, the log-structured approach came with trade-offs, such as additional complexity to index the log for acceptable subsequent read performance and the need to implement automatic garbage collection to reclaim unused storage space and reduce fragmentation.
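The garbage-collection trade-off can also be sketched. Assuming a log of `(name, data)` entries and an index mapping each name to the position of its latest entry (both hypothetical structures for illustration), compaction copies only the live entries into a new log, reclaiming the space held by superseded versions:

```python
def compact(log, index):
    """Copy only live log entries into a new log; rebuild the index.

    `log` is a list of (name, data) entries; `index` maps each name to
    the position of its latest entry. Earlier entries are garbage.
    """
    new_log, new_index = [], {}
    for name, pos in index.items():
        new_index[name] = len(new_log)
        new_log.append(log[pos])
    return new_log, new_index
```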
Although the log-structured approach, as originally conceived, is not used in any production file system today, its core concept of using carefully designed data structures as a way to prevent inconsistencies and avoid the write-twice penalties of journaling was highly influential. Its influence can be seen in the “soft update” and “copy-on-write” approaches that followed.
Like log-structured file systems, soft updates have been tried as a way to avoid the double-write requirement of journaling. This approach uses a carefully designed, restricted pattern of ordered updates that, if interrupted by a crash, always leaves the file system in a consistent state. Soft updates solve the write-twice problem but can have the drawback of making some unused storage inaccessible. To cure this storage leak, file systems with soft updates come with additional complexity sometimes called “fsck in the background”. They are also extremely difficult to implement reliably.
Like log-structured file systems, copy-on-write file systems always write data out of place, that is, to newly allocated storage blocks, to avoid in-place updates to file metadata. The difference is that log-structured file systems write sequentially into a log while copy-on-write systems use nondestructive pointer magic to incorporate the new data into what is conceptually a new version of the entire file system. To do this they use data structures known (confusingly) as persistent data structures (or as immutable data structures) to represent file system metadata. WAFL and ZFS are examples of file systems that use copy-on-write techniques.
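The “nondestructive pointer magic” of persistent data structures comes down to path copying: updating a leaf copies only the nodes on the path from the root, and everything else is shared between the old and new versions. Here is a minimal sketch using a binary search tree (invented for illustration; real file systems use on-disk B-tree variants):

```python
class Node:
    """One node of a persistent (immutable) binary search tree."""

    def __init__(self, key, value, left=None, right=None):
        self.key, self.value = key, value
        self.left, self.right = left, right

def insert(root, key, value):
    # Returns a NEW root; the old tree is never modified in place.
    # Only the nodes on the root-to-leaf path are copied; all other
    # subtrees are shared with the previous version.
    if root is None:
        return Node(key, value)
    if key == root.key:
        return Node(key, value, root.left, root.right)
    if key < root.key:
        return Node(root.key, root.value, insert(root.left, key, value), root.right)
    return Node(root.key, root.value, root.left, insert(root.right, key, value))
```

After an insert, both the old and new roots remain valid, which is conceptually a new version of the entire tree at the cost of a handful of node copies.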
Broadly speaking, QF2 follows the spirit of log-structured file systems and copy-on-write file systems in avoiding the double-write problem of journaling file systems. In QF2, all new data is written in new blocks. Old blocks aren’t overwritten. Log-structured file systems, copy-on-write file systems and QF2 all share the characteristic of writing new data out of place. However, that’s about where the similarities end.
Unlike log-structured file systems and copy-on-write file systems, QF2 was designed from the ground up to be distributed across many computers for scalable performance and scalable capacity, with built-in data protection based on efficient erasure coding. As a result, the QF2 file system sits on top of a transactional virtual layer of protected storage blocks called the Scalable Block Store (SBS).
Here is a diagram of the SBS.
The SBS defines contiguous regions of storage that reside on virtual disks of the nodes of the QF2 cluster. These regions are called bstores. The bstores consist principally of two components, a block tree and a write ahead log (WAL). QF2 initially writes to the WAL as part of a transaction. All data blocks are written out of place. When the transaction is complete and it’s time to checkpoint data from the WAL, QF2 updates the block tree as a persistent data structure. Updating the block tree is conceptually like creating a new version of the entire file system but is implemented at very little cost. Blocks from the WAL aren’t copied into the block tree, just newly referenced.
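The bstore write path described above can be sketched as follows. This is a hypothetical toy model, not Qumulo’s implementation; the names `Bstore`, `wal`, and `block_tree` are invented for illustration. The key points it shows are that blocks are logged first, that data is always written out of place, and that checkpointing re-references WAL blocks rather than copying them.

```python
class Bstore:
    """Toy model of a bstore: WAL first, then checkpoint into the block tree."""

    def __init__(self):
        self.blocks = {}      # block id -> data (out-of-place allocation)
        self.wal = []         # (addr, block id) pairs not yet checkpointed
        self.block_tree = {}  # logical address -> block id
        self.next_id = 0

    def write(self, addr, data):
        bid, self.next_id = self.next_id, self.next_id + 1
        self.blocks[bid] = data        # new block; the old block is untouched
        self.wal.append((addr, bid))   # logged first, as part of a transaction

    def checkpoint(self):
        # Blocks are not copied out of the WAL -- the block tree simply
        # gains references to the blocks already written.
        for addr, bid in self.wal:
            self.block_tree[addr] = bid
        self.wal.clear()
```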
The WAL is a linked list (not a true persistent data structure), but when SBS appends to the WAL it overwrites only the terminating null value. Overwriting the tail of the WAL as an atomic operation is in the spirit of soft updates because the order of writing is carefully designed to cause no data corruption if writes are interrupted by a power failure. Writes live in the WAL for only a short period until changes to the block tree are complete.
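The append-by-overwriting-the-terminator idea can be sketched like this (a hypothetical layout, not the actual on-disk format). Fresh storage reads back as null, so after the single in-place overwrite of the terminator, the log is still well terminated; readers simply stop at the first null, and a crash at any point leaves either the old log or the new one, never a corrupt state.

```python
def new_wal(capacity=16):
    # Fresh, zeroed storage: every slot reads back as null.
    return [None] * capacity

def wal_append(wal, record):
    end = wal.index(None)  # position of the terminating null
    wal[end] = record      # the ONLY in-place overwrite; the next slot is
                           # already null, so the log stays well terminated
                           # even if we crash right here

def wal_entries(wal):
    # Readers stop at the first null terminator.
    out = []
    for rec in wal:
        if rec is None:
            break
        out.append(rec)
    return out
```

The carefully chosen order (record first, terminator already in place) is what makes the append safe against power loss, in the spirit of soft updates.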
All writes in QF2 go to SSDs, which means that the QF2 SBS can allocate blocks arbitrarily without the performance penalty of disk seeks. When the number of free blocks on the system’s SSDs becomes low, the SBS uses HDDs as a backing store (except, of course, in all-flash environments). When tiering from SSD to HDD, the SBS uses highly efficient sequential write patterns that maximize the performance of the HDDs.
So, in some ways, the bstores in the QF2 SBS are similar to log-structured file systems, except that they allocate blocks arbitrarily on SSD and don’t need to maintain a circular buffer. In other ways, QF2’s bstores are similar to copy-on-write systems, except that they operate at the block level and not the file level, and they write to virtual blocks that are always fronted by SSDs. Finally, the SBS WAL uses techniques similar to soft updates to provide protection in case of unexpected interruption such as power loss. We stand, as they say, on the shoulders of giants.
For a walkthrough of how the SBS works, see Qumulo File Fabric Technical Overview.
We are always looking for new challenges in enterprise storage. Drop us a line and we will be in touch.