At Qumulo, we build a scale-out file system. As you might expect, we have many “key-value stores” in our system: for example, there are distributed B-Trees for file system components like directories and extents. There are also various configuration key-values stores. These structures are built on our distributed protected store architecture. Recently, my team built a new distributed key-value store for defining the composition of the protection system used by our clustered file system (QSFS).
Roughly speaking, QSFS is built on top of a collection of Protected Stores (PStores). The PStores provide properties we need for our file system: transactionality, fault tolerance (achieved via erasure coding or mirroring), fault recovery, disk-independent addressing, efficient reads and writes, etc. PStores are composed of data blocks on SSDs and spinning disks on multiple nodes. How we achieve ACID properties with PStores is a topic for another post.
Protection System State
The mapping between PStores and their constituent drive components is fundamental to the reliability and performance of our system. We call this the PStore map. This map is normally static; however, it does change during reprotection after disk or node failure, after disk replacement, and when we add nodes to the system. Additionally, protection system state includes a globally committed transaction number which is used to checkpoint transactions. This number is normally incremented a few times a second. The Pstore map, the generation number, and a few other bits of information make up the protection system state we must reliably store in a distributed, durable, and consistent way.
The existing system for storing protection system state that we set out to replace was a collection of multi-Paxos stores. Every node in the cluster performed the duties of Paxos acceptor and Paxos learner. A Paxos proposer lived on nodes 1 or 2 (quorum leaders). This system had a number of problems we hoped to address with our new key-value store:
- Paxos acceptor and learner data was stored on system drives which were less reliable than the SSDs we used for PStore blocks. We found that even the relatively modest loads generated by the Paxos store led to some premature failures of these devices.
- The PStore map was stored as a single Paxos value. This value could be very large depending on cluster size. Although we update each PStore independently during reprotection (similar to drive rebuilds), these updates required locking and writing of the entire map. This resulted in performance bottlenecks and accelerated wear on our system drives.
- Because Paxos V1 was a system written early in Qumulo’s history, it didn’t utilize our newer testing methods and frameworks. As a result, test code for Paxos V1 was difficult to maintain and extend.
A New Key-Value Store is Conceived
To solve these problems, we designed a new system where Paxos performed a diminished role. A single multi-Paxos instance stores the set of SSDs containing key-value stores which hold the protection system’s state. Paxos acceptor and learner data is stored in the superblock living on each SSD. Each SSD provides space for the full set of protection system keys we need to store in a set of blocks dubbed a KV volume. In every quorum, we read and write to Paxos only once. The identity of N KV volumes define the mirrored Distributed Key-Value (DKV) store. Writing to the DKV is straightforward; write to all N mirrors. Reading can be done from any mirror. During quorum start, we must synchronize one of the KV volumes from the last quorum to N-1 different KV volumes in the new quorum. Simple transactionality is sufficient for this system. An error during write can result in disagreement between individual stores. We pick either the new value or the old value as the synchronization source, eliminating any inconsistency.
Our quorum system ensures that cluster operations execute with a majority of connected nodes. Connection errors cause the cluster to reform quorum where each node participates in a sequence of steps coordinated by a leader. Online operations occur in typically long-lived quorums.
Our new distributed key-value store would still rely on the Paxos protocol to store the set of drive IDs composing the current DKV store. Paxos V2 implementation was straightforward as our needs are modest; we get a single-proposer guarantee from our quorum system, and we have modest performance requirements as we only need to read and write a small amount of Paxos data once per quorum.
Next, we needed a way to identify the blocks on an SSD that would contain the key-value store. We know that the SSDs can provide the space for storing the DKV, but we would need to deal with provisioning in existing clusters later. At this stage in the project, we either created the blocks as needed (tests) or relied on block availability on newly created clusters.
The needed SSD blocks were managed by a KV volume on each SSD. The volume provided several important pieces of functionality:
- Allocation marking. At startup, the KV volumes joined some other on-SSD structures that we walk to determine what is allocated and what is free. The volume blocks are linked with tree blocks to allow fast parallel reads.
- Key mapping from volume key to block address.
- Read and write of value blocks by volume key.
- Efficient synchronization of volumes from one SSD to another.
Given a list of available blocks (again, guaranteed to exist at this point), a builder class would produce the necessary on-disk linkage and the resulting volume objects.
The KV store provides mapping of keys to values on top of the KV volume on a single SSD. Blocks in our system are 4KiB in size; we determined that protection system state values would never need to be as large as 1KiB, so we store 4 values per volume block. The key space is linear and non-sparse. Modification of values involves read-modify-write, so we built a simple volume cache to improve performance.
KV stores needed to be thread safe, so this layer has node local locking as well. Since keys are aliased through the volume value blocks, we hashed volume block ID into a fixed set of mutexes. Read and write APIs have single key and bulk variants.
Distributed KV Store and Synchronization
We created a simple dkv_store class composed of N kv_store instances (for our fault tolerance requirements, N is currently 3). Writes are forwarded in parallel to each member
kv_store via RPC, and reads are forwarded to any member store. This is the first part of the system that is exposed to users (e.g. to the protection system state components). We expose a keyspace that is a direct mapping of the components’ kv_store keys. Later, when we introduced sharding, this mapping changed.
dkv_store instance is the responsibility of a synchronization function. When we start a new quorum, synchronization performs these steps:
- Query online nodes for the list of online KV volumes.
- Read the KV volume ID set from the previous quorum from Paxos.
- Pick an online volume from the previous quorum set to be the source.
- Replicate from the source to N-1 destination volumes unused in the previous quorum.
- Store the new set of volume IDs in Paxos.
Testing of this component was crucial; we built exhaustive tests to cover a matrix of cluster size, down nodes, down disks, and partial progress errors. Since synchronization delays quorum start and hence filesystem availability, we spent some time optimizing and parallelizing this process to reduce its runtime down to a few seconds.
Up until this point, we had been dealing with new clusters only. To deploy our new system, we needed to handle existing clusters at customer sites. We wrote code to upgrade the data from the old Paxos store to the DKV. There were two phases to this: volume provisioning and data translation.
On an active cluster, SSD blocks are in-use for a variety of purposes (e.g. logging blocks, block trees, and write-back cache, for example). We can request blocks but only while the system is fully online. An agent makes blocks available by migrating blocks in a PStore from the SSD to its backing spinning disk. The first part of the upgrade starts an online background process on all nodes to provision blocks from each SSD. Recall that once a volume is built on an SSD, the blocks are linked and allocated, preventing further use for other needs. As each volume is built, the background process informs an upgrade master residing on the leader node. Once all SSDs have volumes, the master initiates a quorum restart.
In the next quorum, the second half of upgrade takes over. Noticing that all volumes are present but no
dkv_store is defined in the new Paxos, a translation process picks an arbitrary set of volumes defining a
dkv_store. The process then reads the old data from Paxos V1 and writes it into the DKV. The process then persists the DKV volume ID set in Paxos V2. As we finish the quorum start, a protection system state shim switches from using the old storage to the new storage. Future quorum starts go through the normal synchronization process outlined above.