Qumulo’s File Storage System

Hot/Cold Tiering for Read/Write Optimization

The Scalable Block Store (SBS) includes built-in tiering of hot and cold data to optimize read/write performance.

When running on-premises, Qumulo takes advantage of the speed of solid-state drives (SSDs) and the cost-effectiveness of hard disk drives (HDDs). On each node, SSDs are paired with commodity HDDs; each such pair is called a virtual disk, and there is a virtual disk for every HDD in the system. Data is always written to the SSDs first.
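As a rough mental model, a virtual disk can be pictured as an SSD/HDD pair that accepts all new writes on its SSD. The following is an illustrative sketch only; the class and field names are hypothetical, not Qumulo’s implementation.

```python
# Illustrative sketch of the SSD/HDD pairing described above.
# Class and field names are hypothetical, not Qumulo's implementation.
from dataclasses import dataclass

@dataclass
class VirtualDisk:
    """One SSD/HDD pair; there is one virtual disk per HDD in the system."""
    ssd_capacity_bytes: int
    hdd_capacity_bytes: int
    ssd_used_bytes: int = 0
    hdd_used_bytes: int = 0

    def write(self, num_bytes: int) -> None:
        # New data always lands on the SSD first; it only reaches the HDD
        # later, when it is expired (see "Expiring data" below).
        self.ssd_used_bytes += num_bytes
```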

Because reads typically access recently written data, the SSDs also act as a cache. When the SSDs are approximately 80 percent full, less frequently accessed data is pushed down to the HDDs. The HDDs provide capacity and sequential reads/writes of large amounts of data.

When running in the cloud, Qumulo optimizes the use of block storage resources by matching high-performance block storage with cost-effective, lower-performance block storage. Let’s look at the following aspects of SBS’s hot/cold tiering:

  • How and where data is written
  • Where metadata is written
  • How data is expired
  • How data is cached and read

The initial write

To write to a cluster, a client sends some data to a node. That node picks a pstore (or multiple pstores) where that data will go – in terms of hardware, it always writes to the SSDs, or to low-latency block storage if using cloud resources. (Recall that we use SSD to mean both on-premises SSDs and low-latency block storage in the public cloud; the behavior is similar.)

These SSDs will be on multiple nodes. All writes occur on SSDs; SBS never writes directly to the HDD. Even if an SSD is full, the system makes space for the new data by purging previously cached data.
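A minimal sketch of this write path follows, assuming hypothetical node, cluster, and SSD objects; the method names are placeholders for illustration, not Qumulo’s API.

```python
# Hypothetical sketch of the initial write flow described above;
# choose_pstores, ssds_backing, and purge_cached_blocks are placeholders.
def write_to_cluster(data: bytes, node, cluster) -> None:
    # The receiving node chooses one or more pstores for the incoming data.
    for pstore in node.choose_pstores(data):
        # The SSDs backing a pstore live on multiple nodes.
        for ssd in cluster.ssds_backing(pstore):
            if ssd.free_bytes() < len(data):
                # SBS never writes directly to HDD. If an SSD is full,
                # it makes room by purging previously cached data instead.
                ssd.purge_cached_blocks(len(data))
            ssd.write(data)
```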

Handling metadata

Generally, metadata stays on the SSD. Data is typically written to a bstore at the lowest available address, so data grows from the beginning of the bstore toward the end. Metadata starts at the end of the bstore and grows toward the beginning. This means all the metadata is to the right of the data. Here is an illustration of where metadata sits on a bstore.

[Illustration: data grows from the beginning of the bstore; metadata grows from the end]

Qumulo allocates up to 1 percent of each bstore on the SSD to metadata and never expires it; nothing in that 1 percent moves to the HDD. If metadata ever grows past that 1 percent, the excess can be expired, but a typical workload has only about 0.1 percent metadata. The reserved space isn’t wasted if there isn’t enough metadata to fill it, because data can use that space as well.
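One way to picture this two-ended layout is the simplified sketch below; the block-level details and names are assumptions for illustration, not Qumulo’s on-disk format.

```python
# Simplified sketch of a bstore's two-ended allocation: data grows up from
# block 0, metadata grows down from the last block. Not Qumulo's actual format.
class BStore:
    def __init__(self, size_blocks: int):
        self.size_blocks = size_blocks
        self.next_data_block = 0                  # data grows from the beginning
        self.next_meta_block = size_blocks - 1    # metadata grows from the end
        # Roughly 1 percent of the bstore is reserved for metadata on the SSD
        # and never expired; data may still use this space if metadata doesn't.
        self.metadata_reserve_blocks = size_blocks // 100

    def alloc_data_block(self) -> int:
        if self.next_data_block > self.next_meta_block:
            raise RuntimeError("bstore is full")
        block = self.next_data_block
        self.next_data_block += 1
        return block

    def alloc_metadata_block(self) -> int:
        if self.next_meta_block < self.next_data_block:
            raise RuntimeError("bstore is full")
        block = self.next_meta_block
        self.next_meta_block -= 1
        return block
```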

Expiring data

At some point, the system needs more space on the SSD, so some data is expired, that is, moved from the SSD to the HDD. The data is copied to the HDD and, once it’s there, deleted from the SSD. Expiration starts when an SSD is at least 80 percent full and stops when it drops back below 80 percent. The 80 percent threshold is a heuristic that optimizes performance: writes are faster when the SSDs are below 80 percent full and no expirations are running at the same time. When data moves from an SSD to HDD, SBS writes it sequentially, in bursts of large, contiguous bytes, which is the most efficient way to write to an HDD.
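In pseudocode form, the expiration loop might look like the following sketch. Only the 80 percent threshold comes from the description above; the object and method names are illustrative.

```python
# Sketch of the expiration heuristic: start expiring at ~80% SSD utilization,
# stop once utilization drops back below it. Names are illustrative only.
EXPIRE_THRESHOLD = 0.80

def maybe_expire(ssd, hdd) -> None:
    while ssd.used_fraction() >= EXPIRE_THRESHOLD:
        # Pick the least recently accessed data still on the SSD...
        blocks = ssd.coldest_blocks()
        # ...copy it to the HDD as one large, contiguous sequential write,
        # the most efficient way to write to spinning disk...
        hdd.write_sequential(blocks)
        # ...and delete the SSD copy only after the HDD copy exists.
        ssd.delete(blocks)
```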

Caching data

The following illustration shows all the Qumulo caches. Everything in green is a place that can hold data, and it can be on SSD or HDD.

Qumulo I/O operations use three different types of caches. The client always has some cache on its side, and there are two types of caches on the nodes. One is the transaction cache, which can be thought of as holding the file system data the client is directly requesting. The other is the disk cache, which holds blocks from the disk in memory.

As an example, assume that a client connected to node 1 initiates a read of file X. Node 1 discovers that the blocks are allocated on node 2, so it asks node 2 for the data, which in this example is stored on one of node 2’s SSDs. Node 2 reads the data, puts it into the disk cache associated with that SSD, and sends it to node 1. The data then goes into node 1’s transaction cache, and node 1 notifies the client that the data for file X is available.

The disk cache is populated on the node the disk is attached to; the transaction cache is populated on the node the client is attached to. The disk cache always holds raw blocks, while the transaction cache holds data from the actual files. The two caches share memory, although there’s no specific amount allocated to either one.
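Putting the read example together, here is a hedged sketch of who populates which cache; the objects and method names are assumptions for illustration, not Qumulo’s internals.

```python
# Illustrative read path for the file X example above. The disk cache is
# populated on the node that owns the disk; the transaction cache is
# populated on the node the client is attached to. Names are hypothetical.
def read_file(client, node1, file_x):
    # Node 1 discovers which node holds the requested blocks (node 2 here).
    owner_node = node1.locate_blocks(file_x)

    # The owning node reads from its SSD and fills its disk cache,
    # which holds raw blocks in memory.
    blocks = owner_node.ssd.read(file_x.block_addresses)
    owner_node.disk_cache.insert(file_x.block_addresses, blocks)

    # The reply lands in node 1's transaction cache, which holds file data
    # the client directly requested, and the client is notified.
    node1.transaction_cache.insert(file_x, blocks)
    client.notify_data_ready(file_x, blocks)
    return blocks
```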

Want to learn more?

Give us 10 minutes of your time, and we’ll show you how to rethink data storage.