At Qumulo we use fio, a popular open-source IO benchmarking tool, to measure the performance of our filesystem and detect regressions. It’s something of a Swiss Army knife, supporting a wide variety of IO modes and patterns; one of its handy features is a client/server mode, which can direct traffic to a storage cluster from many machines simultaneously. This is useful for simulating real-world storage workloads, such as a farm of render nodes all writing to one storage cluster, and it lets us characterize just how much throughput one of our clusters can sustain.

In order to reliably detect regressions, it’s important to have consistent measurements: unless a software change makes something faster or slower, a good benchmark should yield about the same result every time. Here are three best practices we follow to help deliver smooth, consistent results that make it a lot easier to spot significant deviations.

1. Use the fio time-based mode

The first one is pretty simple. fio has two basic ways of controlling how much work each job does: size-based and time-based. In size-based mode, each job will write (or read) a fixed amount of data, then stop. The total amount of data transferred by all jobs, divided by the time it took for the last job to finish, is your measured throughput.

This approach has a problem: not every job will finish its work at the same time. Ordinary fluctuations in process scheduling might mean that one job gets a slightly larger share of the cluster’s throughput than others, or more than its fair share of network bandwidth. This diagram shows four jobs each transferring 10 gibibytes of data:

[Figure: fio performance measurement]

Because one job took a little longer than the others to complete, our measurement interval includes a small but unpredictable duration in which only one job was running. This introduces noise to the results.
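To make the effect concrete, here is a small sketch of the size-based calculation described above. The finish times are invented for illustration and are not measurements from our lab.

```python
GIB = 2**30

def size_based_throughput(bytes_per_job, finish_times_s):
    """Throughput as size-based mode measures it:
    total bytes moved, divided by the time the slowest job took."""
    total_bytes = bytes_per_job * len(finish_times_s)
    return total_bytes / max(finish_times_s)

# Four jobs, 10 GiB each. Finish times (seconds) are hypothetical.
even = [20.0, 20.0, 20.0, 20.0]
straggler = [19.0, 19.0, 19.0, 23.0]

for label, times in (("even", even), ("straggler", straggler)):
    gib_per_s = size_based_throughput(10 * GIB, times) / GIB
    print(f"{label}: {gib_per_s:.2f} GiB/s")
# even: 2.00 GiB/s
# straggler: 1.74 GiB/s
```

The same amount of data was moved in both cases, but a single late job is enough to shift the reported number.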

The solution to this problem is the fio job parameters time_based and runtime. For example, giving time_based=1 and runtime=60s will cause all jobs to run for sixty seconds and then stop. This helps ensure that the measurement interval always has the cluster working at full load, resulting in more consistent measurements.
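As a rough sketch, a time-based job file might be generated like this. Only time_based and runtime come from the discussion above; the mount point, block size, file size, and job count are placeholder assumptions, not our actual test configuration.

```python
def render_jobfile(directory, runtime_s=60, numjobs=4):
    """Return the text of a minimal time-based fio job file."""
    lines = [
        "[global]",
        f"directory={directory}",  # hypothetical mount point of the storage under test
        "rw=write",                # placeholder workload: sequential writes
        "bs=1m",                   # placeholder block size
        "size=10g",                # per-job file size; with time_based, fio loops over it
        "time_based=1",            # keep running until runtime expires
        f"runtime={runtime_s}s",
        "",
        "[writers]",
        f"numjobs={numjobs}",      # placeholder job count
    ]
    return "\n".join(lines) + "\n"

print(render_jobfile("/mnt/cluster"))
```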

2. Introduce a barrier with exec_prerun

Jobs finishing early or late isn’t the only source of measurement jitter! There’s another, more subtle problem: each fio job does not begin doing its work immediately. First, it does some setup and housekeeping, gathering information about the files it will be working on, issuing a lot of stat and other metadata requests. This can take a bit of time, especially if there are many files in the job, and the amount of time each job takes to get ready can vary for the same reasons mentioned above. Running each job for a fixed amount of time is great, but the jobs still won’t finish at the same time if they don’t start at the same time!

The solution to this problem is a bit more involved, since fio itself does not have any mechanism for coordinating job start times. Luckily, it does live up to its name as the “Flexible I/O tester” here, and provides exactly the Swiss Army knife attachment that we need: exec_prerun, a job parameter allowing you to provide a command that will run immediately before the job begins its work.

You may be familiar with a basic multiprocessing technique known as a barrier: multiple processes can wait at the barrier; the barrier knows how many participants to expect and will not allow any process to continue until all of them are waiting.
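If you haven’t used one before, here is a minimal single-machine illustration using Python’s built-in threading.Barrier. It is only meant to show the behavior; it is not the multi-machine mechanism, and the sleep durations are arbitrary stand-ins for uneven setup time.

```python
import threading
import time

N = 4
barrier = threading.Barrier(N)
release_times = [None] * N

def worker(index):
    time.sleep(0.05 * index)   # simulate uneven setup time per worker
    barrier.wait()             # blocks until all N workers have arrived
    release_times[index] = time.monotonic()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Workers arrived up to 150 ms apart, but they are released together.
spread = max(release_times) - min(release_times)
print(f"release spread: {spread * 1000:.1f} ms")
```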

For coordination across multiple machines, we created a simple TCP barrier implementation in Python. There’s not much to it – we have a server that waits for N connections:

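import socket

def server(endpoint, N):
    # endpoint is a (host, port) tuple; N is the number of participants to wait for.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(endpoint)
    s.listen(N)
    connections = []
    for _ in range(N):
        # accept() blocks until a client connects and returns (connection, address)
        connections.append(s.accept())
    # Once we're here, we know everyone's at the party!
    for conn, _ in connections:
        conn.sendall(b'go!')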

And a client that just connects to the server and waits to be told it can go:

import socket

def client(endpoint):
    # endpoint is the (host, port) tuple of the barrier server.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(endpoint)
    # This will block until the server responds:
    s.recv(3)

The code above is simplified for clarity – in reality of course there is more error handling. Then all we have to do is put something like this in our jobfile:

exec_prerun=./barrier_client.py <barrier_server_ip>

…and now not only will our jobs run for the same amount of time, they’ll all start running at the same time. But of course, there’s more!
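For a flavor of what the extra error handling might look like, here is a hedged sketch of a slightly sturdier client. The function name wait_at_barrier, the default timeout, and the specific checks are our illustration here, not the actual contents of barrier_client.py.

```python
import socket

def wait_at_barrier(host, port, timeout_s=300.0):
    """Connect to the barrier server and block until it sends b"go!".

    Raises an exception if the connection or the wait exceeds timeout_s,
    if the server hangs up early, or if it sends something unexpected.
    """
    with socket.create_connection((host, port), timeout=timeout_s) as s:
        reply = b""
        # recv() may return fewer bytes than requested, so loop until complete.
        while len(reply) < 3:
            chunk = s.recv(3 - len(reply))
            if not chunk:
                raise ConnectionError("barrier server closed the connection early")
            reply += chunk
    if reply != b"go!":
        raise RuntimeError(f"unexpected barrier reply: {reply!r}")
```

A wrapper script could call this with the server address taken from its command-line arguments and exit nonzero on failure, so that a stuck barrier shows up as an error rather than a silent hang.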

3. Restrict client buffer to cap fsync() skid

Finally, there’s a source of variability that is specific to jobs that are writing data. By default on most operating systems (and with fio), data written to files is buffered; that is, a write() syscall will place data in a system buffer and then immediately return. The operating system will then asynchronously flush this data to storage, whether that’s a local disk or remote storage across a network.

Why is this important? Well, careless buffering in a benchmark is cheating! If fio thinks it’s written 300GB of data in 60 seconds, but 50GB of that data is still locally buffered because the client machines have a lot of RAM, it will report 5GB/s when the storage under test actually absorbed only 250GB, or about 4.2GB/s. To combat this, we use the end_fsync job parameter to ensure that every job flushes its buffers after it’s done writing.

Unfortunately, this has another side effect: the time it takes to fsync() is not subject to fio’s job timer – another source of variability! To make matters worse, Linux by default uses a percentage of the system’s available memory to decide when it will start background flushing… and not all of the machines in our lab client pool have the same amount of memory!

Solving this problem is easy enough though. We can just tell Linux exactly how much buffer to use:

sysctl -w vm.dirty_bytes=2000000000
sysctl -w vm.dirty_background_bytes=1000000000

In this case, we’ve specified that background flushing should start when the total buffer grows to 1GB, and at 2GB it should start blocking write() calls. The exact values here aren’t critical: they just need to be 1) the same on every client, 2) small enough that it’s not possible to spend inordinate amounts of time in fsync(), and 3) large enough to not bottleneck performance. These values work well for the tests we run – the clients still drive load at full speed, but the time it takes to flush at the end is far more consistent, improving test signal-to-noise ratio.
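A back-of-the-envelope way to think about point 2: the worst-case time spent in the final fsync() is roughly the amount of dirty data divided by the rate at which a client can flush it. The 1GB/s flush rate below is a made-up figure for illustration only.

```python
def worst_case_flush_s(dirty_bytes, flush_bytes_per_s):
    """Upper bound on final flush time if the buffer is full when the job ends."""
    return dirty_bytes / flush_bytes_per_s

FLUSH_RATE = 1_000_000_000  # hypothetical: one client flushes 1 GB/s to the cluster

capped = worst_case_flush_s(2_000_000_000, FLUSH_RATE)     # vm.dirty_bytes limit
uncapped = worst_case_flush_s(50_000_000_000, FLUSH_RATE)  # hypothetical 50 GB left buffered

print(f"capped: {capped:.0f} s, uncapped: {uncapped:.0f} s")
# capped: 2 s, uncapped: 50 s
```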

(NOTE: You can, of course, use direct/unbuffered IO with fio, which is another way to escape this problem. But this changes the workload characteristics in other ways, and the majority of applications use buffered IO, so we want to have tests that simulate the common case.)

Conclusion

So this is how we ensure our fio-based performance tests deliver smooth, consistent results. Having a good signal-to-noise ratio makes it much easier for us to catch and fix performance regressions — and, of course, it also helps us track performance improvements as we continue to speed up our filesystem.

