When analyzing the performance of a distributed storage system, there are many things to consider: the characteristics of your algorithms (both local and distributed), the network topology, the throughput and latency of your underlying storage media, and many more. But there’s one detail that is easy to overlook: what happens when your program needs to ask the operating system for memory.

In this post I’ll present a couple of possibly amusing anecdotes about page faults and their impact on performance.

What is a page fault?

A page fault is a kind of exception in virtual memory systems: it happens when a process tries to access a page of memory, by reading from or writing to it, and that page is not currently mapped to a physical page frame in RAM. (A fault also occurs if a process accesses memory it’s not allowed to, but that’s not the kind we’re interested in here.) Page faults are a perfectly normal part of life in most operating systems – when a process maps memory in its virtual address space, the OS often doesn’t back that memory with physical pages until it gets used. However, page faults are not free: each one adds latency to the memory access that triggered it while the operating system updates its page tables.
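If you want to see this in action on Linux, the following minimal Python sketch (standard library only) maps an anonymous region and then touches every 4 KiB page in it; the process’s minor fault counter reported by getrusage climbs roughly in step with the number of pages touched.

import mmap
import resource

PAGE = 4096

before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt

# Map 2 MiB of anonymous memory; the kernel typically defers backing it
# with physical pages until each page is first touched.
buf = mmap.mmap(-1, 2 * 1024 * 1024)

# The first write to each page triggers a minor page fault.
for offset in range(0, len(buf), PAGE):
    buf[offset] = 1

after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
print(f"minor page faults while touching the mapping: {after - before}")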

A Linux kernel performance regression

A while back, our hybrid C-series line was running version 4.8 of the Linux kernel. This wasn’t ideal: 4.8 was not an LTS kernel release and it was no longer supported in Ubuntu’s LTS enablement stack, which meant we were at risk of falling behind on the latest bugfixes and security updates. At the time we decided to try the more recent 4.13 kernel, and when we did, performance tests showed throughput dropping by as much as 25 percent.

That’s a pretty shocking regression, so we dug in. Since the only thing that had changed was the kernel, I started by looking at system metrics using the SAR utility from the sysstat Linux performance monitoring suite.

To help narrow down the search, a technique I like to employ is to run a series of performance tests against both the “good” and “bad” builds, gather several samples of every metric SAR can record, and then compare the two collections of metrics using an independent 2-sample t-test, specifically Welch’s unequal variances t-test. This test evaluates the hypothesis that two populations have equal means, and it is conveniently available in the scipy library.
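As a rough sketch of how that comparison looks in code, assuming the SAR samples for each build have already been exported to CSV with one column per metric (the file names here are just placeholders):

import pandas as pd
from scipy import stats

# One row per SAR sample, one column per metric, for each kernel build.
good = pd.read_csv("sar_good.csv")
bad = pd.read_csv("sar_bad.csv")

rows = []
for metric in good.columns:
    # equal_var=False gives Welch's unequal variances t-test.
    t, p = stats.ttest_ind(good[metric], bad[metric], equal_var=False)
    rows.append((metric, t, p))

results = pd.DataFrame(rows, columns=["metric", "t_stat", "p_value"])

# Keep only significant rows and sort ascending by t-statistic, so metrics
# that rose on the "bad" build (negative t) float to the top.
interesting = results[results.p_value < 0.05].sort_values("t_stat")
print(interesting)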

I’ll go more deeply into this technique in a future post, but the gist of it is: ingest the SAR samples into two dataframes, perform the t-test, discard all rows with a p-value higher than some threshold (I usually pick 5%), and sort the remaining ones by t-statistic (ascending order if the inputs were (good, bad) – this emphasizes results that had a negative correlation with performance, and those are often interesting). In this case, the result looked something like this:

Well, the title of this article did mention page faults. And here, it seemed that moving to 4.13 had more than doubled the page fault rate for this benchmark. So next I broke out the handy perf tool, which I used to record all mmap, munmap, and page fault events in the kernel on behalf of Qumulo’s filesystem daemon, qfsd, for thirty seconds:

$ perf record \
    -e "syscalls:sys_enter_m*map" \
    -e "exceptions:page_fault_*" \
    -p `pidof qfsd` sleep 30

The recorded events contain a lot of useful information, including the time and address at which each mapping or fault occurred, the type of fault, and so on. I used perf script and some ugly sed work to transform the data into CSV for easier analysis:

$ perf script | grep page_fault_ \
    | sed -e 's/.*\[\(.*\)\] \([^:]\+\).*address=0x\([^ ]\+\)f.*error_code=0x\(.*\)/\1,\2,\3,\4/' > faults.csv
$ perf script | grep mmap \
    | sed -e 's/.*\[\(.*\)\] \([^:]\+\): .*addr: 0x\([^,]\+\).*/\1,\2,\3/' > mmaps.csv

(I’m not proud of these regular expressions, but they got the job done!)
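From there, it’s easy to pull the CSVs into dataframes and scatter-plot fault and mmap addresses over time. Something along these lines (the column order follows the sed capture groups above, and the base address is purely hypothetical):

import pandas as pd
import matplotlib.pyplot as plt

faults = pd.read_csv("faults.csv", names=["time", "event", "addr", "error_code"])
mmaps = pd.read_csv("mmaps.csv", names=["time", "event", "addr"])

# Addresses were captured as bare hex strings; convert them to integers.
for frame in (faults, mmaps):
    frame["addr"] = frame["addr"].apply(lambda a: int(str(a), 16))

# Zoom in on a single 2 MiB cache region (hypothetical base address).
base = 0x7f6000000000
size = 2 * 1024 * 1024
f = faults[(faults.addr >= base) & (faults.addr < base + size)]
m = mmaps[(mmaps.addr >= base) & (mmaps.addr < base + size)]

plt.scatter(f["time"], f["addr"] - base, s=4, label="page faults")
plt.scatter(m["time"], m["addr"] - base, s=30, label="mmap calls")
plt.xlabel("time (s)")
plt.ylabel("offset within the 2 MiB region (bytes)")
plt.legend()
plt.show()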

From the addresses, I quickly determined that most of the new faults were occurring in the region of memory qfsd uses for its block data cache. We map memory in this cache in 2 MiB chunks, but use it in 4 KiB blocks. Below is a visualization of the mmap calls and page faults around one such 2 MiB region when using Linux 4.13:

Note the lonesome orange dot in the lower left, followed by a long, long string of page faults (blue dots) in the 2 MiB region above that base address. (Also note the order in which the pages tend to be accessed – backwards in the address space! This slightly surprising detail becomes important later…)

I ran the same measurements against our kernel 4.8 build, and the results couldn’t have been more different: there, just one or two page faults followed each mmap!

By now, those of you who are much more familiar with Linux virtual memory configuration than I was are probably screaming the answer in your minds, or maybe out loud. But me, I had to go and learn the hard way. I at least had the sense to suspect that the answer might be found in the kernel build configuration, so I grabbed the kconfigs for both kernels and started bisecting. (We use the kernel as distributed by Ubuntu, so for the most part we take the default configurations. Very convenient, but not so great when something gets slower and you don’t know why!)

After a few bisection iterations, the list of remaining configuration differences was small enough that I just read it over carefully, and one setting jumped out at me. This was the diff:

- TRANSPARENT_HUGEPAGE_ALWAYS=y
+ TRANSPARENT_HUGEPAGE_MADVISE=y

Transparent hugepages (THP) are a feature of the Linux memory system that, when enabled for a memory region, cause the kernel to back memory mappings with “huge” 2 MiB pages instead of the usual 4 KiB pages. This configuration change meant that THP went from being applied to every mapping by default to being applied only to mappings that explicitly opt in via madvise, and it was made to the default configuration for Ubuntu’s Xenial and Artful distributions. See this Launchpad issue for more information about why the change was made.

Coincidentally, we mmap 2 MiB at a time in our data cache, which is why we previously took only about one page fault per chunk. Mystery solved! Our access patterns really benefit from the smaller page tables and fewer faults that THP enables, so we simply enabled it for our process again, and most of the performance regression disappeared.
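For the curious: under TRANSPARENT_HUGEPAGE_MADVISE, a process can still opt individual mappings back in with madvise(MADV_HUGEPAGE). Here’s a tiny illustration of the call in Python 3.8+ on Linux, just to show the mechanism (not how qfsd actually manages its cache):

import mmap

CHUNK = 2 * 1024 * 1024  # we mmap the cache in 2 MiB chunks

# Anonymous 2 MiB mapping, explicitly opted in to transparent hugepages.
buf = mmap.mmap(-1, CHUNK)
buf.madvise(mmap.MADV_HUGEPAGE)

# With THP, the first touch can be backed by a single 2 MiB page, so the
# whole chunk typically costs one fault instead of up to 512.
buf[0] = 1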

Page faults II: Electric Boogaloo

Fast forward to late 2018. My team was working on optimizing the performance of very heavy random read workloads (you may have seen recent blog posts by two of my colleagues, Matt McMullan and Graham Ellis, about the same project). We’d made a lot of progress by improving our distributed lock caching system, tuning the task scheduler, making prefetch smarter, and reducing spinlock contention in various places.

However, we noticed some odd variability in our benchmark: some runs were just a little faster than others, for no immediately obvious reason, and the results clustered into two distinct modes. A look at the SAR data showed that page fault rates were again strongly correlated with which mode a run fell into. We found a few things that could smooth out how memory was accessed here, including ensuring that more of the block cache was initially mapped – but this blog post is already getting a bit long, so maybe I’ll go into more detail on those at a later date!

But remember that plot earlier, showing the backwards access pattern? We remembered it too, and dug a little deeper. It turned out to be a side effect of the way our cache vends slots, so we tried reversing that order, so that tasks requesting cache slots would tend to get them in ascending address order. Somewhat to our surprise, this slightly improved performance! (We’re still not entirely sure why; our hunch is that Linux is able to pre-map pages or apply some other optimization when it sees sequential access.)
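Conceptually, the fix was just a change to the order in which free slots are handed out. Something like this hypothetical sketch, where the free list becomes a min-heap keyed on address so the lowest free address is vended first:

import heapq

class SlotAllocator:
    """Hypothetical sketch: vend cache slots in ascending address order."""

    def __init__(self, base, slot_size, count):
        # Min-heap of free slot addresses, so acquire() returns the lowest.
        self._free = list(range(base, base + slot_size * count, slot_size))
        heapq.heapify(self._free)

    def acquire(self):
        return heapq.heappop(self._free)

    def release(self, addr):
        heapq.heappush(self._free, addr)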

This chart shows the relative improvements we made to the random read benchmark over the course of ten weeks or so:

That last bump was all about avoiding page faults! Just goes to show, it’s often important to know what your operating system is doing under the hood.
