
Non-Uniform Memory Access (NUMA) Performance is a MESI Situation


“Nu mă, nu mă iei”
– Dan Mihai Bălan

As anyone who’s administered a Linux file system before might know, upgrading to a new version of the Linux kernel is usually not too difficult, but it can sometimes have surprising performance impacts. This is a story of one of those times.

Qumulo’s file system software ships on top of a pretty standard Ubuntu distribution. We periodically update the distro and the underlying Linux kernel to stay on long-term supported versions, so we can continue to stay abreast of the latest security updates and bug fixes, as well as to support newer storage devices.

We recently updated all our platforms to use Linux 5.4 – previously, some had been on 4.4 and some on 4.15. For the most part, everything went smoothly. But in performance testing with the new kernel, we noticed something odd: throughput for our 4U Dual Intel CPU Qumulo systems (known to our customers as the Qumulo QC104, QC208, QC260, and QC360 systems) had dropped – a lot. In one test with a large number of write streams, throughput went from ~4.5 GB/s, down to about 3.2 GB/s – nearly a 30% drop!

These systems are built around two Intel Haswell-generation Xeon processors. Many of our customers have large and active deployments that they depend on to manage their data – and we are consistently working to make our platforms faster over time, not slower!

So it was time to dig in, figure out what had made our software run slower, and fix it.

Monitoring, troubleshooting, and diagnosing Linux performance issues

When diagnosing any kind of performance problem, we usually start by looking at performance counters, latency measurements, and other metrics generated by instrumentation within our filesystem. Performance monitoring tools like these let us break down where the system is spending its time and diagnose the source of an issue more accurately. In this case, the metrics told a clear story: disk I/O was normal, CPU usage was normal, but a lot more time was being spent on networking.

This prompted us to look closer for anything network-related that might have changed significantly as part of the upgrade. Fortunately, we were spared the chore of digging through the kernel code, since we found one prime suspect right away: in addition to upgrading to Linux 5.4, we’d changed Ethernet drivers. Formerly we’d been using the Mellanox OFED driver for our NICs; now we were using the in-box version included with the kernel.

The details of the driver’s code turned out not to be important either, as the real cause of the performance degradation was a small configuration change: OFED includes a script that automatically affinitizes networking interrupts to the closest CPU; the in-box driver does not. Reintroducing the script, or just setting the affinities manually, immediately brought all the throughput back.
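
To give a feel for what such a script does, here is a minimal Python sketch that pins a NIC’s interrupts to the CPUs of the NIC’s local NUMA node. The interface name is hypothetical, it assumes an MSI-X capable NIC and root privileges, and a real script like OFED’s would typically spread individual queues across those CPUs rather than giving every IRQ the same mask:

import os

IFACE = "eth0"  # hypothetical interface name; substitute your NIC

# Which NUMA node is the NIC attached to? (-1 means the kernel doesn't know)
with open(f"/sys/class/net/{IFACE}/device/numa_node") as f:
    node = int(f.read().strip())

# Which CPUs are local to that node? (a list like "0-5,12-17")
with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
    local_cpus = f.read().strip()

# For an MSI-X capable NIC, its interrupt numbers appear as entries here.
irqs = os.listdir(f"/sys/class/net/{IFACE}/device/msi_irqs")

for irq in irqs:
    # Restrict delivery of this interrupt to the NIC's local CPUs.
    with open(f"/proc/irq/{irq}/smp_affinity_list", "w") as f:
        f.write(local_cpus)
    print(f"IRQ {irq} -> CPUs {local_cpus} (NUMA node {node})")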

So, we had our answer, and with a small tweak we could confidently ship the new distribution with kernel 5.4 to our customers.

Troubleshooting a NUMA-related performance bottleneck

We aren’t satisfied with merely being able to fix problems. We want to understand their underlying causes. And in this case, something seemed odd. In a NUMA system (non-uniform memory access system) it’s usually better to have interrupts locally affinitized, but at the levels of throughput under consideration (only a couple GB/s on any given node) it didn’t make sense that communication between the CPUs could be the bottleneck.

The diagram below shows a simplified picture of the architecture. It has two Xeon E5-2620 v3 CPUs, 128 GB of RAM, and a bunch of disks:

[Diagram: simplified architecture of a dual-socket node, showing two Xeon E5-2620 v3 CPUs linked by QPI, each with its own memory and attached devices]

Note the links between the two CPUs. These are QuickPath Interconnect (QPI) channels, which are used whenever one CPU needs data that is only available to the other CPU – for instance, if CPU 1 needs to process data that was received from the network, the data will have to cross QPI.

What is QuickPath Interconnect?

QuickPath Interconnect is a data connection between a CPU and other motherboard resources (such as an IO hub or other CPUs) in some Intel microarchitectures, first introduced in 2008. Its aim is to provide extremely high bandwidth to enable high on-board scalability – after all, there’s no point in putting more CPUs on one motherboard if they can’t make full use of system resources. (It was superseded in 2017 by a new version, called UltraPath Interconnect, with the release of the Skylake microarchitecture.)

The E5-2620 v3 has two 16-bit QuickPath Interconnect channels clocked at 4 GHz. Each channel transfers data on both rising and falling clock edges, resulting in 8 gigatransfers per second (GT/s), or 16 GB/s of bandwidth in each direction. So, with two of them, we should get close to 32 GB/s before this link becomes a bottleneck – more than enough to handle the relatively modest requirements of the NIC and storage devices!
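
The arithmetic behind those numbers is simple enough to spell out – it’s just the spec-sheet figures multiplied together:

# Back-of-the-envelope QPI bandwidth for the E5-2620 v3, from the figures above.
clock_ghz = 4.0           # QPI clock
transfers_per_clock = 2   # data on both rising and falling edges
gigatransfers = clock_ghz * transfers_per_clock   # 8.0 GT/s
bytes_per_transfer = 2    # 16-bit payload per transfer
links = 2                 # the E5-2620 v3 has two QPI links

per_link = gigatransfers * bytes_per_transfer   # 16 GB/s per direction, per link
total = per_link * links                        # 32 GB/s per direction in total
print(f"{per_link:.0f} GB/s per link, {total:.0f} GB/s with both links")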

However, clearly, we were experiencing a bottleneck, and it went away when we took steps to avoid cross-CPU communication. So what was going on?

Let’s take a look at what needs to happen when a Qumulo node processes a request to read data, say using the NFS protocol. The below diagram shows a simplified version of the data flow:

[Diagram: simplified data flow for a read request – remote data arriving over the network, local data read from disk, and both assembled into a response sent back to the client]

  1. Some data will have to be fetched from other nodes (the blue arrow). This data arrives as a series of TCP segments at the NIC, which are then offloaded via DMA to ring buffers in the kernel, and from there to the Qumulo filesystem’s internal page buffer.
  2. Some data will be fetched from this node (purple arrow). This is read from disk and copied into the page buffer.
  3. Once the local data and remote data have been gathered, a protocol worker assembles it all into a response, which is then put into transmit buffers, from which it will be DMA’d to the NIC and go out over the network to the requesting client.

Getting to the center of the onion

A key insight here is that each of the arrows in the above flow diagram (except the ones exiting the NIC) represents a point where it would be possible for data to cross the QPI link, sometimes more than once.

For example, recall from the architecture diagram that there are storage devices connected to both CPUs. If CPU 0 reads 1GB of data from a disk connected to CPU 1, and then copies it into a region of page buffer mapped to memory attached to CPU 1, that data will cross the link twice. The protocol worker that processes the data might run on CPU 0, requiring the same data to cross the link again, and so on.
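
Counting the crossings for a deliberately unlucky (and entirely made-up) placement shows how quickly this adds up. The steps below mirror the example above rather than any measured trace:

# Toy model: QPI crossings for 1 GB of data under an unlucky placement.
# Purely illustrative; the steps echo the example in the text.
gb_served = 1

crossings = [
    ("CPU 0 reads from a disk attached to CPU 1",        1),
    ("copy lands in page buffer memory on CPU 1",        1),
    ("protocol worker on CPU 0 touches the data again",  1),
]

total_gb_on_qpi = gb_served * sum(n for _, n in crossings)
print(f"{gb_served} GB served -> {total_gb_on_qpi} GB across QPI "
      f"({total_gb_on_qpi / gb_served:.0f}x amplification)")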

So there’s an “amplification effect” in play – even though the node might be serving data at only 2 GB/s, there could be several times that much traffic hitting the QuickPath Interconnect, due to the same data bouncing back and forth, like a game of data tennis:

[Diagram: the same data crossing the QPI link multiple times as it bounces between the two CPUs]

Identifying the real culprit of the NUMA-related performance bottleneck

But wait, I hear you say! Even in the most pessimal of scenarios, this amplification couldn’t turn 2 GB/s into 32 GB/s, there just aren’t enough edges crossing the NUMA boundary in that graph!

That’s true – we seemed to be facing a bottleneck far below the rated speed of the link. Fortunately, the Intel Memory Latency Checker (also known as Intel MLC) can measure the system’s real performance directly, so we ran it, and it confirmed our suspicions:

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0            1
       0       46242.9       6291.7
       1        6276.3      46268.6

CPU 0 could access its directly connected RAM at ~46 GB/s, and the same for CPU 1 – but the moment either of them wanted to access memory connected to the other CPU, a measly 6 GB/s was the best they could do.

At this point, if you’re very familiar with Intel’s Haswell architecture, you might already know what’s going on. We weren’t especially, so we resorted to Googling the symptoms, and that’s what led us to the correct answer, in an Intel community thread. Simply go into the BIOS, change the “snoop mode” from “early snoop” to “home snoop,” and the bottleneck vanishes!

So, what the heck is an early snoop? Unfortunately, early snoop has nothing to do with either a cartoon beagle or a certain American rapper getting his morning cup of coffee. Instead, we’ll need to talk about one of the two hardest problems in computer science: cache coherence. (The other two are naming things, and off-by-one errors.) Things are about to get MESI. Specifically, snooping is part of the MESI protocol, and by extension the MESIF variant used by Intel processors.

The MESI Cache Coherence Protocol

MESI is a common protocol for enforcing cache coherence, i.e., ensuring that all the different caches in the system have a consistent view of the contents of memory. MESI works by assigning one of four states to every line in every cache, which determines how that cache line can be used (a toy model of these transitions follows the list below):

  • Modified: the line has been changed, and no longer matches what is in main memory.
  • Exclusive: the line matches main memory, and is only present in this cache.
  • Shared: the line is unmodified, but may be present in other caches.
  • Invalid: the line is unused or has been invalidated (e.g., by another cache’s copy becoming modified).
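
Here is the toy model promised above: two caches, one line, and the four states. It’s a sketch for intuition only (real hardware has write-backs, bus arbitration, and many more corner cases), not a description of any particular CPU:

# Toy MESI model: one cache line, two caches. Write-backs to memory are
# implied rather than modeled. Illustrative only.
M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"

class Line:
    def __init__(self):
        self.state = I

def local_read(me, other):
    """A read by this cache; may need to snoop the other cache."""
    if me.state == I:
        if other.state in (M, E, S):
            other.state = S   # other copy drops to Shared (M writes back first)
            me.state = S
        else:
            me.state = E      # no other copy: we hold it Exclusive
    # M, E, S: read hit, no state change

def local_write(me, other):
    """A write by this cache; any other copy must be invalidated."""
    if other.state != I:
        other.state = I       # invalidation message over the interconnect
    me.state = M

cpu0, cpu1 = Line(), Line()
local_read(cpu0, cpu1);  print(cpu0.state, cpu1.state)   # Exclusive Invalid
local_read(cpu1, cpu0);  print(cpu0.state, cpu1.state)   # Shared Shared
local_write(cpu1, cpu0); print(cpu0.state, cpu1.state)   # Invalid Modified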

MESI vs. MESIF

MESIF is an extension of MESI that was developed by Intel. It adds a fifth state, F for “forward.” Forward is similar to Shared, but only one cache in the system may have a line in the Forward state. This is mostly an optimization to avoid excess replies when a cache line is requested from the system. Instead of every cache that holds a copy of the line responding, only the cache that holds the line in the F state will respond.
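
One way to see the point of Forward: on a read request, at most one cache answers. A small sketch of that selection rule (the function and cache names here are ours, purely illustrative):

# With plain MESI, a line held as Shared in several caches invites multiple
# (or zero) responders; MESIF marks exactly one copy as Forward to reply.
def responder(states):
    """states: cache name -> MESIF state for one line. Who supplies the data?"""
    for name, state in states.items():
        if state in ("Modified", "Exclusive", "Forward"):
            return name       # the single cache allowed to reply
    return "memory"           # only Shared/Invalid copies: memory answers

print(responder({"cpu0_llc": "Shared", "cpu1_llc": "Forward"}))  # cpu1_llc
print(responder({"cpu0_llc": "Shared", "cpu1_llc": "Shared"}))   # memory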

In both MESI and MESIF, the various caches are kept coherent by notifications across the bus when important changes happen – for example, if a line is written in one cache, any other cache with a copy needs to have that copy invalidated.

Early Snoop vs. Home Snoop

The reason this consideration is critical for performance has to do with the layout of caches in Intel’s Haswell architecture. The shared, last-level cache (LLC) on each package is divided into a number of slices, one per core, connected to a very high-bandwidth on-die ring. Each cache slice has its own cache “agent.” There is also a “home agent” for each memory controller:

[Diagram: two CPUs, each with per-core LLC slices and their cache agents, plus a home agent on each memory controller]

In “early snoop” mode (shown above, with two CPUs), when a cache miss or cache coherency event occurs, the initiating cache agent will broadcast a message to all other cache agents in the system. This is intended to reduce access latency by reducing the time required to settle the state of a cache line, but with all the cache agents on the remote CPU replying across QuickPath Interconnect, the coherency chatter can significantly reduce available cross-node memory bandwidth. Apparently, with the Haswell-EP E5-2620 v3, it’s enough to lose 75% of your bandwidth.

By contrast, in “home snoop” mode, messages are handled first by the home agents on each memory controller, and then delegated to LLC agents as needed. The extra hop adds a small amount of latency, but with the benefit of greatly increased throughput. Note how there are far fewer messages being sent across QuickPath Interconnect:

[Diagram: home snoop mode, with coherency messages routed through the home agents and far fewer messages crossing QuickPath Interconnect]
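
A crude way to see the difference is to count the cross-socket messages per last-level cache miss. The numbers below are a toy model, not the actual Haswell protocol flows:

# Toy count of QPI messages for one LLC miss that involves the remote socket.
# Illustrative only; real flows depend on the address's home and its sharers.
remote_llc_slices = 6   # one cache slice (and agent) per core on the E5-2620 v3

# Early snoop: the requester broadcasts to every remote cache agent,
# and each of them sends a response back across QPI.
early_snoop_msgs = remote_llc_slices * 2

# Home snoop: one request to the remote home agent, which snoops its own
# socket locally and returns a single response with the data.
home_snoop_msgs = 1 + 1

print(f"early snoop: ~{early_snoop_msgs} QPI messages per miss")
print(f"home snoop:  ~{home_snoop_msgs} QPI messages per miss")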

See this post for a deeper explanation of NUMA cache coherency.

So, how much better is home snoop?

With the snoop mode changed on all the machines in our test cluster, Memory Latency Checker showed dramatically improved throughput between the CPUs:

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0            1
       0       45139.0      25323.8
       1       25336.2      45021.7

But better still, alleviating this bottleneck also significantly improved the performance of these systems – not only did it eliminate the 30% regression observed when the interrupt affinity was lost, it added another 30% of extra throughput on top:

Test (4 nodes, QC208)    Baseline (early snoop)    Home snoop    Change
Write throughput         4400 MB/s                 6000 MB/s     +36%
Read throughput          7650 MB/s                 9550 MB/s     +29%

I remember first troubleshooting performance issues on this platform five or six years ago, very early in my career at Qumulo – and I have a dim recollection that we experimented with the snoop mode way back then. At the time, it didn’t make much difference. But over the years, as we have continued to make performance improvements by removing software bottlenecks, the performance of the underlying hardware platform became the limiting factor, so cranking up the QuickPath Interconnect throughput limit became a huge win.

So, in the very next release of Qumulo Core, we added code to flip this setting in the BIOS for all affected models, so that all our customers with existing deployments would benefit from greater throughput capacity.

There’s a lot more great work being done at Qumulo to improve our filesystem’s performance. Most of it is much harder (and even more interesting) than finding a hidden “go fast” switch, so keep watching this space!

Interested in learning more about engineering at Qumulo? See more posts written by Qumulo engineers here. Alternatively, have a look at the Qumulo Data Platform – Software Architecture Overview (PDF).

