Today I’m going to talk about blinking red and green lights and how they led us on a lengthy performance investigation that ended deep inside the PCI Express drivers in the Linux kernel.

(Also, if you haven’t already seen it, check out the second part of the Road to All-Flash blog series, by the folks who built the platform I’m discussing in this post.)

The ballad of the blinking lights.

About a year ago I was working on our new NVMe all-flash platform and an odd problem came up.

When an SSD failed, we couldn’t turn on the little red LED on the drive to indicate which one needed to be replaced. This is because an NVMe drive is connected (more or less) directly to the CPU through PCI-Express, as opposed to SATA or SAS drives, which connect to a separate controller that knows how to do things like turn lights on and off.

This may sound trivial, but having visual feedback in the data center is important for storage administrators who have to go and actually replace the failed drive (and not replace the wrong one). Best case scenario here, the drive is so dead the LED is just off. Worst case scenario, the storage admin is faced with a perilous Mission: Impossible choice. (“Cut the red wire!” “What red wire? There are twenty-four wires, and they’re all green!”)

Fortunately, this problem has a solution, and that solution is called Intel Volume Management Device (VMD). VMD is a feature in the root complex on certain Xeon processors that can act as a delegate for NVMe events, routing them to software drivers. In addition to knowing how to turn lights on and off, it also supports more reliable hot plugging – win-win!

Well, not quite.

The gaming PC guys were right. LEDs do impact performance.

When we enabled the use of VMD, things got slower. A lot slower. Our throughput benchmarks showed regressions of 50 percent or worse – one of the worst-hit workloads had previously achieved around 15 GB/s, but now struggled to reach 6 GB/s.

Initially, we were worried that something about the way VMD works was fundamentally limiting the throughput we could get from the SSDs. VMD does act as a sort of middleman, and one of its effects is aliasing a number of storage devices behind a shared, limited set of interrupt vectors. Without VMD, every drive has its own interrupt vectors that it doesn’t have to share. We suspected that contention on these interrupt resources was what was slowing us down.

As it turned out, we were almost right.

While we dug into the perf data ourselves, we also contacted some of the very smart folks at Intel to help us debug the issue. Their assistance proved invaluable in identifying the true culprit in this mystery.

Averages can be misleading, and other obvious facts

One of the first things we looked at was average I/O request latency for the drives in both configurations – VMD off and VMD on. To our surprise, there was not that big a difference in average latency. It was measurably higher with VMD on, but only a little. Graphs like this (from data captured during a write test) were typical:

An extra 10-15 microseconds per request isn’t great, but it’s not enough to explain 50-60% throughput losses, even if we were totally latency bound.

Meanwhile, the Intel engineers were scrutinizing their driver code. They knew about a couple of minor issues that had already been fixed in kernel versions newer than ours, so they provided us with patches; we built a custom kernel module and were off to the races. Unfortunately, these fixes improved performance only slightly.

They found another issue as well: the VMD driver was not properly honoring the desired CPU affinity of devices when assigning interrupt vectors. The patch to address this also added a driver option – max_vec, which caps the number of interrupt vectors the VMD will attempt to allocate for each device connected to it. The default value had previously been 4.
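If the patched driver is built as a loadable module, a value like this can be pinned with a standard modprobe option. (The file name below is just a convention we're assuming; if the driver is built into the kernel instead, the usual equivalent is vmd.max_vec=2 on the kernel command line.)

```
# /etc/modprobe.d/vmd.conf
options vmd max_vec=2
```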

Another patch, another round of rebuilding the driver, and another set of tests – and, much to our satisfaction, the performance was considerably better. But there was something peculiar as well. As we tried various values for max_vec, we found that performance went strictly down as the value was increased:

Test throughput vs. max_vec

max_vec    write         read
2          8,080 MB/s    15,460 MB/s
4          5,540 MB/s    13,670 MB/s
8          4,540 MB/s    13,430 MB/s

This was unexpected. Eventually, we decided to revisit the data. Clearly there was something we were missing. I started running through the iostat data from a bevy of performance tests, and soon discovered the missing piece: the drives weren’t all slightly slower. Exactly one drive was a lot slower:

When I showed an Intel engineer this plot, he had one of those “eureka” moments. The problem wasn’t in VMD after all – it was in a driver for a completely separate PCIe device, a management module embedded in a Microsemi PCIe switch.

Remember how a VMD acts as a sort of middleman, and manages its connected devices through a shared set of interrupt vectors? When the VMD receives an interrupt on one of those vectors, it doesn’t necessarily know which device is the real target. So it must actually invoke the interrupt handlers for all the devices sharing that vector. If one of those interrupt handlers were slower than the others, the rest would simply be forced to wait.

That’s exactly what was happening. The reason that increasing max_vec beyond 2 made things dramatically worse was that assigning more interrupt vectors to each device increased the probability that one (or more!) of the SSDs would end up sharing a vector with the Microsemi switch. Furthermore, because a single write operation in Qumulo’s filesystem will be erasure coded across multiple storage units for data protection, if just one disk involved in a write is slow, the entire write will be slow.

Here’s a condensed version of the offending interrupt handler, found in drivers/pci/switch/switchtec.c in the Linux kernel source:

static irqreturn_t switchtec_event_isr(int irq, void *dev)
{
    struct switchtec_dev *stdev = dev;
    u32 reg;
    /* … */

    reg = ioread32(&stdev->mmio_part_cfg->mrpc_comp_hdr);
    if (reg & SWITCHTEC_EVENT_OCCURRED) {
        /* … */
        iowrite32(reg, &stdev->mmio_part_cfg->mrpc_comp_hdr);
    }

    /* … */
}

Check out those calls to ioread32 and iowrite32, targeting a memory-mapped I/O address on the switch device itself. As part of handling an interrupt, this driver does actual I/O across the PCIe bus (!). An MMIO read is a non-posted transaction: the CPU issues it and then stalls until the device sends the data back, which can take microseconds when the request has to traverse a PCIe switch.

If there were to be just one commandment of writing device drivers, a strong contender would be “Thou shalt not do more work than is absolutely necessary in an interrupt handler.” Maybe waiting for I/O isn’t a big deal for this device, but it becomes a big deal when it ends up sharing an interrupt vector with something extremely latency-sensitive!
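The kernel’s usual remedy is to split the handler: a minimal “hard” handler that defers the expensive work to a threaded handler registered via request_threaded_irq(). A rough sketch of that pattern – this is not the actual switchtec fix, just the general shape:

```c
/* Sketch of the threaded-IRQ pattern; not the actual switchtec code.
 * (A real shared-line driver would also want a cheap way for the hard
 * handler to check whether its device actually fired.) */
static irqreturn_t event_hardirq(int irq, void *dev)
{
    /* Nothing slow here; hand off to the thread. */
    return IRQ_WAKE_THREAD;
}

static irqreturn_t event_thread(int irq, void *dev)
{
    struct switchtec_dev *stdev = dev;
    u32 reg;

    /* The slow MMIO round-trips now happen in a kernel thread,
     * off the shared vector's critical path. */
    reg = ioread32(&stdev->mmio_part_cfg->mrpc_comp_hdr);
    if (reg & SWITCHTEC_EVENT_OCCURRED)
        iowrite32(reg, &stdev->mmio_part_cfg->mrpc_comp_hdr);

    return IRQ_HANDLED;
}

/* Registered with:
 * request_threaded_irq(irq, event_hardirq, event_thread,
 *                      IRQF_SHARED, "switchtec", stdev);
 */
```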

Luckily, the solution to this entire problem was simple: don’t load the switchtec kernel module. We didn’t need any of its functionality, and without that slow interrupt handler in the mix, we were back up and running at full speed – with our blinking, colorful lights.
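On a typical modprobe.d-based distribution, “don’t load it” can be as simple as a one-line blacklist (the file name is the usual convention; your distribution may differ):

```
# /etc/modprobe.d/blacklist-switchtec.conf
blacklist switchtec
```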

The moral of the story

Outliers can be critically important.

This is one of those things almost everyone “knows,” but it’s an easy thing to forget, especially when you’re chasing a hypothesis. Averages are useful, after all – that one number can tell you a lot about a phenomenon. But don’t draw too many conclusions from the mean unless you understand the underlying distribution!
