Windows Storage Offloaded Data Transfer (ODX) and the Case of the Missing Prefetch

How one customer experienced slower than expected performance when comparing standard READ/WRITE to ODX – and how the engineering team leveraged Qumulo Core’s metrics, fast release cycle, and seamless upgrade experience to uncover and resolve the mystery!

What is ODX (Windows Storage Offloaded Data Transfer)?

Windows Storage Offloaded Data Transfer (ODX) is a feature of the Server Message Block (SMB) protocol that lets a client tell the Qumulo file system to copy a chunk of data from one file to another without sending the data through the client machine. ODX improves copy performance much like server-side copy (SSC): it eliminates network latency between the client and server, and it allows Qumulo Core to optimize IO sizes and buffering. When comparing ODX vs. SSC, our internal test results show ODX to be 7X faster, and SSC to be 4X faster, than the typical READ/WRITE sequence that occurs when copying a file to another location on the same server.
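As a rough sketch of the difference (the client object and its methods below are hypothetical stand-ins, not a real SMB API), a standard copy round-trips every byte through the client, while an offloaded copy only exchanges small tokens and range descriptors:

# Hedged sketch of the two copy paths. The `client` object and its methods are
# hypothetical stand-ins for an SMB client, not a real API.

def standard_copy(client, src, dst, chunk_size=1024 * 1024):
    """READ/WRITE copy: every byte crosses the network twice."""
    offset = 0
    while True:
        data = client.read(src, offset, chunk_size)   # server -> client
        if not data:
            break
        client.write(dst, offset, data)               # client -> server
        offset += len(data)

def offloaded_copy(client, src, dst, chunk_size=8 * 1024 * 1024):
    """ODX-style copy: the server moves the data; the client sends only tokens."""
    token = client.offload_read(src)                  # small opaque token, not the data
    size = client.file_size(src)
    offset = 0
    while offset < size:
        length = min(chunk_size, size - offset)
        client.offload_write(dst, token, offset, length)  # server-side data move
        offset += length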

Read Latency Mystery

In 2021, we shipped ODX support in Qumulo Core 3.0.0 and above. Soon after we shipped it, a customer reported that ODX performance was slower than expected.

Knowing that ODX reads from one file and writes to another, we initially suspected write performance to be the problem. Typically, read operations take at most a few milliseconds (thanks, SSD caching and prefetching!). Yet when we looked at the read latency, we saw that it was a larger part of the ODX latency than we expected – on the order of 25ms, with spikes to 40 or 50ms. What’s going on with that? Here’s a visualization of the lifetime of the read phase of an ODX operation (Figure 1):

We looked at the low-granularity timeseries performance data we collect via our cloud monitoring pipeline. On average, our RAM cache hit rate and latency metrics looked good. So why were the ODX-specific reads commonly slow?

The First Clue

As we got closer to uncovering the root cause, we needed higher-granularity metrics specific to a single ODX workload on this cluster – without the noise of other operations. Luckily, we have an internally built tool called trigger that collects this kind of distributed performance data on demand. We worked with our customer to profile some specific ODX sequences, and used trigger to collect data during those sequences. We learned a couple of things. (Refer to Figure 1, above.)

In each read operation, most blocks were in RAM cache, but each read was missing cache for a few blocks. When this happens, the file system transaction has to read the data from the storage devices themselves. In this case, the metrics told us the data was on spinning disks.

Note: In my mind I was thinking “of course, it had to be the spinning disks…it’s ALWAYS the spinning disks.” I should have known that seeks are the only thing that could have pushed the latency up to 50ms. Spinning disks and I don’t have the best relationship. They have an infuriating but useful habit of painfully exposing unintended behavior in performance-sensitive code paths.

Spinning disks can stream data efficiently when requests are queued in contiguous order: starting at the beginning of a file and reading until the end. However, in this case, we were only missing a few blocks from cache per ODX read operation – which the spinning disks see as a lot of discontiguous reads. That’s bad news for spinning disk latency.

Normally, file system read operations don’t experience this kind of spinning disk latency, because we have an effective prefetcher. The prefetcher guesses what’s going to be read next, and loads it into cache ahead of time. The prefetcher has an IO pattern optimized to read from spinning disks.
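For intuition, here is a minimal readahead sketch (illustrative only, not Qumulo Core’s actual prefetcher): on each read it warms the next contiguous window so a sequential reader keeps hitting cache.

# Minimal readahead sketch (illustrative only, not Qumulo Core's prefetcher).
PREFETCH_WINDOW = 8 * 1024 * 1024  # assumed 8 MiB prefetch chunk

class SimplePrefetcher:
    def __init__(self, storage, cache):
        self.storage = storage
        self.cache = cache

    def on_read(self, offset, size):
        # Guess that the reader continues sequentially: issue one large,
        # well-ordered disk read for the next window and warm the cache with it.
        next_offset = offset + size
        data = self.storage.read(next_offset, PREFETCH_WINDOW)
        self.cache.insert(next_offset, data)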

So, why wasn’t the prefetcher working?

Our prefetcher has metrics that describe how successful it is on average. These metrics told us that everything being prefetched was being read, but some data the file system read wasn’t getting prefetched…that’s really weird! Maybe the ODX operations are reading out of order or jumping around randomly? The prefetcher doesn’t do well with random IO.
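For concreteness, here is a rough sketch of the two ratios those metrics boil down to (the counter names are hypothetical, not Qumulo Core’s actual metric names):

# Hypothetical prefetcher counters (illustrative names, not Qumulo Core's actual
# metrics) capturing the two signals described above.

class PrefetchMetrics:
    def __init__(self):
        self.bytes_prefetched = 0       # data the prefetcher pulled into cache
        self.prefetched_bytes_read = 0  # prefetched data a later read actually used
        self.bytes_read = 0             # data the file system read overall
        self.read_bytes_prefetched = 0  # read data that had been prefetched in time

    def waste_ratio(self):
        # "Everything being prefetched was being read" -> waste close to zero.
        return 1.0 - self.prefetched_bytes_read / max(1, self.bytes_prefetched)

    def miss_ratio(self):
        # "Some data the file system read wasn't getting prefetched" -> non-zero.
        return 1.0 - self.read_bytes_prefetched / max(1, self.bytes_read)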

We checked the operation logs to see the IO pattern. Operation logs are a text-based logging tool that records details about each operation, such as offset, size, and cache hit or miss. However, we hadn’t anticipated that we’d need the ODX read sizes and offsets in our operation logs, and as a result that code hadn’t been hooked up! So we had to write some more code, and wait for an upgrade…

Finding the Culprit

Within a couple of weeks, we had shipped new code, and using our super awesome upgrades, the customer was able to upgrade their Qumulo storage system to the latest version soon after. They re-ran the test workload for us. Using the improved operation logging, we found another puzzle! Reading the operation logs (I felt like I was in The Matrix, watching the green numbers go by), we saw that our ODX protocol code was sending the file system contiguous read requests for 8650752 bytes each. When we did the math, that comes out to 8 MiB + 256 KiB. But! The operation log showed that each ODX request was actually only getting 8388608 bytes from cache (exactly 8 MiB).

I remember at first my eyes glazed over…was I misreading the numbers? How is each request missing cache for only the last 256 KiB of what it was asking for? How is the prefetcher only fetching a subset of each request? That’s impossible!

Note: The first stage of every bug is denial – believing that the data is wrong. Luckily, this wasn’t my first rodeo. I quickly stifled my denial and started asking myself, “How could this be possible?”

I started explaining how the prefetcher works to my coworker who was working with me on the investigation. The prefetcher should be getting ahead of the read workload, such that the IO size of the read workload doesn’t matter. The prefetcher always fetches data in large chunks and in such a way that the queueing on the spinning disks is well-ordered to avoid seeks. And then it hit me…large chunks…how big are the prefetch chunks? OH…they are 8 MiB…wait, that’s spooky. And how does it get ahead of the read workload? OH, we kick off one prefetch per read request…wait…putting those two things together…if we only kick off a single prefetch per read request, and the prefetch chunk size is at most 8 MiB, and the reader is reading chunks of 8 MiB + 256 KiB…then yeah, obviously, the prefetcher will never get ahead…it will always prefetch the next 8 MiB, leaving a tail of 256 KiB uncached. WOW. (Fist pump.) Ok. We found the smoking gun.

Example Illustrations of the Prefetch Problem
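A toy model (a simplification, not Qumulo Core code) makes the pattern easy to see: with at most one 8 MiB prefetch kicked off per read request, and reads of 8 MiB + 256 KiB, every request leaves a 256 KiB tail uncached.

# Toy model of the prefetch problem. The numbers come from the investigation;
# the cache/prefetch bookkeeping here is a deliberate simplification.
MiB = 1024 * 1024
KiB = 1024

PREFETCH_CHUNK = 8 * MiB              # at most one 8 MiB prefetch per read request
ODX_READ_SIZE = 8 * MiB + 256 * KiB   # what the ODX code was actually requesting

cache_end = PREFETCH_CHUNK  # assume the first 8 MiB is already warm in cache
offset = 0

for request in range(1, 5):
    end = offset + ODX_READ_SIZE
    cached = max(0, min(end, cache_end) - offset)
    uncached = (end - offset) - cached
    print(f"request {request}: {cached // MiB} MiB from cache, "
          f"{uncached // KiB} KiB tail from spinning disk")
    # One prefetch per read request, reaching at most 8 MiB past this read.
    cache_end = end + PREFETCH_CHUNK
    offset = end

Every request misses exactly its last 256 KiB, matching the 8388608-of-8650752 bytes we saw in the operation logs.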

Mystery Explained

Figuring out the bug is just the beginning. Three questions immediately came to mind:

  1. Why doesn’t this same problem happen in our standard file system read tests?
  2. Why is ODX using an IO size of 8 MiB + 256 KiB, larger than the prefetcher’s max IO size?
  3. Why didn’t our ODX-specific performance tests catch this?

1. Why doesn’t this problem happen in our standard file system read performance tests?

Our file system protocol scheduler processes all protocol requests. A typical protocol request size is 1 MiB. When we receive protocol read requests, we process them and potentially combine requests to optimize read performance. Naturally, we have a maximum combined IO size for these file system requests: if you let the IO size grow too large, the request can get bottlenecked on some large single-threaded piece of work (for example, CPU or memory allocation). So our file system scheduler caps the maximum combined read request at 8 MiB, the same as our prefetcher’s max IO size. This was by design. So in these workloads, the prefetcher is prefetching everything, although it isn’t ever getting more than 8 MiB ahead if the combined IO size is also 8 MiB…(scratches head…that also wasn’t intended behavior…).
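Here is a hedged sketch of that combining behavior (hypothetical helper, not the actual scheduler code):

# Sketch of why the standard read path never hits the tail problem: the
# scheduler combines adjacent protocol reads but caps the combined request at
# the same 8 MiB the prefetcher uses, so a combined read never outruns a prefetch.
MiB = 1024 * 1024
MAX_COMBINED_READ = 8 * MiB           # scheduler's cap, matches the prefetcher
TYPICAL_PROTOCOL_READ = 1 * MiB

def combine_reads(pending_sizes):
    """Greedily combine contiguous pending reads without exceeding the cap."""
    combined, current = [], 0
    for size in pending_sizes:
        if current + size > MAX_COMBINED_READ:
            combined.append(current)
            current = 0
        current += size
    if current:
        combined.append(current)
    return combined

# Twelve 1 MiB protocol reads collapse into an 8 MiB and a 4 MiB file system
# read, both of which an 8 MiB prefetch window can fully cover.
print(combine_reads([TYPICAL_PROTOCOL_READ] * 12))  # -> [8 MiB, 4 MiB]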

2. Why is ODX using an IO size of 8 MiB + 256 KiB?

The ODX operation itself goes through the scheduler, but it doesn’t put its internal read requests through the scheduler. So ODX was setting the maximum read IO size it would request on its own, independent of the scheduler.

We found that the scheduler’s max IO size and ODX’s max IO size came from different functions:

fs_target_transaction_size()
fs_max_transaction_size()

The scheduler was using fs_target_transaction_size(), which is the minimum of fs_max_transaction_size() and 8 MiB! Whereas the ODX code was intuitively using fs_max_transaction_size() – which is computed, and turned out to be 8 MiB + 256 KiB.

So there was our root cause. We fixed the ODX protocol code to use fs_target_transaction_size(), and voilà – no more uncached tails on the read requests.
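In code terms, the relationship between the two limits and the fix looks roughly like this (the function names are the real ones; the bodies here are simplified assumptions):

# Simplified stand-ins for the two limit functions; the real implementations
# compute these values from system parameters.
MiB = 1024 * 1024
KiB = 1024

def fs_max_transaction_size():
    # Computed limit; on this system it worked out to 8 MiB + 256 KiB.
    return 8 * MiB + 256 * KiB

def fs_target_transaction_size():
    # What the scheduler uses: capped at 8 MiB, so it lines up with the prefetcher.
    return min(fs_max_transaction_size(), 8 * MiB)

# Before the fix: ODX sized its reads with fs_max_transaction_size() -> 8 MiB + 256 KiB.
# After the fix:  ODX uses fs_target_transaction_size() -> 8 MiB, fully prefetchable.
odx_max_read_size = fs_target_transaction_size()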

3. Why didn’t our performance tests of ODX catch this?

Our ODX performance tests did, in fact, exhibit the uncached-tail behavior. But the performance wasn’t that bad, because in this performance test the data was small enough to fit in the SSD cache!

Looking back at Figure 1 (above), you can see that the latency of reads from SSD is much lower than from spinning disk. So it wasn’t as obvious from the top-line performance test results that the prefetcher was doing the wrong thing. The test wasn’t explicitly checking the prefetcher’s metrics.

Once we changed the performance test to add a second step of evicting the data from SSD and then reading it back from spinning disk, the top-line performance was significantly worse than it should have been, revealing the bug. We fixed the bug, observed this automated performance test improve, and shipped the improvement in two weeks. Once again leveraging Qumulo’s easy upgrades, our customers were able to upgrade soon after the fix and get the improved ODX performance.
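Roughly, the changed test looks like this (hypothetical test helpers, not our actual test framework):

# Sketch of the added test step: evict the source data from RAM and SSD before
# the timed copies so any prefetch miss has to pay the spinning disk penalty.

def test_odx_copy_from_spinning_disk(cluster, workload):
    workload.write_source_files(cluster)

    # New step: push the source data out of cache so reads come from spinning
    # disk, where a missed prefetch costs tens of milliseconds per seek.
    cluster.evict_from_cache(workload.source_files)

    elapsed = workload.run_odx_copies(cluster)
    assert elapsed < workload.expected_copy_time(cluster), (
        "ODX copy slower than expected; check prefetcher hit-rate metrics"
    )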
