Following the Cookie Crumbs: Investigating a Performance Anomaly at Qumulo

Posted February 26, 2018 by Ted Carpenter — Member of Technical Staff, Chris Cole — Member of Technical Staff. Filed under Engineering.

Following the Cookie Crumbs: Investigating a Performance Anomaly at Qumulo

Here at Qumulo, we build a high performance, distributed file system and continually release updates every two weeks. Shipping enterprise-grade software so frequently requires an extensive test suite to ensure we’ve made a high quality product. Our performance test suite continuously runs across all of our platform offerings and includes file performance tests run against industry standard benchmarks.

This post digs into how we identified and fixed a sporadic performance anomaly we recently observed in one of our benchmarks.

Enter the Anomaly

Over a period of a few months, we observed variability in our multi-stream read and write benchmarks. These benchmarks use IOzone to generate concurrent reads and concurrent writes against the cluster, and measure the aggregated throughput across all connected clients. In particular, we observed a bimodal distribution where most runs reached a consistently stable performance target while a second, smaller set of results sporadically ran about 200-300MB/s slower, which is about 10% worse. Here is a graph that shows the performance results.

Characterizing the Problem

When investigating any performance anomaly, the first step is to remove as many variables as possible. The sporadic results were first identified across hundreds of software versions over a timespan of months. To simplify matters, we kicked off a series of runs of the benchmark, all on the same hardware and on a single software version. This series of runs showed the same bimodal distribution, which meant that the variability couldn’t be explained by hardware differences or software-version-specific regressions.

Having reproduced the bimodal perf on a single version, we then compared the detailed performance data collected from a fast run and a slow run. The first thing that jumped out was that inter-node RPC latencies were much higher for the bad runs than for the good runs. This could have been for a number of reasons, but hinted at a networking-related root cause.

Exploring TCP Socket Performance

With that in mind, we wanted more detailed data on our TCP socket performance from test runs, so we enabled our performance test profiler to continually gather data from ss. Each time ss runs, it outputs detailed stats for each socket in the system:

1
2
3
4
5
6
7
>ss -tio6
State      Recv-Q Send-Q                                                                                         Local Address:Port                                                                                                          Peer Address:Port
ESTAB      0      0                                                                            fe80::f652:14ff:fe3b:8f30%bond0:56252                                                                                            fe80::f652:14ff:fe3b:8f60:42687                 
sack cubic wscale:7,7 rto:204 rtt:0.046/0.01 ato:40 mss:8940 cwnd:10 ssthresh:87 bytes_acked:21136738172861 bytes_received:13315563865457 segs_out:3021503845 segs_in:2507786423 send 15547.8Mbps lastsnd:348 lastrcv:1140 lastack:348 pacing_rate 30844.2Mbps retrans:0/1540003 rcv_rtt:4 rcv_space:8546640
ESTAB      0      0                                                                            fe80::f652:14ff:fe3b:8f30%bond0:44517                                                                                            fe80::f652:14ff:fe2b:4030:45514                 
sack cubic wscale:7,7 rto:204 rtt:2.975/5.791 ato:40 mss:8940 cwnd:10 ssthresh:10 bytes_acked:2249367594375 bytes_received:911006516679 segs_out:667921849 segs_in:671354128 send 240.4Mbps lastsnd:348 lastrcv:1464 lastack:348 pacing_rate 288.4Mbps retrans:0/19980 rcv_rtt:4 rcv_space:6539064
…

Each socket in the system corresponds to one entry in the output.

As you can see by the sample output, ss dumps its data in a way that isn’t very friendly to analysis. We took the data and plotted the various components to give a visual view of the TCP socket performance across the cluster for a given performance test. With this plot, we could easily compare the fast tests and slow tests and look for anomalies.

The most interesting of these plots was the congestion window size (in segments) during the test. The congestion window (signified by cwnd: in the above output) is crucially important to TCP performance, as it controls the amount of data outstanding in-flight over the connection at any given time. The higher the value, the more data TCP can send on a connection in parallel. When we looked at the congestion windows from a node during a low-performance run, we saw two connections with reasonably high congestion windows and one with a very small window.

Looking back at the inter-node RPC latencies, the high latencies directly correlated with the socket with the tiny congestion window. This brought up the question - why would one socket maintain a very small congestion window compared to the other sockets in the system?

Having identified that one RPC connection was experiencing significantly worse TCP performance than the others, we went back and looked at the raw output of ss. We noticed that this ‘slow’ connection had different TCP options than the rest of the sockets. In particular, it had the default tcp options. Note that the two connections have vastly different congestion windows and that the line showing a smaller congestion window is missing sack and wscale:7,7.

1
2
3
4
ESTAB      0      0      ::ffff:10.120.246.159:8000                  ::ffff:10.120.246.27:52312                 
sack cubic wscale:7,7 rto:204 rtt:0.183/0.179 ato:40 mss:1460 cwnd:293 ssthresh:291 bytes_acked:140908972 bytes_received:27065 segs_out:100921 segs_in:6489 send 18700.8Mbps lastsnd:37280 lastrcv:37576 lastack:37280 pacing_rate 22410.3Mbps rcv_space:29200
ESTAB      0      0      fe80::e61d:2dff:febb:c960%bond0:33610                fe80::f652:14ff:fe54:d600:48673
     cubic rto:204 rtt:0.541/1.002 ato:40 mss:1440 cwnd:10 ssthresh:21 bytes_acked:6918189 bytes_received:7769628 segs_out:10435 segs_in:10909 send 212.9Mbps lastsnd:1228 lastrcv:1232 lastack:1228 pacing_rate 255.5Mbps rcv_rtt:4288 rcv_space:1131488

This was interesting, but looking at just one socket datapoint didn’t give us much confidence that having default TCP options was highly correlated with our tiny congestion window issue. To get a better sense of what was going on, we gathered the ss data from our series of benchmark runs and observed that 100% of the sockets without the SACK (selective acknowledgement) options maintained a max congestion window size 90-99.5% smaller than every socket with non-default TCP options. There was clearly a correlation here between sockets were missing the SACK option and the performance of those TCP sockets, which makes sense as SACK and other options are intended to increase performance.

How TCP options are set

TCP options on a connection are set by passing options values along with messages containing SYN flags. This is part of the TCP connection handshake (SYN, SYN+ACK, ACK) required to create a connection. Below is an example of an interaction where MSS (maximum segment size), SACK, and WS (window scaling) options are set.

blogpost-cookies-graph4

So where have our tcp options gone?

Although we had associated the missing SACK and window scaling options with smaller congestion windows and low-throughput connections, we still had no idea why these options were turned off for some of our connections. After all, every connection was created using the same code!

We decided to focus on the SACK option because it’s a simple flag, hoping that would be easier to debug. In Linux, SACK is controlled globally by a sysctl and can’t be controlled on a per-connection basis. And we had SACK enabled on our machines: >sysctl net.ipv4.tcp_sack net.ipv4.tcp_sack = 1

We were at a loss as to how our program could have missed setting these options on some connections. We started by capturing the TCP handshake during connection setup. We found that the initial SYN message had the expected options set, but the SYN+ACK removed SACK and window scaling.

We cracked open the Linux kernel’s TCP stack and started searching for how the SYN+ACK options are crafted. We found tcp_make_synack, which calls tcp_synack_options:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
static unsigned int tcp_synack_options(const struct sock *sk,
                       struct request_sock *req,
                             unsigned int mss, struct sk_buff *skb,
                             struct tcp_out_options *opts,
                             const struct tcp_md5sig_key *md5,
                             struct tcp_fastopen_cookie *foc)
{
    ...
    if (likely(ireq->sack_ok)) {
        opts->options |= OPTION_SACK_ADVERTISE;
        if (unlikely(!ireq->tstamp_ok))
            remaining -= TCPOLEN_SACKPERM_ALIGNED;
    }
    ...
    return MAX_TCP_OPTION_SPACE - remaining;
}

We saw that the SACK option is simply set based on whether the incoming request has the SACK option set, which was not very helpful. We knew that SACK was getting stripped from this connection between the SYN and SYN+ACK, and we still had to find where it was happening.

We took a look at the incoming request parsing in tcp_parse_options:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
void tcp_parse_options(const struct net *net,
               const struct sk_buff *skb,
               struct tcp_options_received *opt_rx, int estab,
               struct tcp_fastopen_cookie *foc)
{
    ...
            case TCPOPT_SACK_PERM:
                if (opsize == TCPOLEN_SACK_PERM && th->syn &&
                    !estab && net->ipv4.sysctl_tcp_sack) {
                        opt_rx->sack_ok = TCP_SACK_SEEN;
                        tcp_sack_reset(opt_rx);
                }
                break;
    ...
}

We saw that, in order to positively parse a SACK option on an incoming request, the request must have the SYN flag (it did), the connection must not be established (it wasn’t), and the net.ipv4.tcp_sack sysctl must be enabled (it was). No luck here.

As part of our browsing we happened to notice that when handling connection requests in tcp_conn_request, it sometimes clears the options:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
int tcp_conn_request(struct request_sock_ops *rsk_ops,
             const struct tcp_request_sock_ops *af_ops,
             struct sock *sk, struct sk_buff *skb)
{
    ...
    tcp_parse_options(skb, &tmp_opt, 0, want_cookie ? NULL : &foc);

    if (want_cookie && !tmp_opt.saw_tstamp)
        tcp_clear_options(&tmp_opt);
    ...
    return 0;
}

We quickly found out that the want_cookie variable indicates that Linux wants to use the TCP SYN cookies feature, but we didn’t have any idea what that meant.

What are TCP SYN cookies?

SYN flooding

TCP servers typically have a limited amount of space in the SYN queue for connections that aren’t yet established. When this queue is full, the server cannot accept more connections and must drop incoming SYN requests.

This behavior leads to a denial-of-service attack called SYN flooding. The attacker sends many SYN requests to a server, but when the server responds with SYN+ACK, the attacker ignores the response and never sends an ACK to complete connection setup. This causes the server to try resending SYN+ACK messages with escalating backoff timers. If the attacker never responds and continues to send SYN requests, it can keep the servers SYN queue full at all times, preventing legitimate clients from establishing connections with the server.

Resisting the SYN flood

TCP SYN cookies solve this problem by allowing the server to respond with SYN+ACK and set up a connection even when the SYN queue is full. What SYN cookies do is actually encode the options that would normally be stored in the SYN queue (plus a cryptographic hash of the approximate time and source/destination IPs & ports) entry into the initial sequence number value in the SYN+ACK. The server can then throw away the SYN queue entry and not waste any memory on this connection. When the (legitimate) client eventually responds with an ACK message, it will contain the same initial sequence number. The server can then decode the hash of the time and, if it’s valid, decode the options and complete connection setup without using any SYN queue space.

Drawbacks of SYN cookies

Using SYN cookies to establish a connection has one drawback: there isn’t enough space in the initial sequence number to encode all the options. The Linux TCP stack only encodes the maximum segment size (a required option) and sends a SYN+ACK that rejects all other options, including the SACK and window scaling options. This isn’t usually a problem because it’s only used when the server has a full SYN queue, which isn’t likely unless it’s under a SYN flood attack.

Below is an example interaction that shows how a connection would be created with SYN cookies when a server’s SYN queue is full.

blogpost-cookies-graph6

Qumulo’s TCP problem

After studying TCP SYN cookies, we recognized that they were likely responsible for our connections periodically missing TCP options. Surely, we thought, our test machines weren’t under a SYN flood attack, so their SYN queues should not have been full.

We went back to reading the Linux kernel and discovered that the maximum SYN queue size was set in inet_csk_listen_start:

1
2
3
4
5
6
7
int inet_csk_listen_start(struct sock *sk, int backlog)
{
    ...
    sk->sk_max_ack_backlog = backlog;
    sk->sk_ack_backlog = 0;
    ...
}

From there, we traced through callers to find that the backlog value was set directly in the listen syscall. We pulled up Qumulo’s socket code and quickly saw that when listening for connections, we always used a backlog of size 5.

1
2
    if (listen(fd, 5) == -1)
        return error_new(system_error, errno, "listen");

During cluster initialization we were creating a connected mesh network between all of the machines, so of course we had more than 5 connections created at once for any cluster of sufficient size. We were SYN flooding our own cluster from the inside!

We quickly made a change to increase the backlog size that Qumulo used and all of the bad performance results disappeared: case closed!

Want to work on solving challenges like this?

Suggested Further Readings

blogimage-showback-title

How to get users to delete data

When enterprise storage runs out of capacity, getting people to delete data can be a challenge. Find out how Qumulo makes capacity planning easier.

Read more
blogimage-restapi3

Making 100% code coverage as easy as flipping a coin

We wanted a way to execute every possible path through a particular piece of code so that we could verify that, no matter what, the behavior of the code was correct and the invariants of the system held.

Read more
blogimage-keystore-title

Building a Distributed Key-Value Store

How Qumulo built a new distributed key-value store for defining the composition of the protection system used by our clustered file system

Read more
224x224
Ted Carpenter
Member of Technical Staff
Qumulo is the Ferrari of enterprise storage

Ted Carpenter is a developer at Qumulo focused on filesystem performance. He enjoys looking at detailed storage profiling data and crushing performance bottlenecks.

224x224
Chris Cole
Member of Technical Staff
"I'm fascinated by distributed systems, performance optimizations, and the cloud. At Qumulo, I get all three at once! What more could I ask for?"

Chris is a developer focused on bringing high performance enterprise-grade file storage to the cloud

Let's start a conversation

We are always looking for new challenges in enterprise storage. Drop us a line and we will be in touch.

Contact Information

REACH US

EMAIL

General: info@qumulo.com
PR & Media: pr@qumulo.com

WORK WITH US

SUPPORT

Search

Enter a search term below