By Daniel Pehush and Gunter Zink

The mission: The engine we chose

In our last post in this series, we talked about how our team set out to build the highest bandwidth, all-flash node possible using off-the-shelf components. There were several things to consider once the hardware technology was finally ready for our software-defined vision.

After analyzing available CPUs, we picked the Intel Xeon Gold 6126 CPU, which has 12 cores / 24 threads at 2.6 GHz base and 3.7 GHz max Turbo, with a TDP of 125 watts. This was chosen for its higher frequency, lower core count and power rating.

The next key was to balance the design across three kinds of bandwidth: front-end network, back-end network and local disk IO, all optimized for the Qumulo software architecture.

In a Qumulo four-node cluster, a given client system connects via NFS or SMB to one of the nodes. With this connection scheme, 75 percent of the reads for this client come from other nodes in the cluster via the back-end network, and 25 percent come from disks in the node to which that client is attached.
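
The 75/25 split follows from simple proportions. Here is a minimal sketch of that arithmetic, assuming data is spread evenly across all nodes; the function name `read_split` is hypothetical and only for illustration, not part of Qumulo's software.

```python
from fractions import Fraction


def read_split(node_count: int) -> tuple[Fraction, Fraction]:
    """Return the (local, remote) share of a client's reads.

    Assumes data is spread evenly across all nodes, so the node the
    client is attached to holds 1/node_count of what it reads.
    """
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    local = Fraction(1, node_count)
    return local, 1 - local


# Four-node cluster, as in the example above.
local, remote = read_split(4)
print(f"local: {float(local):.0%}, remote: {float(remote):.0%}")
```

Under the same even-spread assumption, a larger cluster pushes an even bigger share of each client's reads onto the back-end network.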

When a client writes to a single node, that data is then erasure-coded and distributed to the other three nodes in the cluster via the back-end network. The bandwidth needed for the back-end tends to equal the front-end network (assuming large writes). When many clients are connected to all nodes in the cluster, we ideally have the same front-end network, back-end network and local disk IO bandwidth.

Looking at the components and predicting where the performance bottlenecks would be, an easy target was the network.

We considered putting three dual-port NICs in a single server! That idea was a bit out there, as we would have produced a box with an uneven network balance between front-end and back-end traffic, and we would have needed to develop software to shift traffic to the third NIC depending on which network pipe needed it. It quickly became apparent that this was not worth the software development effort, and we scrapped the idea.

We decided that a blazing fast, all-flash platform needed two dual-port 100GbE PCIe x16 NICs. This was not only good enough for an initial fast release, but also allowed for performance headroom as the software was optimized down the road, preventing the hardware from being the bandwidth limiter for the platform.

Performance by the numbers

Now, let’s walk through some of the juicy hardware performance numbers.

A PCIe Gen3 lane runs at 8 GT/s, for a theoretical maximum of 985 MB/s. The NIC is a x16 Gen3 card with 16 physical lanes, giving a theoretical maximum of 15,760 MB/s on the PCIe side. On the Ethernet side, each port of a dual-port 100GbE NIC is capable of 100 Gbps, which equates to 12,500 MB/s, so the two ports total 25,000 MB/s. These obviously don’t match up, but it’s the best available on the market until PCIe Gen4 is widespread. So the PCIe bandwidth to the NIC is the true bottleneck here.
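
To make the comparison concrete, here is a small sketch of the arithmetic using the theoretical figures quoted above. The helper names are made up for this example.

```python
PCIE_GEN3_LANE_MB_S = 985  # theoretical per-lane figure quoted above


def pcie_gen3_mb_s(lanes: int) -> int:
    """Theoretical bandwidth of a PCIe Gen3 link with the given lane count."""
    return lanes * PCIE_GEN3_LANE_MB_S


def ethernet_mb_s(ports: int, gbps_per_port: int) -> int:
    """Aggregate Ethernet line rate in MB/s."""
    # 1 Gbps is 1000 Mb/s, and 8 bits make one byte.
    return ports * gbps_per_port * 1000 // 8


pcie = pcie_gen3_mb_s(16)
ethernet = ethernet_mb_s(ports=2, gbps_per_port=100)
bottleneck = "PCIe" if pcie < ethernet else "Ethernet"
print(f"PCIe x16: {pcie} MB/s, 2 x 100GbE: {ethernet} MB/s, limit: {bottleneck}")
```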

Due to software and protocol overhead, we round down the theoretical max of a single PCIe Gen3 lane from 985 MB/s to 800 MB/s for each receive and transmit pair.

Using our rounded numbers to account for overhead, our PCIe bandwidth to each NIC is 12.8 GB/s (16 lanes x 800 MB/s). In our software, we split the back-end traffic and front-end traffic, so our front-end bandwidth limit for client connectivity is 12.8 GB/s and our back-end bandwidth limit for intra-cluster connectivity is also 12.8 GB/s.
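
The same calculation with the rounded-down lane figure looks like this. It is a sketch only: it assumes one x16 NIC carries front-end traffic and the other carries back-end traffic, which is how we read the split described above, and the names are illustrative.

```python
DERATED_LANE_MB_S = 800  # rounded-down per-lane figure from the text
NIC_LANES = 16           # each NIC is a x16 Gen3 card


def derated_link_gb_s(lanes: int) -> float:
    """Usable link bandwidth in GB/s after rounding down for overhead."""
    return lanes * DERATED_LANE_MB_S / 1000


# Assumption: one NIC per traffic class.
front_end_gb_s = derated_link_gb_s(NIC_LANES)
back_end_gb_s = derated_link_gb_s(NIC_LANES)
print(f"front-end: {front_end_gb_s} GB/s, back-end: {back_end_gb_s} GB/s")
```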

So what is our NVMe drive IO bandwidth?

We decided to make two SKUs to give our customers a choice of storage density: one with 12 drives, utilizing half the slots in our chosen chassis, and one with 24 drives, utilizing all the slots. Each NVMe SSD has four lanes of PCIe Gen3 going to it, meaning the bandwidth with our rounded-down number is 3.6 GB/s. No drive available on the market today can saturate this bus, but the standard for connecting U.2 devices is designed to allow faster NVMe SSDs to be used in the future. While 3.6 GB/s is what each drive can offer, the motherboard we chose does not have enough PCIe lanes to connect all drives to the CPU at full bandwidth.

This is where PCIe switches come in.

On the motherboard, we utilized four onboard OCuLink ports, one four-port PCIe switch that is a x8 card, and two eight-port PCIe switches that are x8 cards. Each port from one of these devices is x4 lanes of PCIe Gen3 for an NVMe device, so these devices attach at full bandwidth to our data storage devices! The story on the CPU side is not so simple. The OCuLink ports are directly attached to the CPU, so they offer full bandwidth from the CPU to the NVMe SSD. The four-port switch and the eight-port switches each connect to the CPU over x8 lanes, yielding 6.4 GB/s per switch.

For the 12-drive version, the wiring works out as follows:

  • One NVMe SSD’s bandwidth is 3.6 GB/s
  • Four NVMe SSDs’ combined bandwidth is 14.4 GB/s
  • One x8 PCIe switch can deliver back to the CPU 6.4 GB/s of bandwidth

There is more bandwidth for the storage devices than the switches, so the switches limit the bandwidth.

12-drive configuration:

  • One x8 PCIe switch with 4 NVMe drives
  • One x8 PCIe switch with 4 NVMe drives
  • One x8 PCIe switch with 4 NVMe drives
  • Max IO bandwidth: 3 x 6.4 GB/s = 19.2 GB/s
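
The 12-drive total can be sketched as a "smaller of the two" calculation per switch. The per-drive and uplink figures below are the ones stated in the text; the function name is hypothetical.

```python
DRIVE_MB_S = 3600          # per-drive figure stated in the text
SWITCH_UPLINK_MB_S = 6400  # x8 uplink at 800 MB/s per lane


def switch_throughput_mb_s(drive_count: int) -> int:
    """Usable bandwidth through one switch.

    This is the smaller of what the attached drives can supply and
    what the switch's uplink to the CPU can carry.
    """
    return min(drive_count * DRIVE_MB_S, SWITCH_UPLINK_MB_S)


# 12-drive configuration: three x8 switches with four drives each.
drives_per_switch = [4, 4, 4]
total_mb_s = sum(switch_throughput_mb_s(n) for n in drives_per_switch)
print(f"Max IO bandwidth: {total_mb_s / 1000} GB/s")
```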

24-drive configuration:

  • Four ports of OCuLink to 4 NVMe SSDs
  • One x8 PCIe switch with 4 NVMe SSDs
  • One x8 PCIe switch with 8 NVMe SSDs
  • One x8 PCIe switch with 8 NVMe SSDs
  • Max IO bandwidth: 19.2 GB/s

With the above configuration, if you’re doing the math, you would determine that the hardware bandwidth is higher for the 24-drive configuration than for the 12-drive configuration. While this is true, our software utilizes drives evenly. The limiting factor is the PCIe bandwidth from an eight-port x8 switch to the CPU, which is 6.4 GB/s.

On the 12-drive configuration, the max bandwidth available to any NVMe device is 1.6 GB/s. The 1.6 GB/s number comes from the fact that a x8 switch has 6.4 GB/s of bandwidth back to the CPU, which is split between 4 NVMe devices (6.4/4 = 1.6). The same principle of utilizing drives equally holds for our 24-drive configuration, but now we have 6.4 GB/s split between 8 drives, hence 800 MB/s per drive (6.4/8 = 0.8). So the even way our software uses the drives limits the usable hardware bandwidth, and the max IO bandwidth ends up being the same for both configurations: 12 x 1.6 GB/s and 24 x 800 MB/s both come to 19.2 GB/s.
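
Here is a minimal sketch of that even-utilization reasoning. It models each attach point as a (drive count, uplink bandwidth) pair and assumes that when drives are used evenly, every drive runs at the rate of the most constrained one. The data layout and function names are assumptions made for illustration, not a description of how Qumulo's software is written.

```python
DRIVE_LINK_MB_S = 3600  # per-drive figure stated in the text

# Each group is (drive_count, uplink_mb_s) for one attach point.
TWELVE_DRIVE = [(4, 6400), (4, 6400), (4, 6400)]
TWENTY_FOUR_DRIVE = [
    (4, 4 * DRIVE_LINK_MB_S),  # OCuLink: direct attached, no shared uplink
    (4, 6400),                 # four drives behind a x8 switch
    (8, 6400),                 # eight drives behind a x8 switch
    (8, 6400),                 # eight drives behind a x8 switch
]


def per_drive_mb_s(drive_count: int, uplink_mb_s: int) -> int:
    """Share of the uplink each drive gets, capped by its own x4 link."""
    return min(DRIVE_LINK_MB_S, uplink_mb_s // drive_count)


def even_use_total_mb_s(groups: list[tuple[int, int]]) -> int:
    """Total bandwidth when every drive runs at the slowest drive's rate."""
    slowest = min(per_drive_mb_s(n, uplink) for n, uplink in groups)
    drive_count = sum(n for n, _ in groups)
    return drive_count * slowest


for name, groups in [("12-drive", TWELVE_DRIVE), ("24-drive", TWENTY_FOUR_DRIVE)]:
    print(f"{name}: {even_use_total_mb_s(groups) / 1000} GB/s")
```

In this model the eight-drive switches set the pace for the 24-drive node, which is why doubling the drive count does not raise the total.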

Both all-flash configurations are limited by the x16 lanes of PCIe Gen3 going to our 100GbE NICs. Such is the state of hardware technology today for x86-based platforms.

Stay tuned for part three of this series!
