By Daniel Pehush and Gunter Zink
The Mission – Caution, bumps ahead
Picking up from our previous blog post, with the performance squared away, the next important piece was selecting the vendor(s) for our storage devices.
Qumulo had already bought from, or had experience with, most of the vendors we spoke with through their SATA SSDs. Our qualification testing yielded the following results:
- Vendor A: Good reliability track record, best performance, not cheap, good supply
- Vendor B: Good reliability, fast enough, expensive, supply chain challenges
- Vendor C: Failed internal testing
- Vendor D: Failed internal testing
Well, only one option fit what we wanted to deliver to customers at launch. This was obviously not ideal, and we’ve stayed in close contact with these vendors to validate future candidates for use in our products. We spent an appropriate amount of time with each vendor trying to make their solution work for us, but in the end had only one great option.
The chassis we chose is a complete solution from a vendor and comes with excellent features, including tool-less field-replaceable fans, tool-less drive carriers, and tool-less PCIe riser adaptors that give access to our PCIe switch cards and PCIe NIC cards. The power supply units (PSUs) are even designed smartly: once a PSU cord is installed, you cannot push the switch to unlock a PSU from the chassis without first removing the cord, so no PSU removal is possible while the PSU is powered. This design is a great way to prevent user accidents. The only components we add from a customization standpoint are the drives and the NICs – what ease! It is a well-made system that exudes enterprise readiness, and we are glad to make it blazingly fast for file with our software.
Obstacles on the tracks, or should I say on the road to development?
PCIe Lane Architecture
So the stage was primed for Qumulo’s first all-flash hardware platform to be qualified and delivered to the software teams for development to begin. The chassis is fully integrated from a vendor and has 24 slots for NVMe devices. This is all, as one might colloquially state, leading-edge technology, but NVMe has been on the market for a while now, so things were expected to go smoothly. In addition to NVMe maturity, we felt confident choosing a chassis already established in the marketplace. The teams at Qumulo have plenty of experience delivering custom platforms, so delivering an off-the-shelf platform should have fewer wrinkles, right?
After we had already submitted a PO for a number of the servers, it came to light that, while the 2U chassis is sold with 24 bays for NVMe U.2 devices, only 20 of the bays are wired up for actual use. We had planned to make use of all 24 bays, and so began a PCIe architecture discussion of how we could get access to all of them. To start, CPU1 has 16 lanes of PCIe Gen3 available, while CPU2 has 32 lanes. Each CPU also has 2 OCuLink PCIe x4 ports. To address 24 NVMe devices, each needing 4 lanes of PCIe Gen3, a total of 96 lanes must be run to the backplane these drives plug into. We also need 2 sets of 16 lanes to service our two dual-port 40GbE/100GbE NICs.
If you’ve kept all those numbers in your head, we need a total of 128 PCIe Gen3 lanes and have 56 lanes to work with. Those numbers don’t add up! Of course, the server is already designed to address this with PCIe switches. If you’re unfamiliar, PCIe switches can create multiple sets of lanes out of a smaller set, allowing them to be shared with multiple devices. The server came with two PCIe switches, each taking 8 lanes in and fanning out to 8 NVMe U.2 devices, or 32 lanes. So now our total came to 104 lanes available. As stated, the vendor only intended for 20 of the slots to be addressable, so if we cut out those 4 devices (a total of 16 lanes), we would have 104 lanes available against a need of 112. Finally, we could have reduced our NICs from x16 devices to x8 devices, halving the network bandwidth we wanted for our customers, and then we would have had enough lanes to satisfy the platform. This … was not a solution that would satisfy our customers.
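The shortfall is easier to follow written out as arithmetic. A quick sketch in Python, using only the figures quoted in this section:

```python
# Lane budget as originally shipped (numbers quoted above).
raw_lanes = 56                  # CPU1 (16) + CPU2 (32) + OCuLink ports

# What we want to drive:
lanes_needed = 2 * 16 + 24 * 4  # two x16 NICs + 24 x4 NVMe drives
assert lanes_needed == 128

# Each stock PCIe switch trades a x8 uplink for 8 downstream x4 drive slots:
per_switch_gain = 8 * 4 - 8     # 32 lanes out for 8 lanes in
with_switches = raw_lanes + 2 * per_switch_gain
assert with_switches == 104     # still 24 lanes short of the 128 we need
```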
Then we ran into a mechanical restriction. There are 3 PCIe risers in the system. These are PCBAs that feed the PCIe lanes into optimal mechanical organizations. The system was originally designed for Riser1 to feed three x8 PCIe adaptor cards, Riser2 to feed three x8 PCIe adaptor cards, and Riser3 to feed a single x4 adaptor card. There wasn’t even a physical place for us to install a x16 NIC card as we originally desired! At some point, as a way to get enough lanes to the drives, we discussed halving the bandwidth of the NVMe devices and attaching them via 2 lanes instead of 4 (bifurcation). The vendor came back citing that this would be using the devices outside of specification, so that direction was not an option.
We worked with the vendor to generate a solution that would delight our customers, and found that another type of riser card had been designed, with a x8 slot and a x16 slot. We validated this new PCIe riser, rearranged things mechanically, and got to an optimal solution. Now CPU1 fed 2 NVMe devices via 2 OCuLink PCIe x4 ports and one x16 NIC, while CPU2 fed 2 NVMe devices via 2 OCuLink PCIe x4 ports, one x16 NIC, two x8 8-port PCIe switches, and one x4 4-port PCIe switch. If we do some quick math…
Desired lanes (128) = (x16 NIC) ✕ 2 + (x4 NVMe U.2 drive) ✕ 24
Available lanes (128) = (x4 OCuLink ports) ✕ 4 + (CPU1 x16 for NIC) + (CPU2 x16 for NIC) + (two 8-port switches fanning x8 uplinks out to 32 lanes each) + (one 4-port switch fanning a x4 uplink out to 16 lanes)
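The quick math can be double-checked mechanically. A small Python sketch of the final lane accounting, again using only the figures given above:

```python
# Final PCIe Gen3 lane accounting for the 24-drive, dual-NIC configuration.
desired = 2 * 16 + 24 * 4  # two x16 NICs + 24 x4 NVMe U.2 drives

oculink  = 4 * 4     # four OCuLink x4 ports, each feeding one drive directly
cpu1_nic = 16        # CPU1 x16 slot for one NIC
cpu2_nic = 16        # CPU2 x16 slot for the other NIC
switch8  = 2 * 8 * 4 # two 8-port switches, each fanning a x8 uplink to 8 x4 drives
switch4  = 1 * 4 * 4 # one 4-port switch, fanning a x4 uplink to 4 x4 drives
available = oculink + cpu1_nic + cpu2_nic + switch8 + switch4

assert desired == available == 128
print(f"desired={desired}, available={available}")
```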
Success! We have the PCIe architecture to take full advantage of the server without compromise! This would not have been possible without our persistence and vendor cooperation, so well-deserved kudos to our server/chassis vendor!
As we worked to choose a CPU, the vendor stated that the CPU we desired to use – the Intel Xeon Gold 6126, with a TDP of 125W – was too hot for the chassis. We needed a lower-TDP CPU that consumed only 85W, so we then tested with the Intel Xeon Gold 5115. This showed an undesirable drop in performance compared to the 6126. In addition, we had configured the server in a way the vendor did not originally intend, adding 4 NVMe devices they did not plan to support. For this system to deliver the correct performance and storage for our customers, we had to get this configuration supported thermally.
We worked to have the thermal validation for the server re-qualified. It then came to our attention that the original thermal qualification for the node only supported temperatures up to 27C! We explained to the vendor that for this to be a viable product, we needed to support an environment as hot as 35C (10C – 35C at up to 10,000 ft elevation is the standard operating temperature range for servers/appliances, per ASHRAE Class A2). Luckily, the work here just involved a couple of meetings, changing the power supply model to one with more efficient cooling, and testing everything in the vendor’s thermal chamber, resulting in a qualified platform with the CPU and drive count we desired!
15mm vs. 7mm U.2 Devices
At some point, we were thinking of utilizing 7mm NVMe U.2 drives instead of 15mm NVMe drives. This would allow us to use a higher quantity of NVMe drives in a 1U or 2U platform. This idea was quickly put to rest when planning with our preferred drive vendors. The 7mm format seemed to be going away, or not being given 1st tier support. Not wanting to design ourselves into a difficult supply chain and unsustainable situation, we decided that 15mm U.2 devices would be the better storage device choice.
Kernel 4.13 selection
Next came time for putting on our software. Wait, that’s not the next step; the next step was just to see if we could get our vanilla OS to boot on the platform. We knew that with this leading-edge technology we would need SW/kernel features that might not be present in our current OS/kernel combination. Qumulo runs on Ubuntu 16.04, whose base kernel version is 4.4. At the time, this kernel did not have the features needed to handle NVMe hotswap. So we moved to the Ubuntu hardware enablement kernel 4.8. This kernel had the features we needed, but also an incompatibility with a video driver that caused issues using a crash cart or KVM, and could even prevent the server from booting at all. We discussed for some time whether to blacklist the video driver, but the loss of functionality would have been too great. We performed another hop, to the Ubuntu hardware enablement kernel 4.13. Ideally, we would move the whole product line to the same kernel, but some packages required for other platforms we ship would not compile, so we split our product base between kernels. Finally, a kernel that seemed to have the feature set we needed, with no glaring drawbacks discovered so far …
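Splitting the product base between kernels implies checking, per platform, that the running kernel is new enough. As an illustration only (not our actual tooling), that kind of preflight check is simple to sketch:

```python
# Illustrative preflight check (not Qumulo's real tooling): refuse to proceed
# on a kernel older than the one carrying the NVMe hotswap support we rely on.
import platform

MIN_KERNEL = (4, 13)  # the Ubuntu HWE kernel we settled on

def kernel_ok(release: str, minimum: tuple = MIN_KERNEL) -> bool:
    """True if a release string like '4.13.0-45-generic' meets the minimum."""
    parts = release.split(".")
    major = int(parts[0])
    minor = int(parts[1].split("-")[0])  # strip suffixes like '13-rc1'
    return (major, minor) >= minimum

if __name__ == "__main__":
    rel = platform.release()
    status = "OK" if kernel_ok(rel) else "too old for NVMe hotswap"
    print(f"kernel {rel}: {status}")
```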
Hotswap testing yields system fragility
At this point in the validation cycle, we had gotten the system to a base level of functionality. One of the major validation points for any hardware platform is hotswap or hotplug testing. Devices advertised as hotswappable must not cause too great a droop or bump on the power lines as they are removed and inserted under various power conditions. They must also work with the host software and remain functional after a hotswap, without causing issues in the software. During testing, things got rocky. This topic was the largest obstacle/mountain we faced during the validation of this product, so we’ll be discussing it for a while.
During hotswap testing we identified a slew of problems which would prevent us from shipping the platform. These were all problems around how the hardware and software were interacting. Some of the failure modes which would manifest are listed below:
- Upon removal and reinsertion of an NVMe drive, the software was unable to detect the reintroduction of the drive. The slot itself seemed to be disabled, so even installing another drive did not result in the system noticing it. Hence the system would be down a drive until a power cycle.
- We found an instance where removal of a drive did not result in the device being removed under Linux, after which all calls to the device would hang until the node was power cycled.
These issues prevented us from shipping the product.
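A simplified sketch of the re-enumeration check behind this kind of hotswap test: after a drive is reinserted, the kernel should surface it again under `/sys/class/nvme` within a deadline, which the first failure mode above never satisfied. The real harness is far more involved; the `sysfs` parameter here just makes the helper testable.

```python
# Simplified hotswap re-enumeration check (illustrative, not our real harness).
import os
import time

def wait_for_nvme_count(expected: int, timeout_s: float = 30.0,
                        sysfs: str = "/sys/class/nvme") -> bool:
    """Poll sysfs until at least `expected` NVMe controllers are visible."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            present = len(os.listdir(sysfs))
        except FileNotFoundError:
            present = 0  # sysfs node absent: treat as no controllers yet
        if present >= expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(0.5)
```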
To correct this behavior we needed to use a BIOS feature called “Volume Management Device,” or VMD for short, which provides stability and LED management control for hotswap. We worked with the vendor to enable this feature in the BIOS. This small change prompted us to take a later version of lspci that had been released to handle the new address lengths of devices surfaced by enabling VMD; until taking that fix, lspci was non-functional. Another small bump during this journey.
Great, so now we’re supposed to get the drive fragility issues fixed, plus LED control of the NVMe devices. Did I mention the LED control didn’t work at all without VMD? We also had to modify the switch firmware being loaded to enable that LED control of the NVMe devices.
We then found that drive performance under our software was reduced by 60 percent in some workloads. Did we mention some of the aforementioned drive fragility issues were still present? This was not an acceptable state for a solution we wanted to deliver to our customers. Working with our vendor – performing a variety of data collection and experimentation for them, and having them come on site to work shoulder-to-shoulder with us – we identified several kernel patches to backport to address the drive fragility and performance issues. Lastly, we found that the switch driver loaded by default caused a performance regression in the switch; working with the vendor, we determined it was unneeded for the platform.
While this problem and its resolution fit into a couple of paragraphs above, this was a several-month investigation, with many discussions with our vendor and many experiments and tests conducted to validate a solution.
Early firmware causes higher-than-expected component death
This was another pebble in our shoe while all these investigations were ongoing. We had a couple of drives die on us. There is always some expectation of infant mortality with any component, but the quantity of failures we experienced made us suspicious. It turned out that the vendor was already aware of the issue, and when we brought it up, a firmware fix was available for the component. Lesson learned: always ask for the latest firmware, upfront and often.
Second-sourcing components goes awry
During bring-up, one always wants multiple sources for a component if possible. In this instance we attempted to qualify a second vendor for our U.2 NVMe devices. This led to an investigation where the drives would fail to re-negotiate with the platform upon removal and reinsertion. This may sound similar to the issues described above, but it was a completely different problem. Capturing a bus trace and error codes from the drives, we found that the root port was sending commands to a drive after a hotswap with an incompatible PCIe payload size. The motherboard was the culprit in this instance. This took a while to identify, as we suspected the drive and the switch before figuring out the motherboard was breaking the PCIe specification. In the end we identified a Linux GRUB parameter we could set to force the software to do the right thing and make this second-source component viable. Success! Then the performance turned out to be such a regression from our primary source that we couldn’t use the drives! Bummer! Another lesson learned: before diving deep on an issue, make sure there isn’t an easier way to identify a stop-ship problem before spending too many cycles.
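For illustration, a hypothetical check that would have surfaced this class of bug sooner: scan `lspci -vv` output for devices whose negotiated MaxPayload (in the DevCtl register dump) exceeds what they advertise as supported (DevCap). Pairing DevCap/DevCtl entries by order is a simplification of real lspci output, but the idea holds.

```python
# Hypothetical MaxPayload sanity check over `lspci -vv` text (illustrative).
import re

def max_payload_mismatches(lspci_vv: str):
    """Return (supported, negotiated) pairs where a larger MaxPayload was
    negotiated (DevCtl) than the device advertises (DevCap). Assumes each
    device's DevCap and DevCtl appear in matched order."""
    caps = [int(x) for x in
            re.findall(r"DevCap:.*?MaxPayload (\d+) bytes", lspci_vv, re.S)]
    ctls = [int(x) for x in
            re.findall(r"DevCtl:.*?MaxPayload (\d+) bytes", lspci_vv, re.S)]
    return [(cap, ctl) for cap, ctl in zip(caps, ctls) if ctl > cap]
```

Feeding this the saved `lspci -vv` dump from a failing node would flag the endpoint being driven past its advertised payload size.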
Remember earlier in this post where I talked about enabling the platform with 2 dual-port 40GbE or 100GbE NICs? When you’re building a Ferrari-like server which costs $$$$, sometimes the money you save by letting customers order a component based on need ends up costing your customer more in support, and you more in supporting multiple components. Late in the development cycle we decided to move to one NIC option, allowing a smaller testing matrix and providing both 40GbE/100GbE from a single component. Qualification and support are easier: an easy win.
The result of all of this is that we delivered an all-flash platform serving file over NFS or SMB that can deliver 3 streams of 4K uncompressed video per node. Hence the smallest viable cluster of 4 P-series 23T nodes is capable of 12 streams of 4K uncompressed video! With the hardware capable of more, our software teams continue working to pump more performance out of the fastest platform Qumulo has delivered to the world to date.
This is the highest-quality platform we’ve shipped to date. And we were able to leverage a platform whole cloth from a vendor to do so, taking advantage of another company’s engineering effort and resources and complementing them with our own validation and software. With all this performance, we also created an energy-efficient platform, drawing ~250W when idle; under the highest multi-stream workload we’ve thrown at it, we’ve recorded ~650W of power draw. Part of this power savings is due to the lack of spinning storage media, as well as sophisticated power management in the CPUs and the NVMe drives.
Due to our iterative software development cycle, we will continue to improve this product after shipping it to customers. Our software developers do not sit idle; they are constantly striving to provide the best product and the best continual-improvement experience on the planet. We hope you’ve enjoyed this journey of hardware platform development at Qumulo!
Many thanks to the editors of this three-part series: Michaela Hutfles, Luke Mulkey and Jason Sturgeon