Unlike everything in the software stack, hardware is a black box – the front door interface is the only thing that a user can observe. In some ways, this is a good thing. In software, testing can often present a challenge due to our knowledge of the implementation. In our quest for micro tests, we can sometimes lose sight of the “big picture” functionality.
In other ways, hardware being a black box can be a bad thing. We can’t fix every bug that we identify. We sometimes have to rely on vendors to patch things while we work around the observed behavior in software.
We still apply the Qumulo test ethic to hardware, but it often ends up looking different. Across our two data center labs, we host multiple nodes of every SKU we’ve ever sold. This includes small variations within a single SKU, such as two versions of a NIC, or two different SSDs. We run automated testing continuously against all of this hardware. Failures arising from this testing become sustaining work for the hardware team:
- “What happened to X version of Y NIC such that it’s now hitting twice as many TCP retransmits as it was last week?”
- “Why is VGA output black on node Z?”
- “What does this Linux kernel traceback in these syslogs mean?”
These are the kinds of sustaining challenges software engineers like myself tackle in the hardware space.
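To make the first question above concrete, here is a minimal sketch of how a retransmit regression could be flagged automatically. It is not Qumulo's actual tooling. It assumes the counters come from Linux's `/proc/net/snmp`, which exposes cumulative `OutSegs` and `RetransSegs` values on its `Tcp:` lines. The function names and the 2x threshold are hypothetical, chosen only to mirror the "twice as many" wording.

```python
def parse_tcp_counters(snmp_text: str) -> dict[str, int]:
    """Parse the two 'Tcp:' lines (header, then values) from /proc/net/snmp text."""
    tcp_lines = [
        line.split() for line in snmp_text.splitlines() if line.startswith("Tcp:")
    ]
    if len(tcp_lines) != 2:
        raise ValueError("expected one Tcp header line and one Tcp value line")
    header, values = tcp_lines
    # Skip the leading "Tcp:" token on both lines and pair names with values.
    return {name: int(value) for name, value in zip(header[1:], values[1:])}


def retransmit_rate(before: dict[str, int], after: dict[str, int]) -> float:
    """Fraction of segments retransmitted between two counter snapshots."""
    sent = after["OutSegs"] - before["OutSegs"]
    retransmitted = after["RetransSegs"] - before["RetransSegs"]
    return retransmitted / sent if sent > 0 else 0.0


def is_regression(current: float, baseline: float, factor: float = 2.0) -> bool:
    """Hypothetical rule: flag when the rate is at least `factor` times baseline."""
    return baseline > 0 and current >= factor * baseline
```

On a Linux node, the input text would come from reading `/proc/net/snmp` at the start and end of a test run, and the baseline would be last week's rate recorded for the same NIC version. A real harness would also need to account for traffic mix and link conditions before blaming the hardware.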
Outside of this work, we on the hardware team spend most of our time dreaming up and building new platforms, the best of which end up in our customers’ hands. Since Qumulo’s software runs on a variety of hardware, we are free to pick and choose components without worrying about whether they will work. If the hardware exists, Linux already supports it.
Qumulo was first to launch an all-flash product with NVMe drives. To facilitate this, we had to do a little bit of lab work to retrofit our qualification machines so that they could run power-fault tests on drives connected via NVMe. That done, we ran a few enterprise-class NVMe drives through our battery of tests. After a few days of this, we knew with confidence that NVMe would work fine. We then worked with multiple vendors to configure a server to meet the performance and price per terabyte that our customers had been asking for. After a few months of software work to optimize our backend for SSD-only nodes, we took it to NAB, where we were the only vendor on the floor demoing live, uncompressed 4K streaming.
In addition to our NVMe flash offering, last year we also brought our distributed file system to the Dell EMC PowerEdge family. Once we had the hardware online, we were able to adapt our software to support it in less than three weeks. After that, we gave the cluster to our certification team, where, as with all new platforms, it ran for a full four weeks to ensure high stability and quality for launch. During this time, we worked with our documentation team and customer success teams to ensure everything was well-documented for deployment and support, as well as to give the team some hands-on time with the hardware before asking them to support customers.
Qumulo’s architecture relies on hardware only so long as the hardware can provide a specific set of guarantees. These guarantees are what keep our customers’ data safe. We lean on Linux to give us fast and reliable access to any hardware we want. Aside from vendor integration and sustaining work, we on the hardware team spend our time bringing up whatever platforms we think will delight our customers, then handing them off to be certified and sold.