One of the things we’re proudest of at Qumulo is the relationship we have with our customers. We support them with a dedicated Customer Success team that communicates over Slack and on the phone, and uses cloud-based monitoring to resolve any issues customers have efficiently and as swiftly as possible. Customers are our magnetic field – at the heart of everything we do.
At Qumulo, creating, storing and building with data is our superpower, and that means we measure everything, including how well we help our customers solve their problems with ease. Across our industry, that metric is known as a customer satisfaction Net Promoter Score (NPS), and Qumulo’s NPS reached 91 in the last quarter of our fiscal year. Perhaps even more impressive is that our customer satisfaction rating continues to trend up as our customer base grows!
Now, I’m an engineer, and I didn’t recognize that term “Customer Success” when I joined the company a few years ago. Today, I understand it’s like “Customer Support” on steroids: proactive, solution-oriented and dedicated to ensuring the customer is truly successful in using our file data platform to achieve their goals.
Investigating Issues with Cloud-Based Monitoring
How does Qumulo’s Customer Success (CS) team solve thorny issues in the field so quickly? Well, many of our customers have enabled Cloud-Based Monitoring or “Mission Qontrol” (we have a thing for the letter Q here), which is a phone-home feature that sends a myriad of system health metrics to our data-analytics system. Internally, our CS team can pull up charts of a customer cluster’s health metrics to get really detailed insight into the behavior of our system, which is designed to manage a lot of complexity for our customers.
To visualize the health metrics data, we use an open-source application called Grafana, which can pull from an assortment of data sources. In-house, we design the data pipeline that collects all the health metrics data from our customer clusters, applies appropriate transformations along the way, and stores the results securely in a database.
Case in Point — Seeing the Problem
Recently, a biomedical research customer upgraded its Qumulo cluster, and a few days later the data administrators noticed that they had hit a limit on existing filesystem snapshots. We have a high limit on the number of snapshots, just to ensure some process isn’t getting out of hand—and indeed, here it was. But why was that? After all, the customer was using snapshots in a routine way—as part of our replication feature, which creates and deletes snapshots automatically, on a 1-minute cadence. Clearly, this was something that needed more investigating.
Using our Mission Qontrol Cloud-Based Monitoring dashboard, the CS investigators were quickly able to confirm the product was at its limit for snapshots, then identify that CPU usage was really high on a single node. In this instance, an extraordinary number of “set permissions” (setattr) operations were coming into that node. The customer was also able to see that snapshot cleanup operations were taking longer than usual.
With all of that in mind, they understood that setattr operations were rapidly creating a lot of backlogged work for the snapshot cleanup to do, and causing snapshots to slowly accumulate. The monitoring system holds thousands of health metrics for each node, yet the investigators were able to navigate through it all easily, through data visualization as shown in Figures 1 – 4.
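The reasoning above comes down to simple rate arithmetic: if cleaning up one snapshot takes longer than the interval at which replication creates the next one, the backlog grows until it reaches the limit. The sketch below is a toy model of that effect, not Qumulo’s actual snapshot logic. The function name, the cleanup durations and the limit are all hypothetical values chosen for illustration.

```python
def simulate_backlog(minutes, cleanup_minutes_per_snapshot, limit, create_interval=1):
    """Toy model: return the number of snapshots awaiting cleanup after each minute.

    All parameters are illustrative assumptions, not measured values.
    The simulation stops early once the backlog reaches `limit`.
    """
    pending = 0        # snapshots created but not yet cleaned up
    progress = 0.0     # fraction of the current cleanup that is finished
    history = []
    for minute in range(1, minutes + 1):
        # Replication creates a new snapshot on a fixed cadence.
        if minute % create_interval == 0:
            pending += 1
        # The cleanup worker advances by one minute of work.
        if pending > 0:
            progress += 1.0 / cleanup_minutes_per_snapshot
            while progress >= 1.0 and pending > 0:
                pending -= 1
                progress -= 1.0
        if pending == 0:
            progress = 0.0  # an idle worker does not bank spare capacity
        history.append(pending)
        if pending >= limit:
            break
    return history


# Healthy case: cleanup (0.5 min) is faster than the 1-minute cadence.
healthy = simulate_backlog(60, cleanup_minutes_per_snapshot=0.5, limit=1000)

# Slowed case: cleanup takes 2 minutes, so one extra snapshot piles up every 2 minutes.
slowed = simulate_backlog(60, cleanup_minutes_per_snapshot=2.0, limit=1000)
```

In the healthy case the backlog never leaves zero. In the slowed case it climbs steadily, which matches the slow accumulation the investigators saw in the charts.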
How do we collect all this data on system health metrics?
When we launched with our first customers back in 2013, we knew that responsiveness to customer problems would be key to our success, so we built a rough-and-ready system with key customer stats and alerts. Since then, our cloud monitoring has grown much smarter. We’ve expanded reporting to over 10,000 different health metrics, tracked per node and sometimes even per disk.
In the last year, we’ve continued to invest in this architecture by breaking up the service into several components, each with its own focus: a webserver to catch the incoming metrics, a distributed queueing system to buffer them and manage the fanout to many internal consumers, and a good analytical database to house the data and make it easy for the investigators to query.
Today’s Mission Qontrol cloud-based monitoring architecture uses distributed queueing to support data analytics efficiently, decoupling the data consumers from each other and from the customers’ production systems.
For the queueing system, we chose RabbitMQ because it was easy to use, had the functionality we needed with a friendly API, and seemed to have a wide, satisfied user community. We’ve been running it for about a year now and have found it to be very reliable.
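To make the fanout idea concrete, here is a minimal in-process sketch using only Python’s standard library. It is a stand-in for illustration, not our production setup and not the RabbitMQ API: the FanoutBroker class, the consumer names, the metric fields and the 90 percent CPU threshold are all hypothetical. The point it shows is that each consumer reads from its own queue, so a slow or failed consumer does not block the others or the producer.

```python
import queue
import threading


class FanoutBroker:
    """Hypothetical broker that copies each published message to every subscriber's queue."""

    def __init__(self):
        self._queues = {}

    def subscribe(self, name):
        # Each consumer gets a private queue, which is what decouples consumers.
        q = queue.Queue()
        self._queues[name] = q
        return q

    def publish(self, message):
        for q in self._queues.values():
            q.put(message)


def consume(q, handle):
    """Drain one queue until the None sentinel arrives."""
    while True:
        message = q.get()
        if message is None:
            break
        handle(message)


def demo():
    broker = FanoutBroker()
    stored = []      # stand-in for the analytical database
    hot_nodes = []   # stand-in for an alerting consumer

    def flag_hot(metric):
        if metric["cpu_pct"] > 90:  # illustrative threshold
            hot_nodes.append(metric["node"])

    # Subscriptions are created here, before anything is published.
    workers = [
        threading.Thread(target=consume, args=(broker.subscribe("archive"), stored.append)),
        threading.Thread(target=consume, args=(broker.subscribe("alerts"), flag_hot)),
    ]
    for worker in workers:
        worker.start()

    # Made-up sample metrics, as the webserver might hand them to the queue.
    for metric in [{"node": 1, "cpu_pct": 35}, {"node": 2, "cpu_pct": 97}]:
        broker.publish(metric)
    broker.publish(None)  # tell every consumer to stop

    for worker in workers:
        worker.join()
    return stored, hot_nodes
```

Calling demo() returns both metrics from the archive consumer and only node 2 from the alerting consumer. In a real deployment the broker would be a separate service such as RabbitMQ, so the consumers could also run on different machines.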
With this flywheel in the middle, spinning out the data to all the data consumers, we can do so many things. We can look at specific customer problems, like that of the research institution we talked about earlier; we can do aggregate analysis with the billions of files stored in our clusters; we can evaluate how well new features are performing for our customers and identify further improvements we should deliver; and we can study how the usage of different product features has shifted over time.
And where is all this data stored? On Qumulo, of course. We actually have two Qumulo clusters, one in our data center and one namespace in the cloud, so we’re taking full advantage of the power of the Qumulo File Data Platform, and of course, “eating our own dog food.”