Analytics practices for streamlining research and infrastructure operations

The Institute for Health Metrics and Evaluation shows how the native Qumulo analytics give insight into their research data. They also go a step further, showing how they use the Qumulo API to create their own comprehensive dashboard to report on their entire Qumulo fleet.

https://qumulo.wistia.com/medias/dmyii7yl46

Transcription of webinar

Hi. My name is Felix Russell. I’m with the Institute for Health Metrics and Evaluation at the University of Washington in Seattle. And I’m here to tell you about the Analytics Practices for Streamlining Research and Infrastructure Operations that I’ve encountered in working in large HPC in conjunction with Qumulo. So a little primer on our organization: we’re funded through a combination of public and private grants. The great majority of our money is actually provided through the Bill & Melinda Gates Foundation, who have been very helpful in our inception and continued growth.

Our goals are to evaluate mortality and risk factor metrics for a variety of diseases and life-year adjustment causes, so finding out what will subtract the most from your life if you contract it is a quick way of thinking about it, aggregating health metrics from a variety of academic sources. And we also do healthcare service effectiveness evaluation: we find out if, for example, a country’s nationalized healthcare service is effective in doing its job for the amount of money that it’s paying to provide healthcare to its citizens. So, that dovetails nicely into our clients. Our clients are large philanthropic organizations who use the data we provide, large government health ministries, and academic institutions as well. We write academic papers and we are very, very heavily cited, which we are proud of.

At the end of the day, our products are visualizations and academic papers. So at IHME, we use a variety of software tools for modeling; that’s not going to be the focus of my presentation at the moment. I’m on the infrastructure team and I’m focusing on the back end: how to get the researchers the tools they need to succeed in their modeling and geospatial activities that make these nice, pretty graphs and visualizations that you see here on the right. The build pipelines that are used by other teams and by us are Luigi, Jenkins and GoCD. We use a variety of database products to back our visualizations and our transformation pipelines inside HPC. We use Percona and MariaDB products, as well as some really standard SQL and Postgres.

For web, our products are visualized using HTML frameworks that are a combination of home-grown and open source. So, at the Institute for Health Metrics, we have a large pool of hardware that is split into several clusters to help us achieve our modeling objectives. Across our clusters, we have 500 heterogeneous x86 compute nodes, which come out to about 25,000 cores across generations and across architectures from AMD and Intel, and roughly 150 terabytes of memory at our disposal.

So, Qumulo has a great history with our organization. They’ve provided us four clusters that we have deployed across two datacenters. We have a speed tier consisting of 158 terabytes on our QC24 platform, that’s the 1U platform from Qumulo, comprised of 11 nodes. And for the scratch tier, we have about three petabytes of QC208 nodes. There are 21 of those, and they are providing the vast bulk of the scratch storage needs. We’ve had a good experience with Qumulo. They have a great history of proven fault tolerance in the face of failures and large loads. The upgrades are frequent and painless. The snapshot policy enforcement is robust and it’s easy to do even for an end user, and we like that because it lets us give the task of retrieving snapshotted data to the end user, and not have to deal with it on our infrastructure or DevOps team.

The customer service has been excellent. We have a great relationship with the team. The customer-facing and engineering teams at Qumulo have been gracious with their time and their effort even at non-standard hours. And they, of course, provide excellent metrics and APIs for interacting with the cluster and seeing what it is doing. So, the Native Qumulo Metrics are gonna be the focus of this slide, because I’m going to be comparing what they do to what you can do with the API. The Native Cluster dashboards, which are at the main web address you log in to in order to manage the cluster, the web GUI, show basic time-series information with throughput and IOPS, obviously, as well as current hotspot data to see what files are currently being written to or read from the most.

The DataViz labs, not shown here on the right, are a convenient feature that Qumulo is working on currently that will show you aggregated cluster information as well as deeper historic trend information. So, at the Institute for Health Metrics, we have very disparate monitoring and logging tools, and they all have different roles, and we are attempting to converge on one solution. We’ve decided that Elasticsearch, the ELK Stack, is desirable, and it’s great because it is in active development. It is good at alerting, and it is quick at searching due to its Lucene back end, and it’s easy to orchestrate its creation and scaling using Rancher, which is what you see here on the top right. The dashboards, which are displayed in Kibana, are right below it, and that is the dashboard, for example, displaying the configuration management suite, Salt, and its errors across our environment so we can improve our configuration management. It is a powerful tool for graphing and aggregating lots and lots of data.

ELK Stack is very good in its fault tolerance, its throughput and document volume. The query times are very fast. You can age out your old documents and you can depend on it very well. And to be coy, I included the downside, that it is too good and that’s addicting. In our environment, Elasticsearch provides syslog aggregation and search, which is very convenient for sensing patterns and for finding log entries very quickly. It provides host and host-group metrics from top data as well as sar data, and lets us look at our HPC scheduler across time with slots free. And now, we are supporting the ingestion of metrics from our scratch clusters.

So, this desire for convergence and all the information in one place spawned a project called Qumulo-analytics-elasticsearch. And it allows us to take the data from the Qumulo clusters at our disposal and aggregate all that data in one place. And it gives us cross-cluster aggregation metrics, per-client and per-path performance, hotspots, capacity trend tracking, and it gives us our own definition of how long we want to retain the data and how accurate, or what the interval is, of the data we want to retain as it ages out. It gives us a lot of flexibility in our monitoring. It’s very nice because this project here is available, it’s on GitHub. It’s open source. It’s a little Python application, and it’s very easy to get up and monitoring even without a production-scale Elasticsearch cluster.
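
The idea of choosing your own retention period and sample interval as data ages out can be sketched in a few lines of Python. This is only an illustration, not the project's code: the function names, the `(timestamp, value)` sample shape, and the two-tier policy (full resolution for recent data, averaged buckets for older data, nothing beyond a maximum age) are all assumptions made for the example.

```python
from collections import defaultdict


def downsample(samples, interval_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each sample to the start of its bucket.
        bucket_start = ts - (ts % interval_seconds)
        buckets[bucket_start].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


def apply_retention(samples, now, fine_window, coarse_interval, max_age):
    """Hypothetical two-tier policy.

    - samples no older than `fine_window` seconds are kept as-is
    - older samples, up to `max_age` seconds, are averaged per `coarse_interval`
    - anything older than `max_age` is dropped
    """
    recent = [(ts, v) for ts, v in samples if now - ts <= fine_window]
    older = [(ts, v) for ts, v in samples if fine_window < now - ts <= max_age]
    return downsample(older, coarse_interval) + sorted(recent)
```

With a policy like this, the most recent window keeps every sample for drilling down, while older history shrinks to one averaged point per interval.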

You can use a small dockerized deployment of the ELK Stack and point your Qumulos, your Qumulo clusters’ API-logged data, into your laptop, for example, to test it out. The website is right at the bottom here. And in the spirit of that, I’m going to show you a brief demo of what that looks like in action. Up here, we have the largest paths in a given cluster, separated by cluster. And over here, the largest paths are visible across clusters. So you can see the data integrated here: which researchers, for example, are the biggest offenders, who stores the most files or who stores the most data. It’s easy to spot trends across the clusters like this, or for specific clusters. Read and write throughput metrics are also trackable right here, and the more detailed file and metadata IOPS you can keep an eye on historically here. The time range of the data is easily definable right there. If you want to drill down to a more specific time, just click and drag and you will have the matching data re-rendered accordingly.

Down here, we have the throughput for the write hosts. The fact that this host name is at the top is a good sign. We have a great deal of data that is being migrated, and it means that this host is using the most traffic across all of our Qumulo clusters, consuming the most data and reading and writing the most heavily in throughput. There is that metric for write and for read, and here we have the top throughput for files. This is to see hotspots: for the currently defined time range, to see what files are being written to or read from the most actively. And this is just an example of what you can do with the data from the Qumulo-analytics-elasticsearch project.
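
The "top hosts across all clusters" panel described above boils down to summing per-host throughput over every cluster and ranking the result. A minimal sketch with Python's `collections.Counter` follows; the sample shape and the field names `host` and `bytes_per_sec` are hypothetical, chosen only for the illustration, and in the demo itself this ranking is done by the dashboard.

```python
from collections import Counter


def top_hosts(samples_by_cluster, n=5):
    """Rank hosts by total throughput summed across every cluster.

    `samples_by_cluster` maps a cluster name to a list of dicts with the
    hypothetical keys "host" and "bytes_per_sec".
    """
    totals = Counter()
    for samples in samples_by_cluster.values():
        for sample in samples:
            totals[sample["host"]] += sample["bytes_per_sec"]
    # most_common returns (host, total) pairs, largest total first.
    return totals.most_common(n)
```

A host that talks to several clusters at once, such as one running a migration, rises to the top because its traffic is added up rather than reported per cluster.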

The methodology for this project was pretty simple, using the Python collections and socket libraries, and the Qumulo_api client, the REST client for Python. And elasticsearch-py, which is another REST client wrapper for Python. The data that the Qumulo_api endpoint spits out raw is easily visible here on the top right. It’s sort of very raw, unsorted data. The script is just good at reading that in and forwarding it to Elasticsearch in a, you know, more useful manner. And that concludes my presentation, and I wanna say a special thanks to my managers and to Qumulo for allowing me the time and resources necessary to make this happen. And thank you for taking the time to watch the presentation.
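
The methodology described above, reading raw, unsorted entries and forwarding them in a more useful shape, might look roughly like the sketch below. It is not the project's actual code: the raw entry keys (`ip`, `path`, `type`, `rate`), the index name, and both function names are assumptions for illustration, and the call that fetches the entries from the Qumulo API is deliberately left out. The returned dicts use the `_index`/`_source` shape that elasticsearch-py's bulk helper accepts.

```python
import socket


def resolve_hostname(ip):
    """Best-effort reverse DNS lookup; fall back to the IP on failure."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return ip


def to_bulk_actions(cluster, raw_entries, timestamp,
                    index="qumulo-throughput", resolve=resolve_hostname):
    """Turn raw per-client entries into Elasticsearch bulk actions.

    `raw_entries` is a list of dicts with the hypothetical keys
    "ip", "path", "type" and "rate".
    """
    actions = []
    # Sort busiest-first so the output is no longer arbitrary in order.
    for entry in sorted(raw_entries, key=lambda e: e["rate"], reverse=True):
        actions.append({
            "_index": index,
            "_source": {
                "@timestamp": timestamp,
                "cluster": cluster,  # tag each document for cross-cluster views
                "client": resolve(entry["ip"]),
                "path": entry["path"],
                "op": entry["type"],
                "rate": entry["rate"],
            },
        })
    return actions
```

Tagging every document with its cluster name is what makes the cross-cluster views possible, and resolving IPs to host names up front keeps the dashboards readable.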