Driving research with high performance storage

https://qumulo.wistia.com/medias/edmfzyg9rq

Video Transcription

My name is Nick Rathke. I am assistant director for information technology at Scientific Computing and Imaging, which is a research institute at the University of Utah. So, a little bit about the SCI Institute: we’re one of eight permanent research institutes that are part of the University of Utah. We’re home to over 200 students, staff, and faculty. We have 20 tenure-track faculty drawn primarily from the School of Computing, Department of Bioengineering, Department of Mathematics, and Electrical and Computer Engineering, and virtually all of our faculty also have appointments in other departments.

So, we’re a very cross-disciplinary group. We do a lot of everything, right. So, one of our claims to fame is our open source packages in scientific computing, and the leading one of those is called “SCIRun,” which is what we call a problem solving environment. We also do packages called “Seg3D, ImageVis3D, ShapeWorks, ViSUS, map3d.” All of this is open source. It’s all available on the SCI website if you ever want to check it out. I mean, that’s sort of our bread and butter, doing this research and image processing.

So, a little bit about our pipeline as far as where our data comes from and what the packages do. So, we have image-based modeling, which includes data acquisition. So, we take electron microscopy images, MRI images, any sort of medical imaging for the most part. Then we have all these packages like Seg3D and Cleaver, and ShapeWorks that deal with that. As we go around the circle, this is sort of our workflow, and then we see some of our rendering applications like ImageVis3D and FluoRender that deal with GPUs and volume rendering.

So, we deal with a lot of different image types and, of course, in order to deal with all these image types, we have a lot of data. So, with all these different segments, working in both software development and also in data analysis means we have a large amount of really unstructured data. So, part of our problem is that this ranges from the very small of “HelloWorld” as a C++ program, all the way up to rabbit retina data sets as volume renderings, which is almost a four terabyte data set. That’s a single file size. So, we run everything from very small to very large. And in the process of doing this, we also generate a lot of temporary data, which is a huge problem to kind of keep around and keep track of. You know, when you’re dealing with these big image slices, to do what we do with it, there’s a lot of intermediary data that gets generated out of that.

So, this is kind of our layout of how we structure storage. In our world and probably in a lot of other environments, second only to your network is probably storage. It’s probably one of the most important things that you have in your environment. So, virtually, everything at SCI ties into our storage system. We have two separate Qumulo clusters. We have one at the bottom, which is our QC208s. We have four of those and we have seven of the QC24s, which is in the middle there. And that’s split out for a very particular reason is that the 208s on the bottom are sort of our overhead and general storage, whereas the QC24s are less expensive and it’s easier to write into a grant and put into funding. So, there’s a real reason why we split this out that way.

So, for clients, we run pretty much a wide variety of clients that all access this storage. Everything from 200-plus desktops, our web servers and web services, our email servers, we have a large tape backup library that backs up all the data across these two clusters. We have large-scale shared memory systems that run in the 160 CPU core range. We have a lot of little specialty systems that faculty bring in to do specific projects, and then we have two clusters, 96 total nodes split between two different clusters, one CPU only and one GPU cluster.

There’s a little line that runs over to the cheap and not so cheerful storage at the top. We don’t allow our users to actually write their temporary data directly to the Qumulo system, and that’s because it’s all temporary data. The last thing we want is to pollute our big storage and our nice storage system with a bunch of files that the next day are just going to get overwritten, or that somebody is going to forget about and then are going to sit there for two years and nobody is ever going to clean up. So, we write our temporary data to a smaller system and then all of the final data actually ends up on our Qumulo system, where it then can be backed up and people can do whatever the next step of their project is with that. To spread all these different connections across the cluster, we pretty much cycle through on our network with just a poor man’s DNS and do a DNS round robin, and that’s worked out pretty well for us over the years.
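The round-robin idea mentioned above can be sketched in a few lines of Python. This is only an illustration, not SCI's actual configuration: the node addresses and the `make_round_robin` helper are hypothetical, and in real DNS round robin the rotation happens in the name server, which varies the order of the address records it returns for one hostname.

```python
from itertools import cycle

# Hypothetical addresses standing in for storage cluster nodes
# (taken from the reserved documentation range, not real hosts).
NODE_ADDRESSES = ["192.0.2.11", "192.0.2.12", "192.0.2.13", "192.0.2.14"]


def make_round_robin(addresses):
    """Return a function that hands out addresses in rotating order."""
    rotation = cycle(addresses)

    def next_node():
        return next(rotation)

    return next_node


next_node = make_round_robin(NODE_ADDRESSES)

# Six clients connecting one after another get spread across the four nodes,
# wrapping back to the first node after the last one.
assignments = [next_node() for _ in range(6)]
```

The appeal of this approach is that clients need no special software; the trade-off is that the rotation knows nothing about how busy each node actually is.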

So, this is a little bit more of a layout. It’s a little hard to see on that screen, but we have a 10-gig switch in between all of our clients and our Qumulo storage. And we export all of our storage out to Windows boxes, Linux, and our Macs. We’re 90% Linux and Mac. We have a few Windows boxes that we do SMB for. The really great thing for us was that when we got Qumulo, we were able to end-of-life all of our Active Directory servers. We’re not Windows folks, so getting rid of our Active Directory was a great thing for us. Then, as for the way that we interact with the Qumulo file system, we’ve also dedicated two of our Linux boxes as control nodes that have multiple 10-gig interfaces, and we use those for data migration and to move data back and forth across multiple systems, and those are all via NFS. We also use our Linux boxes to run any management scripts through the Qumulo API, on top of there, and that’s worked out really well for us and been pretty efficient.

So, of course, we’ve got all this great storage, but data abhors a vacuum. So, if there is free space, it’s going to get filled. Especially if you work in higher education and work with graduate students, they will just fill it until it overflows, right? So, anybody who works with grad students knows this. So, this is kind of interesting because the steep line on the right hand side is when we deleted our temporary data at the beginning of the year and were beginning to move all of our data into production, which is right around this time last year. So, you know, of course, storage is in fact a limited resource. So, knowing where your data is going is really important not only for what you’re doing today, but for capacity planning in the future, so that if you’re like us and are federally funded, and have to work through a grant cycle, you know how far in advance you need to start scheduling and trying to fit storage into your research budget.
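As a rough illustration of the capacity-planning arithmetic described above, the sketch below estimates how many weeks remain before a cluster fills up, given weekly usage samples. It is an assumption-laden example: the `weeks_until_full` function and all of the numbers are made up, and it uses a simple average growth rate rather than whatever trend analysis the Qumulo system itself provides.

```python
def weeks_until_full(usage_tb, capacity_tb):
    """Estimate weeks until capacity is reached from evenly spaced weekly samples.

    Uses the average weekly growth between the first and last sample.
    Returns None if usage is flat or shrinking.
    """
    if len(usage_tb) < 2:
        raise ValueError("need at least two samples")
    growth_per_week = (usage_tb[-1] - usage_tb[0]) / (len(usage_tb) - 1)
    if growth_per_week <= 0:
        return None
    remaining_tb = capacity_tb - usage_tb[-1]
    return max(remaining_tb, 0) / growth_per_week


# Made-up weekly usage samples in terabytes, for illustration only.
samples = [100.0, 104.0, 108.0, 112.0, 116.0]
estimate = weeks_until_full(samples, capacity_tb=200.0)
```

With these made-up samples the growth is 4 TB per week and 84 TB remain, so the estimate is 21 weeks. Comparing a number like that with the length of a grant cycle shows how early a storage request has to go into the budget.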

So, this is off of our QC208s, and we can see on the left hand side, in a little highlighted column, a user has written a bunch of data, and below there we can see exactly which path that user has written that to. It’s about five terabytes of data, and it’s really easy to figure out who wrote that and where that data is going just by simply clicking on one bar in a graph. The goal, of course, is to help researchers keep doing the research. You never want to tell a researcher, “I’m sorry, we don’t have space on our system.” So, capacity planning is kind of an important part of what happens. And we can see this at different resolutions. This is 52 weeks. We can also see this at a 30-day resolution or a 72-hour resolution, as far as what the Qumulo system provides by default. With Qumulo’s APIs, we can also dump all this data off to a separate system for additional analysis later on.

Okay, so, part of capacity planning, the next part of that as far as keeping your research going, is capacity containment. For us, being able to figure out where the data is and how much is being used in real time is pretty important. So, this shows a project that’s about 30 terabytes called “Neuro.” One of the reasons that this project…it’s a project that involves deep neurostimulation for, like, epilepsy and seizure conditions. So, they’re doing a lot of neurostimulations. They have about the worst case of data IO you can possibly imagine doing this project. So, it runs on 10 Xeon Phi servers with 256 threads each. Each one of those threads, so roughly 2,600 threads, writes out one file that has an int16 value in it, and then they run that all day long. And their performance was atrocious on it because they’re writing one file with one number in it.
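To make the cost of that pattern concrete: an int16 is two bytes of payload, yet every file also costs a create, an open, a close, and a metadata update on the storage system. One common way around it, sketched here purely as an illustration and not as the research group's actual scripts, is to collect the values in memory and write them to a single packed binary file. The function names and the sample values are hypothetical.

```python
import os
import tempfile
from array import array


def write_batched(values, path):
    """Write all int16 values to one binary file instead of one file per value."""
    data = array("h", values)  # "h" is a signed 16-bit integer
    with open(path, "wb") as f:
        data.tofile(f)
    return len(data) * data.itemsize  # bytes written


def read_batched(path):
    """Read back every int16 value stored in the packed file."""
    count = os.path.getsize(path) // array("h").itemsize
    data = array("h")
    with open(path, "rb") as f:
        data.fromfile(f, count)
    return list(data)


# Stand-in values; in the talk's scenario each thread would contribute one value.
values = list(range(-5, 5))
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "batch.bin")
    bytes_written = write_batched(values, path)
    round_trip = read_batched(path)
```

Ten values become one 20-byte file rather than ten separate files, so the storage system handles one set of metadata operations instead of ten.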

And now, thanks to this system, actually, and some other performance tuning that we did, they’ve now written some Python scripts around that to sort of help mitigate some of their IO issues. But yeah, one small file with one number in it, about 2,500 at a time, is pretty brutal. So, of course, part of this capacity containment is we have these great little scripts that every once in a while go through and they say, “Hey, you’re reaching your quota.” So, we wrote all these scripts with the Qumulo API. We’ve assigned people quotas, and now not only do we know what’s going on on the IT side, but we can send out these notices every once in a while to let our users know and say, “Hey, you’re in some sort of a quota violation state.” And we database and track all of this because the faculty want to know how their users are using their data, and it’s easy for us to report on that and generate and write reports that are project specific and group specific within our environment.
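The quota checks described above might look something like the following sketch. It deliberately does not call the Qumulo API: the usage and quota figures are plain dictionaries with made-up users and numbers, and `quota_warnings` and its `warn_fraction` threshold are hypothetical names chosen for illustration. In practice the usage figures would come from the storage system and the messages would go out by email.

```python
def quota_warnings(usage_tb, quota_tb, warn_fraction=0.9):
    """Return a sorted list of (user, message) for users near or over quota."""
    warnings = []
    for user, used in usage_tb.items():
        limit = quota_tb.get(user)
        if limit is None:
            continue  # no quota assigned to this user
        if used > limit:
            warnings.append((user, f"over quota: {used:.1f} of {limit:.1f} TB"))
        elif used >= warn_fraction * limit:
            warnings.append(
                (user, f"approaching quota: {used:.1f} of {limit:.1f} TB")
            )
    return sorted(warnings)


# Made-up users and numbers, in terabytes.
usage = {"alice": 4.6, "bob": 1.0, "carol": 5.2}
quota = {"alice": 5.0, "bob": 5.0, "carol": 5.0}
report = quota_warnings(usage, quota)
```

Keeping the check separate from the data source makes it easy to store each run's results in a database and build the per-project and per-group reports on top of them.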

So, of course, performance. We’re talking about a high performance system here. So, performance is, you know, pretty important after all, because if the system isn’t performing well, it doesn’t matter how much storage you have, right, if nobody can use it. If you can’t use your high performance storage, it’s zero. So, understanding and watching these metrics is a pretty important part of what we do. The top one there, you can see, writes about 187 megs a second. It’s not doing so many reads. It’s kind of interesting because I took the top one at 9:00 a.m. So, it’s kind of nice to see that some grad students don’t actually sleep until noon, but ultimately, what do these numbers mean, right? What does 120 megabytes or 187 meg mean to a faculty researcher, right? What that means is ultimately for them it’s time to insight. How long does it take for me to get my results back, right?

And for good performance, you know, these are great numbers, but it needs to translate into something that’s more meaningful than just a number for people. If you have poor performance, then what happens is you have angry customers and angry faculty, and generally, angry faculty don’t stay angry in their own office. They come to my office and they’re angry in my office, which is generally not a good thing. One of the other interesting things, and I just added this one this morning, is this spike in the bottom corner here. I saw this spike this morning when I was looking at our system and I said, “Well, what is that spike?”

And with the Qumulo system, it’s actually really easy to figure out. That spike there, this morning, was actually the same user who did the big spike in the previous…I’ll show you. It’s actually that white line right there. It’s the same user who wrote that data, and it took me about 30 seconds to figure that out this morning, that it’s the same user. And what the user is doing is…I’m going to read this right off this so I get this right. He’s working on the “2018 IEEE SciVis Contest, dedicated to the visualization and analysis of deep water asteroid impacts.” So, you know, considering that that’s not for a year, the student’s already working well ahead of his schedule, which is also kind of unusual for a student to be doing.

All right, so more on performance. The Qumulo system gives us a really good way to analyze client performance. You can see one of the top green ones up there is doing a lot of reads. We’ve got some systems doing writes. The nice thing is we can drill down and very quickly see exactly which directories and where all that data is going. This really helps us…when you’re dealing with 200 desktops and numerous other systems, figuring out where some system is writing to is pretty critical. This has actually led to some interesting phone conversations with Qumulo support because we weren’t used to seeing very large numbers. On our old system, we couldn’t achieve anywhere near the performance that we do on Qumulo.

And it’s actually been kind of funny because I’ve actually called Qumulo support in a panic going, “I’m seeing a client and it’s doing, you know, 30 meg to 40 meg a second. What do I do? It’s going to kill my cluster.” They’re like, “Oh, it’s fine. Don’t worry about it.” So, after a couple of, you know, false starts when we started seeing performance numbers that we just simply weren’t used to, now we’re much more used to it and we don’t have the support challenges that we did with calling them unnecessarily.

So, of course, then also part of performance, your overall performance, is uptime. Over the last year, we’ve had four-and-a-half nines on the system. We’ve had roughly 28 minutes of downtime in the last year, and that was just to do patches and minor upgrades. So, from an uptime performance standpoint, you know, for us, it’s been incredibly stable. So, one thing about uptime, which is not such a funny story: because of a minor flaw in our data center, we have 120 minutes of runtime, but if we lose power we have no cooling. Our cooling shuts down, and the room will go from 68 degrees to 100 degrees in about 20 minutes.

Now, when we were in testing with Qumulo a year ago and we were just in the process of building out our production environment, that happened twice in one day from two different power outages in the Salt Lake Valley that were completely unrelated. Normally, what happens with our big clusters is we have automated scripts that shut everything down. With the Qumulo cluster, we didn’t have that yet. So, I literally went into the data center and yanked all the power cords out. Now, this is absolutely something I do not recommend you ever, ever do, but I did it twice in one day to the Qumulo system. Out of all the storage systems and all the disk arrays that we have, Qumulo is the only one that came back without any flaws, no data corruption, no disk losses. It was the only system that handled that. On our primary core set file system, which is from another vendor, we lost 6 drives in 15 nodes. Our cheap and cheerful system was a complete loss. It lost every single drive.

So that, I think, was an interesting test, not something I would recommend you do, but we were able to do it. So, of course, because today we have this capacity planning and we can now see all this data, we now know where we have to go and how we have to plan this out. This is sort of our migration plan over the next five years. In year one, we’re going to add another QC24, which will be next year. The following year, we’ll add two more. Then year three, we’ll add one. At that point, we’re going to upgrade our network from 10-gig to 40-gig, and then we’re going to add another QC208. So, that’s kind of our migration plan and, of course, because it’s a scale-out solution, we can literally add all those in with zero downtime. We plug them in to the network, give it an IP, and it’s up and going, which is great for us.