Data Protection on QF2

Transcription of video

Hi, my name is Jason Sturgeon. I’m one of the product managers here at Qumulo, and one of the areas that I cover is data protection. Today, I’m going to take a few minutes to show you how to build a QF2 cluster, set the protection level, and walk through some failure scenarios so you can see how QF2 keeps your data available.

All right. So, here I’m going to go ahead and accept the user license agreement; this is what you would actually see on the console of a Qumulo node after hooking up a keyboard, mouse and video to that console. Here I’m going to give it a name, “big-cluster,” and you can see that as I check or uncheck nodes, it tells me exactly how much usable capacity the cluster will have. I’m going to select 12 nodes here.

The cluster’s recommending that I use two drive protection, and the data will be safe with that, but I can increase that protection, and it will tell me how much less capacity I’ll have by doing so. And now I set the password for the administrator and click, “Create Cluster.” And now, I’m going to create a cluster of two petabytes of capacity.
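To make that capacity tradeoff concrete, here’s a rough sketch under a simple erasure-coding model. The raw capacity and stripe width below are hypothetical assumptions for illustration, not Qumulo’s actual encoding:

```python
# Hypothetical numbers: usable capacity under a simple erasure-coding
# model, where each stripe holds (width - parity) data blocks.
raw_pb = 2.4        # assumed raw capacity of the 12-node cluster
stripe_width = 12   # assumed: one block per node

for parity in (2, 3):
    efficiency = (stripe_width - parity) / stripe_width
    print(f"{parity}-drive protection: {raw_pb * efficiency:.1f} PB usable")
```

Under these assumed numbers, moving from two-drive to three-drive protection trades roughly a tenth of the usable capacity for the extra resiliency, which is the kind of tradeoff the installer shows up front.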

So, next, we’re going to go ahead and map a drive, and we can see the shares that are here; there’s a share called “Files.” That’s just the default, and you can change or delete it. I’m going to connect to one of the nodes and browse to it, and here I’ll see the share called “Files.” I’m going to go ahead and right-click and map a drive to it.

And now, I’ve got a drive mapped to the entire cluster. That’s two petabytes of capacity. Let’s look at that from the client’s side. So, that’s two petabytes base 10. What a Windows client sees is that capacity in base 2, which translates to 1.76 petabytes.
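That base-10 to base-2 conversion is easy to check in a couple of lines (illustrative only — the figure a real client displays can come out slightly lower once filesystem overhead is counted):

```python
PB = 10**15   # petabyte, base 10, as marketed
PiB = 2**50   # pebibyte, base 2, what Windows labels "PB"

displayed = (2 * PB) / PiB
# About 1.78 mathematically; overhead brings the on-screen figure down a bit.
print(f"{displayed:.2f}")
```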

So, I’m going to go grab some temporary data here just so I can start a file copy operation, so you can see that as different components in the cluster go down or fail, I’m still able to access the data. So I’m going to go ahead and go back to my mapped drive here and paste this large file after creating a test directory.

So, now, I’ve got a copy running. Let me just show you the dashboard again here and let’s check to see that we’re seeing the data right. So, as you can see in real-time, we can see the data being written to the cluster. And now…oh, a drive just failed. So, we immediately get a notification that a drive has failed. We can click right on that notification and we can see the details of that node. So, here I can see my eight terabyte drive has failed and exactly where it is.

And at a cluster level, I can see what’s going on and right now the cluster is reprotecting from that failure. My file copy is still continuing and I’m reprotecting the data. So I’m using the extra parity information that was written into the cluster to reconstruct the data from the drive that has failed.
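As a toy illustration of reconstruction from parity — using single-parity XOR, which is much simpler than the multi-parity erasure coding QF2 actually uses — a lost block can be rebuilt from the surviving blocks plus the parity:

```python
# Toy single-parity example: parity is the XOR of the data blocks.
blocks = [0b1010, 0b0110, 0b1100]           # data blocks on three drives
parity = blocks[0] ^ blocks[1] ^ blocks[2]  # written when the data was ingested

lost = blocks[1]                             # the drive holding block 1 fails
recovered = blocks[0] ^ blocks[2] ^ parity   # rebuild from survivors + parity
assert recovered == lost
```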

So, this will complete and then the data will be safe. As you can see, it’s also telling you that I can have two more drive failures. So, we’re running what’s called “3 drive failure,” so we can have any three drives fail or we can have a node plus a drive failure in this mode. So, the drive is reprotected. Now we’re going to rebalance the remaining data. So, that just deals with the fact that we have a little bit less capacity on one of the nodes in the cluster. And now this is complete.

So, I’ve done some time acceleration here. I’m going to go ahead and replace this drive. And now, data is not at risk at all, but you will see a rebalance operation briefly, and then it will complete. Again, it’s just rebalancing to make sure the data is evenly spread across the cluster.

Now, I’ve time accelerated some items here, but I’ve only accelerated the data protection operations by three times, just so that this demonstration doesn’t take a long time. I’m going to go ahead and start copying another large file. So, I’m going to do something that takes a little longer here: I’m going to take down a node in the cluster, and then I’m going to fail a drive.

So, I want to make sure I’ve got enough data copying continuously here. So, a node has gone offline, and we’re told about that immediately. With the node offline, we’re going to notify you, and we’re going to notify support, who will reach out and work to get that node back online. But in the meantime, it looks like another drive just failed.

So, now, we have a node offline and we have a drive failure, and the data is still safe, but we cannot sustain another drive failure without some possible data loss. Here, we can see the current state and time will get accelerated a little bit here as we bring that node back online. And when the node comes back online, we’ll start to reprotect against that drive failure that had occurred. And then, it becomes just like a regular drive failure. I’m going to reprotect that data and then we’re going to rebalance that data.
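The tolerance rule in this mode can be sketched as a simple check. The weighting below is an assumption chosen to match what the demo states for “3 drive failure” mode — any three drives, or one node plus one drive — not Qumulo’s actual accounting:

```python
# Assumed model: in "3 drive failure" mode, a failed node consumes two
# of the three failure "slots" and each failed drive consumes one.
def data_safe(failed_drives, failed_nodes=0, drive_tolerance=3, node_cost=2):
    return failed_nodes * node_cost + failed_drives <= drive_tolerance

print(data_safe(3))     # any three drive failures: data safe
print(data_safe(1, 1))  # one node plus one drive: data safe
print(data_safe(2, 1))  # node plus two drives: possible data loss
```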

Now, the rebalance will take a little bit longer because data was being written to the cluster while that node was offline. So, in addition to rebalancing from the fact that we have a little bit less capacity in one node, we need to rebalance the data that was written to the cluster while that node was offline. And then, this will complete. And then, we’re going to go ahead and replace that drive.

Now, I’ve done time acceleration here. In reality, these operations took about a minute, and that is with very little data on the cluster; with more data they will take a bit more time. But it shows you the resilience of the system and its ability to understand what data is on a failed drive and rebuild only that data. These are eight terabyte drives, and an eight terabyte drive in a standard RAID system would take a very long time to rebuild, because the system has no intelligence about what data is actually there and what is not.
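As a back-of-the-envelope comparison (all numbers below are hypothetical, chosen only to illustrate the point), rebuilding just the data actually on the failed drive rather than every sector makes a large difference:

```python
drive_tb = 8.0        # failed drive's raw size
used_tb = 0.5         # assumed data actually on that drive
rate_tb_per_hr = 0.4  # assumed reconstruction throughput

raid_hours = drive_tb / rate_tb_per_hr   # blind, sector-by-sector rebuild
aware_hours = used_tb / rate_tb_per_hr   # rebuild only live data

print(f"RAID-style rebuild: {raid_hours:.0f} h")
print(f"Data-aware rebuild: {aware_hours:.2f} h")
```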

And there, we’ve reprotected the cluster from both a node failure and a drive failure. The system is quite resilient, and the customer has control over how much protection they want. So, again, we recommended 2 drive failure protection, but the customer is able to turn the protection up to 3 drive failure and get more resiliency in the system if they want. And it’s very clear at install time how much extra capacity it will cost them to increase that protection. Thank you for your time.
