It is an occasional and unfortunate fact of life for a system administrator that someone, sometime, will move a directory containing many files (possibly millions) to another location on the file system, only to realize they cannot remember where they put it.

For example, one of our media customers told us that artists using Wacom tablets frequently drag a folder onto another folder by accident, without realizing it. In a file system with a hundred million files, it can take days to figure out where the folder went.

Our customer challenged us to help with this problem, so we came up with a solution that uses Qumulo’s cluster snapshots feature and our associated REST APIs in a Python script that:

  1. Looks in the snapshots on a cluster for a specified directory path
  2. If the path is found, uses the Qumulo file ID from the most recent snapshot in which it existed to find it anywhere on the Qumulo file system by that ID

Qumulo stores the file ID in metadata and maintains an index of IDs that is independent of file system path. Because we can quickly find a known path in snapshots and return the ID for that path, we can then use the ID to find the file anywhere in the current file system with a single REST call.
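The two steps above can be sketched with a toy in-memory model. To be clear, nothing here is the Qumulo API: the `Snapshot` class, the `find_moved_path` function, and the dictionaries standing in for snapshot contents and the ID index are all hypothetical names, and the sketch assumes a higher snapshot number means a more recent snapshot. It is only meant to show why the lookup never needs to crawl the tree.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Snapshot:
    """Hypothetical stand-in for a snapshot: a frozen map of path -> file ID."""
    snapshot_id: int
    ids_by_path: Dict[str, int]


def find_moved_path(
    old_path: str,
    snapshots: List[Snapshot],
    live_paths_by_id: Dict[int, str],
) -> Optional[str]:
    """Return the current location of old_path, or None if it cannot be found."""
    # Step 1: search snapshots newest-first for the path as it used to exist.
    # Assumption: a larger snapshot_id means a more recent snapshot.
    for snap in sorted(snapshots, key=lambda s: s.snapshot_id, reverse=True):
        file_id = snap.ids_by_path.get(old_path)
        if file_id is not None:
            # Step 2: resolve the ID against the live, path-independent index.
            # This is a single lookup; None means the ID no longer exists.
            return live_paths_by_id.get(file_id)
    return None


if __name__ == "__main__":
    snapshots = [
        Snapshot(1, {"/music/William Shatner": 42}),
        Snapshot(2, {"/music/William Shatner": 42}),
        Snapshot(3, {}),  # taken after the accidental move
    ]
    live_index = {42: "/scratch/tmp/music/William Shatner"}
    print(find_moved_path("/music/William Shatner", snapshots, live_index))
```

The cost depends on the number of snapshots searched plus one ID lookup, not on how many files the file system holds.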

The Script and a Real-World Test/Example

For this example, I’m using a test Qumulo cluster with 34 terabytes of used capacity and over 1.3 million files and directories. My music collection is stored on this cluster, and in it is my precious collection of music by William Shatner (he sang Elton John’s Rocket Man, among other hits!). I can view my music on my Qumulo cluster by using the Capacity Explorer in the UI, by mounting my music directory locally via NFS or SMB, or by using the Qumulo REST API, like this:

> qq --host Cluster1 login -u user -p passwd
> qq --host Cluster1 fs_read_dir --path '/music'

Which (eliding lots of text) returns:

    "child_count": 9,
    "Files": [
            "path": "/music/William Shatner/",

Later, I decide to listen to my William Shatner music again, and to my distress, the Shatner music directory is no longer in my music. And I have no idea what I did with it or whether it is still on my cluster or not…

Normally, to find a directory on this rather large cluster, I’d use a POSIX utility such as find and run something like this:

> time find /mounts/Cluster1/ -type d -name William\ Shatner | head -n 1

On a cluster of this size (34 TB with over a million files and directories), the above find command hadn’t finished after many hours. But using Qumulo REST APIs (and a little a priori knowledge of where I had it before… in ‘/music/William Shatner’), I was able to find it on my cluster in less than a second with a simple Python script that makes a couple of Qumulo REST calls:

> time ./ --host cluster1 "/music/William Shatner"

/music/William Shatner from snapshot 2 can now be found at /Lab_1/Test_Set_A/2016-12-1/Data/1/1/1/1/1/1/1/A/1/1/1/1/1/1/111A1/1/1/SCRATCH/A/1/1/AA/111/10A/11/1/AA1111/1/E11/YAA10/tmp101/AA.101/2016-11-11/music/William Shatner/ on Qumulo cluster.

./ --host cluster1 "/music/William Shatner" 0.06s user 0.03s system 15% cpu 0.590 total

This is much, much faster than crawling a potentially enormous file system to find the directory or file you are looking for.
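For contrast, here is a minimal sketch of the crawl that a find-style search performs, written with Python’s standard `os.walk`. The function name `crawl_for_directory` is made up for illustration. It has to list directories one by one until it happens upon the name, so its running time grows with the size of the tree rather than staying roughly constant.

```python
import os
from typing import Optional


def crawl_for_directory(root: str, name: str) -> Optional[str]:
    """Return the first directory called `name` found under `root`, or None."""
    # os.walk visits every directory beneath root (top-down by default).
    for dirpath, dirnames, _filenames in os.walk(root):
        if name in dirnames:
            # Stop at the first match, like piping find into `head -n 1`.
            return os.path.join(dirpath, name)
    return None
```

Calling `crawl_for_directory("/mounts/Cluster1", "William Shatner")` would be the rough equivalent of the find command shown earlier, with the same need to walk the mount until a match turns up.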

Next Steps

To try this yourself, check out the Python-based snapshot_scripts sample, which you can find on the Qumulo GitHub site, and let us know if you have questions.
