With Qumulo for AWS, you can move data sets to a Qumulo cluster on AWS, render them on cloud-based nodes, and then move the results back to an on-premises Qumulo cluster. Many Qumulo customers work through this workflow regularly, and I’d like to share with you some of what I learned while helping them accomplish it.
Too often, when you’re rendering a large job, or perhaps many jobs, it becomes painfully obvious that you need more resources. This realization usually occurs as your deadline looms dangerously on the horizon. What do you do?
Calculating Rental Nodes vs. Cloud, Cost vs. Time
It’s fairly straightforward to calculate the cost of rental hardware nodes versus compute and storage time with a cloud provider, and there is a break-even point. Once you take into account the time it takes to order, deliver, and rack and stack the nodes, not to mention the challenge of finding available rental hardware, as well as enough data center space, power, networking, and cooling, the cloud starts to sound like a pretty good alternative if the demand is high enough. So just how do you burst your rendering to the cloud? There are object gateways, but the most commonly used rendering applications are all file based, and who wants to deal with that mismatch? With Qumulo for AWS and some configuration, it can be done!
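To make the break-even point concrete, here’s a back-of-the-envelope sketch in Python. Every number in it (rental rates, cloud hourly pricing, setup fees) is a made-up placeholder; substitute your own quotes.

```python
# Back-of-the-envelope break-even: rental hardware vs. cloud burst.
# All prices here are hypothetical placeholders, not real quotes.

def rental_cost(days: float, nodes: int, per_node_per_month: float = 2000.0,
                setup_fee: float = 5000.0) -> float:
    """Flat monthly rental plus a one-time shipping/rack-and-stack fee."""
    return setup_fee + nodes * per_node_per_month * (days / 30.0)

def cloud_cost(days: float, nodes: int, per_node_per_hour: float = 3.0,
               storage_per_day: float = 40.0) -> float:
    """Pay-as-you-go compute plus cloud storage for the data set."""
    return nodes * per_node_per_hour * 24.0 * days + storage_per_day * days

def break_even_days(nodes: int) -> float:
    """First whole day at which rental becomes cheaper than cloud."""
    day = 1.0
    while cloud_cost(day, nodes) < rental_cost(day, nodes) and day < 365:
        day += 1.0
    return day

if __name__ == "__main__":
    for n in (10, 50):
        print(f"{n} nodes: break-even around day {break_even_days(n):.0f}")
```

With these placeholder rates, short bursts clearly favor the cloud, and the more nodes you need, the sooner rental wins; the point of the exercise is to plug in your actual quotes and see where your own crossover lands.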
Determining the Infrastructure
Essentially, you want to extend the physical on-prem render farm (and all the accompanying infrastructure) into the cloud. NFS/SMB over a WAN link can be cripplingly slow because of latency. On the other hand, it’s reasonable to set up a cloud cluster that serves files locally to the cloud render nodes. Data sets can be replicated to the cloud and the results moved back. Different levels of compute are available in the cloud, and this should figure into your cost calculations: pay more for faster, more powerful compute, or pay less for slower, interruptible resources.
You should also think about setting a reasonable checkpoint in your renders. If you choose a tier that can be interrupted, restarting the renders from the last checkpoint can be easy or painful depending on your configuration.
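As an illustration of why checkpointing matters on interruptible tiers, here’s a minimal Python sketch of a frame-based checkpoint/resume loop. The render call is a stand-in; real renderers and queue managers have their own resume mechanisms.

```python
# Minimal checkpoint/resume sketch for an interruptible render node.
# The "render" step below is a stand-in for your real renderer.
from pathlib import Path

def last_completed_frame(out_dir: Path, prefix: str = "frame_") -> int:
    """Scan the output directory for the highest finished frame number."""
    done = [int(p.stem.removeprefix(prefix))
            for p in out_dir.glob(f"{prefix}*.exr")]
    return max(done, default=0)

def render_range(out_dir: Path, first: int, last: int) -> None:
    """Resume from the checkpoint instead of re-rendering from frame 1."""
    out_dir.mkdir(parents=True, exist_ok=True)
    start = max(first, last_completed_frame(out_dir) + 1)
    for frame in range(start, last + 1):
        # Stand-in for the real render call; write to a temp name and
        # rename so a partially written frame never counts as done.
        tmp = out_dir / f".frame_{frame}.tmp"
        tmp.write_bytes(b"fake pixels")
        tmp.rename(out_dir / f"frame_{frame}.exr")
```

The atomic rename is the important design choice: if the instance is reclaimed mid-frame, the half-written file is never mistaken for a completed checkpoint.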
You can automate the configuration of your cloud resources either with scripts or with deployment automation tools. There are plenty of packages out there and rolling your own is not that difficult.
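As one example of rolling your own, here’s a Python sketch that generates an EC2 user-data script to bootstrap a render node: install packages, drop in a pre-generated VPN profile, and mount the cloud cluster. The package names, systemd unit, and paths are assumptions based on a typical Amazon Linux setup; adjust for your distribution and tooling.

```python
# Sketch: generate a cloud-init user-data script for a render node.
# Hostnames, paths, and package names are illustrative placeholders.

def render_node_user_data(vpn_conf: str, nfs_server: str, export: str,
                          mount_point: str = "/mnt/render") -> str:
    """Return a shell script suitable for EC2 instance user data."""
    return "\n".join([
        "#!/bin/bash",
        "set -euo pipefail",
        "yum install -y openvpn nfs-utils",
        # Drop the pre-generated per-node VPN profile in place.
        f"cat > /etc/openvpn/client/render.conf <<'EOF'\n{vpn_conf}\nEOF",
        "systemctl enable --now openvpn-client@render",
        f"mkdir -p {mount_point}",
        f"mount -t nfs {nfs_server}:{export} {mount_point}",
    ])
```

You’d pass the resulting string as the instance’s user data when launching it (via the console, CLI, or an SDK), so every cloud node comes up VPN-connected with storage mounted.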
Correctly Configuring the VPN
Of course, security is a primary concern, so you need to correctly configure your VPN. That connection provides the command-and-control communication to the cloud nodes and allows them to check out licenses from your on-premises license server. OpenVPN is great, and clients are readily available for both Linux and Windows. Some firewalls even support it natively! You’ll need to distribute keys and configuration files to each cloud node. You can also restrict the IP connectivity of the cloud instances so that only your on-prem network can access them (and vice versa: you want your cloud instances to reach only your network).
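A per-node OpenVPN client profile looks roughly like the following. The remote hostname and certificate file names are placeholders; you’d generate a unique certificate and key for each cloud node.

```
client
dev tun
proto udp
remote vpn.example.com 1194
ca ca.crt
cert render-node-01.crt
key render-node-01.key
remote-cert-tls server
cipher AES-256-GCM
verb 3
```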
Accessing the License Server
You can’t render without a license server! Unfortunately, most licenses (and license servers) are keyed to a physical MAC address, and you probably have one or more already established in your environment. Spawning a virtual instance in the cloud is possible, but you’ll get a different IP (and MAC address) each time you start it up, which is painful if you need to get new licenses. By using a VPN, you direct all licensing queries from the cloud back to your infrastructure over a secure channel. (This assumes you have floating licenses available for the cloud render nodes.)
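Most floating-license systems follow the FlexLM-style port@host convention, so a useful first check from a cloud node is simply whether the on-prem license port is reachable over the VPN. A small Python sketch (the hostname and port in the usage comment are placeholders):

```python
# Quick sanity check: can this cloud node reach the on-prem license
# server over the VPN? Host and port values are placeholders.
import socket

def license_server_reachable(host: str, port: int,
                             timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to port@host succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Typical FlexLM-style target: 27000@licenses.mystudio.internal
# license_server_reachable("licenses.mystudio.internal", 27000)
```

Running this on a freshly launched instance is a fast way to tell a licensing problem apart from a VPN routing problem.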
Using Queue Control
How do you manage and control the renders? In the past I’ve used Deadline, but any queue management software should work. The VPN connection provides connectivity back to your queue manager for the cloud instances and they should show up as regular clients (assuming you install all the appropriate packages). Here again, licensing works over the VPN connection. It makes sense to configure a separate group for only the cloud nodes.
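Queue managers each have their own group configuration, so the sketch below is manager-agnostic: it just shows the idea of partitioning workers into a cloud-only group by hostname convention. The aws- prefix is an assumption for illustration; use whatever naming your farm already has.

```python
# Sketch: keep cloud workers in their own group so jobs can be routed
# to them explicitly. Real queue managers (Deadline, etc.) have their
# own group mechanisms; this just illustrates the partitioning idea.

def partition_workers(workers: list[str],
                      cloud_prefix: str = "aws-") -> dict[str, list[str]]:
    """Split a worker list into 'cloud' and 'onprem' groups by hostname."""
    groups: dict[str, list[str]] = {"cloud": [], "onprem": []}
    for w in workers:
        groups["cloud" if w.startswith(cloud_prefix) else "onprem"].append(w)
    return groups

def eligible_workers(groups: dict[str, list[str]],
                     job_group: str) -> list[str]:
    """Only workers in the job's assigned group may pick it up."""
    return groups.get(job_group, [])
```

Keeping the cloud nodes in a dedicated group means you can send burst jobs to the cloud deliberately, rather than having the queue manager scatter frames across the WAN.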
How do you get your data to and from the cloud? With Qumulo, replication is easy to configure. You set the directory and start or schedule the job. Data is seamlessly replicated from an on-prem cluster to a cloud instance of Qumulo. Again, this traffic can flow over the secure VPN connection.
Now that the infrastructure is in place, let’s get some scenes rendered! Replicate a dataset that needs to be rendered from your on-prem cluster to the Qumulo for AWS instance. The render nodes mount an NFS export (or SMB share) from the cloud Qumulo instance, so file access stays “local” in the cloud. It goes without saying that you want all the render nodes and the Qumulo instance in the same region. Fire up the queue manager and send a job to the cloud nodes. It should work the same way as it does on the local nodes. Once the job is complete, replicate the resulting frames back to your on-prem cluster.
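For reference, the NFS mount on each render node can go in /etc/fstab so it survives reboots. The hostname and export path below are placeholders, and the nconnect option (multiple TCP connections to the server) requires a reasonably recent Linux kernel (5.3+):

```
# /etc/fstab entry on a cloud render node (placeholder names)
qumulo.render.internal:/projects  /mnt/render  nfs  rw,hard,nconnect=8  0  0
```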
When the burst is over, you can either shut down or terminate the Qumulo cluster and render nodes. If you shut the instances down, they accrue only cloud storage charges, and you can fire the cluster back up the next time you need to burst to the cloud. Alternatively, terminate the cluster and set it back up when you need it.