Public cloud infrastructure has transformed many aspects of IT strategy, but one thing remains constant: the vital importance of high availability (HA). When data is your business, as is the case for every business today, any loss can have dire consequences. You've got to do all you can to minimize that risk. Naturally, HA is a key area of focus for storage vendors both on-premises and in the cloud, but not every vendor takes the same approach. Understanding what the differences are, and why they matter, is essential to making the right choice for your data and your business.
On premises, HA commonly relies on a few clever network tricks. One of these is the concept of floating IP addresses: one or more IP addresses that do not belong solely to one device, but are shared among a cluster of devices. Clients use these floating IP addresses to access content served by the clustered devices, so in the event of a device failure, the client's connection can seamlessly swing from one device to another. There are a few different mechanisms for swinging floating IP addresses away from failed devices. For example, both the F5 Networks BIG-IP platform and the Qumulo File Fabric use a technique called gratuitous ARP, in which a surviving node broadcasts an unsolicited ARP reply to take over a floating IP address previously served by another node. Other systems use dynamic routing so that only a live device receives traffic. In both cases, it's the network itself that enables seamless failover from a failing node to a healthy one.
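To make the takeover mechanism concrete, here is a minimal sketch (illustrative only, not code from either vendor) of the 42-byte frame a node would broadcast as a gratuitous ARP announcement; the MAC and IP values are placeholders:

```python
import struct

def gratuitous_arp_frame(node_mac: str, floating_ip: str) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP reply.

    The sender and target IP are both the floating IP, so every host on
    the segment updates its ARP cache to map that IP to node_mac.
    """
    mac = bytes.fromhex(node_mac.replace(":", ""))
    ip = bytes(int(octet) for octet in floating_ip.split("."))

    # Ethernet header: broadcast destination, our MAC, ARP ethertype (0x0806)
    eth = b"\xff" * 6 + mac + struct.pack("!H", 0x0806)

    # ARP payload: Ethernet (1) / IPv4 (0x0800), 6-byte MAC, 4-byte IP, reply (op=2)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)
    arp += mac + ip              # sender: the node taking over the address
    arp += b"\xff" * 6 + ip      # target: same IP, broadcast hardware address
    return eth + arp
```

Actually emitting the frame requires a raw socket and elevated privileges; real implementations also repeat the announcement to defeat stale caches.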
In public cloud environments, you don't own or control the network. Here, it's Amazon, Microsoft, or Google who get to dictate which features to enable. For Amazon Web Services (AWS), such choices include disabling ARP in order to prevent the risk of abuses such as ARP cache poisoning (also known as ARP spoofing or ARP poison routing). That means that any on-premises appliances you've been using that relied on ARP for HA won't work. As a result, infrastructure vendors need to find a different approach for cloud HA.
The options for HA in the cloud come down to two basic approaches: you can either find a workaround that’s essentially similar to what you’ve done on-premises, or you can write a new, cloud-specific method for IP failover.
An example of a workaround is the method NetApp ONTAP uses for IP failover in AWS. As a classic scale-up storage architecture, NetApp relies on paired nodes where data is constantly mirrored from node to node. In this case, you’re effectively maintaining two copies of your data store, incurring compute, storage, and software costs for both the used and the unused node. Think of it as a form of auto insurance where, instead of paying a relatively low monthly fee, you cover your risk by buying an entire second car in case something goes wrong with the first. These deployments can be run in either active/standby or active/active configurations; both require that data be replicated fully. Now, this deployment itself does not provide IP failover; for that, you need to deploy a third compute system called the NetApp Cloud Manager.
The Cloud Manager is a t2.micro instance (the "mediator" in this architecture) dedicated to handling configuration of the ONTAP systems and providing failover. The Cloud Manager watches for a failure, then swings IP routing from the active node to the standby as needed. That sounds all well and good until we take a closer look at the t2.micro: an AWS EC2 instance type with just 1 vCPU and 1 GB of RAM. Making that the linchpin of your HA strategy means trading the single point of failure of the active node for an even smaller single point of failure in the failover mechanism itself.
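Conceptually, a mediator of this kind reduces to a polling loop. The following generic sketch (not NetApp's actual implementation; the health check and failover actions are injected placeholders) makes the weakness plain: if the process running the loop dies, nothing ever calls the failover routine.

```python
import time

def mediator_loop(check_active_healthy, trigger_failover,
                  interval: float = 5.0, sleep=time.sleep):
    """Poll the active node's health; on failure, swing traffic to the standby.

    The check and failover actions are injected callables, since the real
    mechanics (health probes, route changes) vary by deployment. This
    watcher is itself a single point of failure: if it crashes, the
    failover call below never runs.
    """
    while check_active_healthy():
        sleep(interval)
    trigger_failover()
```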
As an agile software company, Qumulo is in a position to really think through the right solution to each problem—no matter how hard it might be—and build it for our customers. Considering the complexity and risk of the ONTAP approach to HA in the cloud, we started from scratch and found a simpler and more reliable method.
Instead of trying to force-fit an on-premises model into a public cloud environment, we purpose-built IP failover for the cloud, making use of the features each public cloud platform provides. For example, we call AWS APIs from any working member of the cluster to swing a floating IP address from a downed cluster member to a functioning one. In this way, we avoid adding another layer of complexity and avoid introducing a single point of failure that could become a bottleneck. As an additional benefit, our approach eliminates the need for a redundant standby cluster, greatly reducing the cost of HA.
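As a sketch of what such an API-driven swing can look like (the function and resource names here are illustrative, not Qumulo's actual code), AWS's EC2 API exposes `AssignPrivateIpAddresses` with an `AllowReassignment` flag that lets a caller move a secondary private IP from one network interface to another:

```python
def fail_over_floating_ip(ec2, surviving_interface_id: str, floating_ip: str):
    """Move a floating (secondary private) IP to a surviving node's interface.

    `ec2` is an EC2 API client, e.g. boto3.client("ec2"). With
    AllowReassignment=True, EC2 detaches the address from whichever
    interface currently holds it, even if that node is unreachable.
    """
    ec2.assign_private_ip_addresses(
        NetworkInterfaceId=surviving_interface_id,
        PrivateIpAddresses=[floating_ip],
        AllowReassignment=True,
    )
```

Because any surviving cluster member can issue a call like this, no dedicated mediator instance is needed.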
Now, you may be wondering why you should care at all about HA in the public cloud given assurances like these from Amazon:
“Amazon EBS volumes are designed for an annual failure rate (AFR) of between 0.1% – 0.2%, where failure refers to a complete or partial loss of the volume, depending on the size and performance of the volume. … For example, if you have 1,000 EBS volumes running for 1 year, you should expect 1 to 2 will have a failure.”
The real question is why, when avoiding data loss is a solved problem on-premises, you’d accept even one or two lost EBS volumes per year in the cloud. No matter what business you’re in, whether media & entertainment, genomics research, autonomous driving, or even something as simple as home folders, your data is precious. What might be in those EBS volumes you’re losing every year? How will their loss affect your business? There’s no way of knowing—and that’s a risk no business can afford to take casually.
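The arithmetic behind those numbers is easy to check. A minimal sketch, treating the AFR as an independent per-volume failure probability (an assumption Amazon's statement does not strictly make):

```python
def expected_failures(n_volumes: int, afr: float) -> float:
    """Expected number of volume failures per year."""
    return n_volumes * afr

def prob_at_least_one(n_volumes: int, afr: float) -> float:
    """Probability that at least one of n independent volumes fails in a year."""
    return 1 - (1 - afr) ** n_volumes
```

Even at the low end of the stated range (0.1%), 1,000 volumes yield an expected one failure per year, and the chance of at least one failure is already over 60%.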
And Amazon's assurances don't even take into account compute node failure rates, which can be higher than you might think. EC2 instances can fail for a variety of reasons. One common case stems from the fact that AWS is, at its core, a data center of shared hardware. If a piece of hardware is scheduled for maintenance or decommissioning, your EC2 instance will need to be moved, and that causes a reboot. An even simpler example is when the underlying hardware develops a fault that forces all the instances it hosts onto other hardware, which in turn causes those instances to reboot. Any reboot temporarily makes the node appear failed, so traffic needs to switch to an active node.
If a compute node does go down, ONTAP will fail over from the active node to the surviving node. Unless, of course, it’s the t2.micro NetApp Cloud Manager that fails. If this happens, you’ve lost air traffic control for all your storage traffic in the public cloud, and there’s nothing to move clients from a failed node to the surviving node. Now you’ve got a real problem. In addressing the risk of a failed node, NetApp Cloud Manager ends up adding a new failure condition to the mix. Surely we can expect better for our next-generation enterprise architectures.
NetApp ONTAP serves as a cautionary tale about moving legacy technology to the public cloud without accounting for the inherent differences in these environments. Qumulo's cloud-native approach makes it possible to survive disk and node failures without introducing further complexity, and without excessive cost. By taking the time to do things in the right way for each type of infrastructure, on-premises and public cloud, we can provide the simple, reliable HA you need for the data your business depends on.
John McGovern has spent more than a decade helping companies integrate critical technologies into their infrastructure. At Qumulo, he is responsible for building Qumulo for the cloud, helping customers scale their storage systems beyond the data center.