Beyond Cloud SLAs: How to Navigate the Nuances of Resilience

The cloud seems to offer the promise of ubiquitous access and resilience. People can work from anywhere, and everyone who needs access to mission-critical applications and data can get it. Cloud service providers even offer service-level agreements (SLAs) for their high availability (HA) clusters, guaranteeing that at least one of the virtual machines (VMs) running your critical applications will be accessible no less than 99.99% of the time, the industry-standard definition of high availability. With a guarantee of 99.99% VM availability, the infrastructure supporting your critical applications can be inaccessible for no more than about four and a half minutes per month.
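The downtime budget behind an availability percentage is simple arithmetic, and a short sketch makes it easy to check. The function name and the 30-day default below are illustrative assumptions; a 30-day month works out to about 4.3 minutes at 99.99%, and a 31-day month to just under four and a half.

```python
def allowed_downtime_minutes(availability_pct: float, days: float = 30.0) -> float:
    """Maximum downtime, in minutes, permitted over `days` at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Compare a few common availability targets over a 30-day month.
    for pct in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(pct)
        print(f"{pct}% availability allows {minutes:.2f} minutes of downtime")
```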

But look a little closer. The cloud SLAs guarantee access to the infrastructure, but it's access to your applications and data that really matters to your organization and its customers. The SLAs guaranteeing an HA level of resilience make no mention of your applications. They do not guarantee that your applications will be able to access and interact with their critical databases. In other words, you could run into a situation where the VM is operational and meeting the cloud SLA requirements but your critical application is offline.
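One way to picture the gap is to treat VM reachability and application health as separate checks. The sketch below is purely illustrative: the `ProbeResult` fields and the `classify` function are hypothetical names, and a real monitor would fill them in from actual probes rather than hard-coded values.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    vm_reachable: bool        # what the infrastructure SLA covers
    app_responding: bool      # what your users actually care about
    database_reachable: bool  # whether the app can reach its data


def classify(probe: ProbeResult) -> str:
    """Separate infrastructure outages from application-level outages."""
    if not probe.vm_reachable:
        return "infrastructure outage"
    if not (probe.app_responding and probe.database_reachable):
        # The VM is up, so the SLA is being met, yet the application is offline.
        return "application outage"
    return "healthy"


if __name__ == "__main__":
    print(classify(ProbeResult(vm_reachable=True, app_responding=True,
                               database_reachable=False)))
```

The middle branch is the case the SLA does not cover: the provider has met its obligation, and the application is still unavailable.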

But don't despair. You can configure your cloud infrastructure for HA and data availability. You just need to do a bit more than rely on the infrastructure SLAs that AWS, Azure, and Google Cloud provide.

Configuring for Application Availability

At the heart of any discussion of HA — whether in the cloud or on-premises — is the notion of redundancy. If the primary VM running your application stops responding — for whatever reason — another VM needs to be available to take over the workloads that the primary VM had been running. One way to support that is through a failover cluster that consists of two or more VMs, one of which is always standing by in case the primary VM stops working. For true HA, you'd deploy those VMs in separate cloud data centers or availability zones (AZs). That way, if a large-scale event such as a tornado or a power outage (or the human error that famously wreaked havoc in a public cloud) shuts down the entire data center where the active VM is running, the secondary VM in a geographically distinct data center is unlikely to be affected and can immediately take over. Software managing the failover cluster would cause the jobs on the primary VM to "fail over" to the secondary VM.
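The failover behavior described above can be sketched as a small simulation. This is a minimal model rather than how any particular clustering product works: the `Node` and `FailoverCluster` classes, the zone labels, and the rule of promoting the first healthy standby are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    zone: str            # availability zone the VM runs in (hypothetical label)
    healthy: bool = True


@dataclass
class FailoverCluster:
    nodes: List[Node]                 # first entry starts as the active node
    active: Node = field(init=False)

    def __post_init__(self) -> None:
        self.active = self.nodes[0]

    def check_and_failover(self) -> Node:
        """Keep the active node if it is healthy, else promote a healthy standby."""
        if self.active.healthy:
            return self.active
        for candidate in self.nodes:
            if candidate is not self.active and candidate.healthy:
                self.active = candidate
                return candidate
        raise RuntimeError("no healthy node available to take over")


if __name__ == "__main__":
    cluster = FailoverCluster([Node("vm-a", "az-1"), Node("vm-b", "az-2")])
    cluster.nodes[0].healthy = False   # simulate losing the primary's data center
    print("active node:", cluster.check_and_failover().name)
```

Placing the two nodes in different zones is what keeps a single data-center event from taking out both at once.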

But failover by itself is insufficient. Your secondary VM must be able to access the data with which the applications on your primary VM had been interacting. If you're using SQL Server, for example, the secondary VM needs access to the same databases the primary VM was using; otherwise, it cannot take over where the primary VM left off.

Here, the cloud poses a challenge. While AWS and others offer shared storage options, each comes with its own challenges and limitations, and in many cases will be cost-prohibitive. Instead of shared storage, your HA infrastructure needs a solution that synchronously replicates every transaction committed to the database on the primary VM to a copy of the database on the secondary VM. That way, if the secondary VM is suddenly called into service, all the data that had been written on the primary VM is already available to the secondary VM, which can take over immediately.
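The key property of synchronous replication is that a write is not acknowledged until the secondary copy has it too. The sketch below models that with in-memory lists. It is a deliberate simplification with hypothetical class names, and real products handle an unreachable secondary in more nuanced ways than rejecting the write.

```python
class ReplicationError(Exception):
    """Raised when the secondary cannot confirm a write."""


class Replica:
    def __init__(self, name: str) -> None:
        self.name = name
        self.log = []        # transactions stored on this copy
        self.online = True

    def apply(self, txn: str) -> bool:
        """Store the transaction and acknowledge it, or refuse if offline."""
        if not self.online:
            return False
        self.log.append(txn)
        return True


class SyncReplicatedPrimary:
    def __init__(self, secondary: Replica) -> None:
        self.log = []
        self.secondary = secondary

    def commit(self, txn: str) -> None:
        """Acknowledge a commit only after the secondary has stored it as well."""
        self.log.append(txn)                  # write locally
        if not self.secondary.apply(txn):     # wait for the secondary's acknowledgment
            self.log.pop()                    # undo the local write
            raise ReplicationError(f"{self.secondary.name} did not acknowledge {txn!r}")


if __name__ == "__main__":
    secondary = Replica("vm-b")
    primary = SyncReplicatedPrimary(secondary)
    primary.commit("INSERT order 1001")
    print("copies identical:", primary.log == secondary.log)
```

Because every acknowledged commit exists on both copies, the secondary can take over without losing committed data.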

Ensuring the Accessibility of Data

While the cloud SLAs guarantee that at least one of those two VMs will be highly available, the cloud providers do not provide a mechanism for replicating the data among the VMs. That is why your organization might be unable to access its critical applications and data even when at least one VM still appears to be accessible. To get the data replication you need, you have two choices: either draw on replication services that might be native to your applications, such as the Availability Groups (AGs) feature of SQL Server, or use a third-party SANless clustering tool, which will replicate anything in storage on the primary VM to storage on the secondary VM.

Fundamentally, both approaches will provide you with the synchronous data replication services you need to ensure ongoing access to applications and data from your secondary VMs. The differences between the approaches lie in the details. A solution such as the AGs feature of SQL Server will replicate only the data associated with SQL Server. It will not replicate any other data that might be in storage on your primary VM. Conversely, a SANless clustering approach will replicate anything in storage on the primary VM to storage on the secondary VM.

If you're thinking about a database-native replication tool, determine whether it imposes limits on the number of databases you can replicate or the number of targets to which you can replicate. Using the Basic AG functionality of SQL Server Standard Edition, for example, a single availability group can replicate a single user database to a single secondary VM. If you want to replicate multiple databases concurrently, you'll either have to configure multiple one-to-one AGs, which may not fail over together in a failover scenario, or you'll have to upgrade to the Always On AG feature of SQL Server Enterprise Edition, which can be a very costly upgrade if you're upgrading only for the replication functionality. The same applies if you want to replicate a SQL Server database to multiple secondary VMs: you'll need to run SQL Server Enterprise Edition to do that, even if your applications only require the functionality of SQL Server Standard Edition. In contrast, the SANless clustering approach simply replicates data, regardless of the number of databases or the number of replication targets.
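The limits described above can be written down as a small planning check. The function below is a hypothetical helper that encodes only the constraints stated in this article (one database and one secondary per Basic AG). It is not a licensing calculator and does not query SQL Server.

```python
from typing import Dict, List


def replication_options(databases: List[str], secondary_count: int) -> Dict[str, object]:
    """Summarize what the Basic AG limits imply for a given replication plan."""
    if not databases or secondary_count < 1:
        raise ValueError("need at least one database and one secondary")
    fits_basic = secondary_count == 1   # a Basic AG replicates to one secondary only
    return {
        # One Basic AG per database, since each Basic AG holds a single database.
        "basic_ags_required": len(databases) if fits_basic else None,
        # Separate AGs fail over independently, not as one unit.
        "fails_over_as_one_unit": fits_basic and len(databases) == 1,
        # More than one secondary requires the Enterprise Edition feature.
        "needs_enterprise_for_ag": not fits_basic,
    }


if __name__ == "__main__":
    print(replication_options(["sales", "inventory", "hr"], secondary_count=1))
    print(replication_options(["sales"], secondary_count=2))
```

A SANless clustering approach sits outside this check entirely, because it replicates storage rather than individual databases.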

The SLAs offered by the cloud service vendors provide what is best described as necessary but insufficient protection for your HA needs. You need to focus on data replication to ensure that your cloud infrastructure provides the resilience and availability your organization expects.

About the Author:

Dave Bermingham is the Senior Technical Evangelist at SIOS Technology. He is recognized within the technology community as a high availability expert and has been honored by his peers as a Microsoft MVP, first in Clustering since 2010 and for the past few years in Cloud and Datacenter. Dave is a frequent speaker at technical conferences, including SQL Saturdays, PASS Summit, and MSSQL Tips, and is the author of the Clustering for Mere Mortals blog. Dave holds numerous technical certifications and has more than 30 years of IT experience, including in finance, healthcare, and education.
