How to bring HA and DR to SQL Server on Amazon EC2

The pros and cons of the various AWS, Microsoft, and third-party options for protecting mission-critical SQL Server databases in the Amazon cloud

Bring HA and DR to SQL Server on Amazon EC2
Thinkstock

The successful migration from purely private to public and hybrid cloud arrangements has substantially increased the level of confidence IT administrators have in the cloud. And among the top cloud service providers chosen by administrators is Amazon Web Services, based on its proven track record of security and reliability. But challenges remain to providing adequate high availability and disaster recovery protections for mission-critical SQL Server databases.

This article provides some practical guidance for system and database administrators tasked with protecting mission-critical SQL Server running in the Amazon Elastic Compute Cloud (EC2), beginning with how those databases are normally configured.

Amazon EC2 Instance Store vs. EBS

At the risk of oversimplification, AWS has two EC2 storage options that are normally used for SQL Server databases: Instance Store and Elastic Block Storage. Instance Store is a form of ephemeral or non-persistent storage that uses locally attached media (SSD or HDD). The data store is, by default, not replicated, and this option lacks support for snapshots. Instance Store is a good choice for temporary databases, and, when supplemented with separate provisions for persisting and protecting the data, it is also a good choice for very large databases or databases requiring high transactional throughput.

Elastic Block Store (EBS) is block-level storage that is optimized for random I/O. EBS persists and replicates the data, providing native redundancy, and it supports snapshots. Other advantages include support for a write cache and encryption of data at rest in EBS volumes. For applications requiring high throughput performance, AWS recommends using striping. In addition to providing general-purpose storage, EBS also offers a choice of volume types for provisioned IOPS performance or throughput optimization.

While EBS may seem to be the better choice for protecting SQL Server databases, Instance Stores are able to deliver far superior performance, with up to 10 times the throughput of EBS, and they support up to 48 terabytes of storage. And while EC2 includes no high availability (HA) or disaster recovery (DR) provisions with either option, full protection for mission-critical databases can be provided by other means.

For both Instance Store and Elastic Block Storage, there are two types of EC2 compute instances best suited for use in database applications: memory-optimized or storage-optimized for I/O. While this aspect of the configuration is beyond the scope and intent of this article, which is focused on protecting the data, here are the suitable compute instances for those interested in learning more: X1, X1E, R5, R5d, and Z1d when optimizing for memory, and I3 when optimizing for storage I/O.

It is worth noting that AWS also offers a fully managed Relational Database Service (RDS) with a choice of six different database engines. However, note that RDS support for SQL Server is not as robust as it is for other databases like Amazon Aurora, MySQL, and MariaDB. Here are some of the common concerns administrators have about using RDS for mission-critical SQL Server applications:

  • Support for only a single mirrored standby instance
  • Agent Jobs are not mirrored and must be created separately in the standby instance
  • No support for multi-region configurations needed for disaster recovery
  • No support for performance-optimized in-memory database instances
  • Performance can be adversely impacted based on Availability Zone assignments, over which the customer has no control
  • Requires the more expensive Enterprise Edition with its Always On Availability Groups feature to replicate the data

Amazon EC2 SLA caveat emptor

Before covering the HA/DR options available for SQL Server, it is important to be aware of certain limitations in the Amazon EC2 Service Level Agreement. When using multiple Availability Zones, the AWS SLA guarantees an uptime of four nines or 99.99 percent. And the SLA is a money-back guarantee that offers refunds ranging from 10 percent to 100 percent whenever a monthly service level drops below four nines, with the full refund given when the service level drops below 95.0 percent.

The problem is, the SLA has a rather narrow view of what constitutes uptime. Explicitly excluded is any downtime or unavailability caused by the following: “factors outside of our reasonable control” (e.g. natural disasters); “actions or inactions of you or any third party” (i.e. human error); and “your equipment, software or other technology and/or third party equipment, software or other technology” (e.g. SQL Server). In effect, AWS only guarantees “dial tone,” i.e., that at least one EC2 instance will have external connectivity. In other words, any failures in the database itself or any applications access the database are not covered—or even detected for that matter.

So while it is advantageous to leverage the various services available within the AWS cloud, additional provisions are needed to ensure adequate HA and DR protections for SQL Server at the database and application levels. And those begin with SQL Server itself.

SQL Server’s Always On options

SQL Server offers two of its own options for protecting the data: Always On Failover Cluster Instances and Always On Availability Groups. Always On Failover Cluster Instances (FCIs) have been a standard feature since SQL Server 7. FCIs afford two major advantages: inclusion in the Standard Editions of SQL Server and protection for the entire SQL Server instance, including system databases. A notable disadvantage is the need for cluster-aware shared storage, which is not available in the AWS cloud. On premises, by contrast, where shared storage can and often does exist, FCIs are a viable and popular option. In fact they have been the predominant HA solution since SQL Server 7.

Always On Availability Groups replaced database mirroring in SQL Server 2012 Enterprise Edition, and this feature is also included in SQL Server 2017 for Linux. This is SQL Server’s more robust HA/DR offering, capable of delivering rapid, automatic failovers with no data loss for HA, as well as protecting against widespread disasters by leveraging asynchronous replication with minimal data loss. Another advantage of Availability Groups is support for readable secondaries (with appropriate licensing). But this option requires licensing the more expensive Enterprise Edition, making it cost-prohibitive for many applications, and lacks protection for the entire SQL Server instance. There is a less capable Basic Availability Groups feature that was added to the Standard Edition of SQL Server 2016, but it supports only a single database per Availability Group and does not permit readable secondaries.

A notable disadvantage with application-specific options like Always On Availability Groups is the need for administrators to implement other HA or DR solutions for all non-SQL Server applications. Having multiple HA/DR solutions inevitably increases complexity and costs (for licensing, training, implementation, and ongoing operations), which is why many organizations prefer using third-party, general purpose or application-agnostic solutions.

Third-party HA/DR options for SQL Server

Perhaps the biggest challenge encountered when configuring the AWS cloud for mission-critical HA/DR protections is that traditional failover clustering solutions, like Windows Server Failover Clustering (WSFC), use shared storage, and there is no shared storage available. Microsoft addressed this issue with the advent of Storage Spaces Direct (S2D) in Windows Server 2016 and SQL Server 2016. But because S2D does not support configurations that span Availability Zones—a critical requirement in HA configurations—it is not a viable option for AWS.

The challenges with Linux are even greater. Because Linux lacks a fundamental clustering capability equivalent to WSFC, administrators have struggled with making complex do-it-yourself HA Linux configurations work dependably using open source software. But the complexity in the full HA stack makes for a daunting undertaking that undermines the cost-saving rationale for using open source software on commodity hardware in public cloud services. As a result, DIY HA projects have been prohibitively expensive for most organizations. Just making Distributed Replicated Block Device, Corosync, Pacemaker, and, optionally, other open source software work as intended at the application level under a single failure scenario can be extraordinarily difficult. It is for this reason that only the very largest organizations have the wherewithal (skill set and staffing) needed even to consider taking on the task.

The various challenges involving both Windows Server and Linux have been overcome by third-party failover clustering solutions, such as SIOS DataKeeper and the SIOS Protection Suite, that are purpose-built for private, public, and hybrid cloud environments. These solutions are implemented entirely in software to enable creating a cluster of servers and storage—without any need for shared storage—and with automatic failover to assure high availability at the application level. Some of the more sophisticated solutions also offer advanced capabilities like WAN optimization to maximize multi-region performance, and manual switchover of active and standby server assignments to facilitate planned maintenance.

Versions for Windows Server are designed to work seamlessly with WSFC, providing real-time, block-level data replication in an on-premises SAN or cloud-based SANless environment. Support for FCIs in the less expensive Standard Edition of SQL Server for Windows Server, without any need to compromise availability or performance, is a major advantage.

Versions for Linux must provide a complete HA/DR solution that includes data replication, continuous application-level monitoring, and configurable failover/failback recovery policies. Linux is becoming increasingly popular for SQL Server and many other enterprise applications, and third-party failover clustering solutions now make implementing HA/DR provisions nearly as easy as it is for Windows Server.

While specific to the operating system, most failover clustering software is application-agnostic, enabling organizations to have a single, universal HA/DR solution. This same capability is what makes it possible to protect the entire SQL Server instance, including the database, log-ons, agent jobs, etc., all in an integrated fashion.

Whether for Windows Server or Linux, SANless failover clustering provides data replication for both Instance Stores and EBS, overcoming a limitation of the former and replacing (or supplementing) the replication provided by the latter. For clusters located within a single AWS Region, the data replication is synchronous and failovers can occur automatically. For clusters spanning multiple Regions, the data replication is asynchronous to avoid adversely impacting throughput performance, and failovers occur manually to minimize the potential for data loss.

The diagram below shows a popular EC2 configuration that provides both HA and DR protections in a Virtual Private Cloud (VPC) that spreads three SQL Server instances across multiple Availability Zones and Regions. The configuration enables rapid automatic failover for a localized failure, and facilitates a full, albeit manual recovery for a widespread disaster. Note how this configuration also overcomes yet another limitation—this one in the Standard Edition of SQL Server—of being able to have only two FCI nodes in a failover cluster.

sios protect sql server in aws ec2 SIOS Technology

This common configuration of a SANless failover cluster consists of a two-node HA cluster spanning two AWS Availability Zones, along with a third instance deployed in a separate AWS Region providing DR protection.

SANless failover clusters can be configured with only a single standby instance, of course. But such a minimal configuration does require a third node to serve as a witness. The witness is needed to achieve a quorum for determining the assignment of the active instance, and this important task is normally performed by a Windows Server instance in a separate Availability Zone. Keeping all three nodes (active, standby, and witness) in different Availability Zones eliminates the possibility of losing more than one vote if any zone goes offline.

It is also possible to have two-node and three-node SANless failover clusters in hybrid cloud configurations for HA or DR purposes. One such three-node configuration is a two-node HA cluster located in an enterprise data center with asynchronous data replication to the AWS cloud for DR protection—or vice versa.

1 2 Page 1
Page 1 of 2
How to choose a low-code development platform