Windows Failover Cluster using EBS Multi-Attach

The server cluster model is designed to meet the growing demand for access to critical applications such as e-commerce systems and databases. These applications require high fault tolerance, continuous availability, and scalability. A server cluster keeps the system operational and its services available even when a disk crashes or a server goes down.

1. Introduction

Windows failover clustering is a Microsoft technology that provides high availability at the server level. It groups multiple servers into a cluster. If one server in the cluster suffers a system failure, another server in the cluster takes over its workload. Each server in the cluster is known as a node, and the nodes work together to form the cluster. All nodes in a failover cluster continuously communicate with each other. If a node loses communication with the other nodes, another node automatically takes over the services of the disconnected node. This process is called failover. Once the failed node has been restored, its workloads can be moved back to it, a process known as failback.

When a node goes down, the Windows failover cluster restarts the affected services or applications on one of the remaining nodes. The time required to complete a failover depends partly on the hardware used and partly on the nature of the service or application.
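
The failover behaviour described above can be sketched as a small simulation. This is only an illustration, not how the Cluster Service is implemented: the `Node` class, the `fail_over` function, the 5-second heartbeat timeout, and the "fewest services" placement rule are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    last_heartbeat: float  # time (in seconds) the last heartbeat was received
    services: list = field(default_factory=list)


def fail_over(nodes, now, timeout=5.0):
    """Move services off nodes whose last heartbeat is older than `timeout`.

    Returns a list of (service, from_node, to_node) tuples, one per move.
    The timeout value is a hypothetical default chosen for this sketch.
    """
    healthy = [n for n in nodes if now - n.last_heartbeat <= timeout]
    failed = [n for n in nodes if now - n.last_heartbeat > timeout]
    moves = []
    if not healthy:
        return moves  # no node is left to take over the workload
    for dead in failed:
        for service in list(dead.services):
            # Pick the healthy node that currently hosts the fewest services.
            target = min(healthy, key=lambda n: len(n.services))
            dead.services.remove(service)
            target.services.append(service)
            moves.append((service, dead.name, target.name))
    return moves


nodes = [
    Node("node-a", last_heartbeat=100.0, services=["sql-server"]),
    Node("node-b", last_heartbeat=109.0),
]
print(fail_over(nodes, now=110.0))  # [('sql-server', 'node-a', 'node-b')]
```

In this run, node-a has been silent for 10 seconds, so its service is restarted on node-b. Failback would be the reverse move once node-a starts sending heartbeats again.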

Amazon Elastic Block Store (EBS) is a cloud storage service from Amazon Web Services (AWS) that provides volumes that can be attached to EC2 (Elastic Compute Cloud) virtual machines. EBS Multi-Attach is an EBS feature that lets multiple EC2 virtual machines attach the same EBS volume simultaneously. Before Multi-Attach, each EBS volume could be attached to only one EC2 virtual machine at a time. This provides several benefits:

Share data between virtual machines: Virtual machines can access data on the same EBS volume without copying it back and forth between individual volumes.
Reduce data recovery time: If one virtual machine fails, another virtual machine can immediately continue using the EBS volume without waiting for a data recovery process.
Adapt to flexible architectures: Multi-Attach supports applications and system architectures that require high availability and data sharing among multiple virtual machines.

Note that not all EBS volume types support Multi-Attach, and the feature is supported only on specific types of EC2 virtual machines. At the time of writing, it is available for Provisioned IOPS SSD volumes (io1 and io2) attached to Nitro-based instances in the same Availability Zone as the volume. Check the official AWS documentation to confirm that the resources you plan to use support Multi-Attach.
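
As a rough sketch of how such a volume might be requested, the helper below builds the arguments for the `create_volume` call of the boto3 EC2 client, which accepts a `MultiAttachEnabled` flag. The helper name `multi_attach_volume_params` and the zone, size, and IOPS values are placeholders chosen for this example, and the check on the volume type reflects the io1/io2 restriction of Multi-Attach.

```python
def multi_attach_volume_params(availability_zone, size_gib, iops, volume_type="io2"):
    """Build keyword arguments for boto3's EC2 `create_volume` call.

    Hypothetical helper for illustration; it only assembles the request
    and does not contact AWS.
    """
    if volume_type not in ("io1", "io2"):
        raise ValueError(
            "Multi-Attach requires a Provisioned IOPS SSD volume (io1 or io2)"
        )
    return {
        "AvailabilityZone": availability_zone,
        "Size": size_gib,
        "VolumeType": volume_type,
        "Iops": iops,
        "MultiAttachEnabled": True,
    }


# Placeholder values; replace them with ones that match your environment.
params = multi_attach_volume_params("us-east-1a", size_gib=100, iops=3000)
print(params)

# With AWS credentials configured, the request could then be sent like this:
#   import boto3
#   ec2 = boto3.client("ec2")
#   volume = ec2.create_volume(**params)
```

The volume must be created in the same Availability Zone as the cluster nodes, because a Multi-Attach volume can only be attached to instances in its own zone.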

2. Components of Windows Failover Cluster

Cluster Node: Each server participating in the cluster is called a cluster node. They need to be interconnected. Cluster nodes must communicate regularly to determine the status of each node. This connection is known as the cluster heartbeat. All cluster nodes must run the same version of Windows Server, such as Windows Server 2019.
Cluster Service: Cluster Service is the central component that controls the operation of the failover cluster. It runs on all cluster nodes and is managed by the Failover Cluster Manager.
Virtual IP Address and Cluster Name: The virtual IP address and cluster name are unique to the cluster and do not depend on which node currently hosts a workload. Clients use them to connect to the cluster, which lets them reach cluster services and applications transparently after a failover.
Cluster Quorum: The quorum determines which nodes continue to form the cluster after a hardware failure or network outage, or when cluster nodes cannot communicate with each other. Windows failover clustering supports several quorum types to suit different cluster layouts and node counts.
Services and Applications: Clustered services and applications are the units that fail over. In previous versions of failover clustering, they were called resource groups. At any given time, a service or application is owned by only one cluster node. If that cluster node fails, another cluster node takes ownership of the resource group[1] and starts it. You can configure preferred nodes for a service or application to fail over to. Resource health is monitored with a check called LooksAlive, which verifies that the application is running. By default, the LooksAlive check for SQL Server runs every 5 seconds.
Shared Storage: Clusters require shared storage; they cannot be built on storage that is directly attached to a single node. Shared storage can be provided by an iSCSI or Fibre Channel SAN (storage area network). For SQL Server, a shared disk resource holds all system and user databases, logs, FILESTREAM data, and full-text search files. During a failover, the disks are attached to a standby node, and the SQL Server service is then restarted on that node.
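
The quorum idea can be illustrated with a simple majority-vote calculation. This is a minimal sketch that assumes each node, and an optional witness, holds exactly one vote; the function names are made up for this example, and a real Windows cluster can also adjust votes dynamically, which the sketch ignores.

```python
def total_votes(node_count, has_witness=False):
    """Count the votes in the cluster: one per node, plus one for a witness."""
    return node_count + (1 if has_witness else 0)


def has_quorum(votes_online, votes_total):
    """Return True when the reachable voters form a strict majority."""
    return votes_online > votes_total // 2


# Two-node cluster without a witness: one surviving node holds 1 of 2 votes.
print(has_quorum(1, total_votes(2)))                    # False

# Same cluster with a witness: surviving node + witness hold 2 of 3 votes.
print(has_quorum(2, total_votes(2, has_witness=True)))  # True
```

The example shows why a witness is commonly added to clusters with an even number of nodes: without it, losing one of two nodes leaves the survivor without a majority.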

3. Functions of Windows Failover Cluster

Windows failover clustering provides the following key functions:

Automatic failover: When a node fails, the cluster automatically moves its services to a standby node.
Rapid failover: Failover typically completes in about 30 seconds, depending on the hardware and on the service or application involved.
Transparent to clients: After failover, clients can reconnect to the cluster immediately using the same cluster name and IP address, without changing their connection settings.
Transactional integrity: Committed data is not lost. For SQL Server, all committed transactions are preserved and reapplied to the database after the failover process is complete.
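
From the client side, the reconnect step after a failover can be approximated with a retry loop. The sketch below is only an illustration under stated assumptions: `connect_with_retry` is a hypothetical helper, `fake_connect` simulates an endpoint that refuses connections while the failover is in progress, and the name "sql-cluster" is a placeholder for the cluster name.

```python
import time


def connect_with_retry(connect, attempts=5, delay_seconds=1.0):
    """Call `connect()` until it succeeds or `attempts` is exhausted."""
    last_error = None
    for _ in range(attempts):
        try:
            return connect()
        except ConnectionError as error:
            last_error = error
            time.sleep(delay_seconds)  # wait before trying the same address again
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error


# Simulated endpoint: refuses the first two attempts, then accepts.
calls = {"count": 0}


def fake_connect():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise ConnectionError("cluster name not reachable yet")
    return "connected to sql-cluster"


print(connect_with_retry(fake_connect, delay_seconds=0))  # connected to sql-cluster
```

The client keeps using the same cluster name throughout; only the node answering behind that name changes.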