vSphere HA (VMware HA)
VMware vSphere HA (High Availability) is a utility included in VMware’s vSphere software that can restart failed virtual machines (VMs) on alternative host servers to reduce application downtime.
vSphere HA enables a server administrator to pool physical servers on the same network into a logical group called a high availability cluster. During a server failure, such as a system crash, power interruption or network failure, vSphere HA detects which VMs are down and restarts them on another stable system within the cluster. This process of restarting failed workloads on secondary systems is called failover.
VMware first introduced vSphere HA in Virtual Infrastructure 3 in 2006, and has continued to develop and support the feature.
In general, high availability describes systems or applications that are available — functioning as expected — a high percentage of the time. In enterprise data centers, system availability commonly exceeds 99% and is measured in nines: 99.99% availability, for example, is four nines.
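The relationship between nines and downtime is simple arithmetic. A minimal sketch (the function name is illustrative, not part of any VMware tooling):

```python
def annual_downtime_minutes(availability_pct):
    """Convert an availability percentage into expected downtime per year.

    For example, 99.99% availability (four nines) allows roughly
    52.6 minutes of downtime in a 365-day year.
    """
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return minutes_per_year * (1 - availability_pct / 100)
```

Each additional nine cuts the allowable downtime by a factor of ten: two nines (99%) permits about 5,256 minutes per year, while four nines permits about 52.6.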
How vSphere HA works
VMware vSphere HA uses a utility called the Fault Domain Manager (FDM) agent to monitor ESXi host availability and to restart failed VMs. When setting up vSphere HA, an administrator defines a group of servers to serve as a high-availability cluster, and the FDM agent runs on each host within the cluster. One host in the cluster serves as the master host; it monitors heartbeat signals from the other hosts, referred to as slaves, and communicates with the vCenter Server.
Host servers within an HA cluster communicate via a heartbeat, which is a periodic message that indicates a host is running as expected. If the master host fails to detect a heartbeat signal from another host or VM within the cluster, it instructs vSphere HA to take corrective actions. The type of action depends on the type of failure detected, as well as user preferences. In the case of a VM failure in which the host server continues to run, vSphere HA restarts the VM on the original host. If an entire host fails, the utility restarts all affected VMs on other hosts in the cluster.
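The detection-and-restart cycle described above can be sketched in a few lines of Python. This is a simplified conceptual model, not VMware code: the class, timeout value and round-robin placement are all illustrative, and the real Fault Domain Manager also consults datastore heartbeats and admission control before acting.

```python
import time

HEARTBEAT_TIMEOUT = 15.0  # seconds of silence before a host is declared failed (illustrative)

class Cluster:
    """Toy model of the master host's heartbeat monitoring and failover."""

    def __init__(self, hosts):
        # host name -> timestamp of its most recent heartbeat
        self.last_heartbeat = {h: time.monotonic() for h in hosts}
        # host name -> list of VM names currently running there
        self.vms = {h: [] for h in hosts}

    def receive_heartbeat(self, host, now=None):
        self.last_heartbeat[host] = now if now is not None else time.monotonic()

    def check_and_failover(self, now=None):
        """Restart VMs from hosts whose heartbeat has expired onto healthy hosts."""
        now = now if now is not None else time.monotonic()
        failed = [h for h, t in self.last_heartbeat.items()
                  if now - t > HEARTBEAT_TIMEOUT]
        healthy = [h for h in self.last_heartbeat if h not in failed]
        if not healthy:
            return []  # nowhere to restart; a real cluster would keep retrying
        restarted = []
        for dead in failed:
            for i, vm in enumerate(self.vms[dead]):
                # Spread the orphaned VMs across the surviving hosts
                target = healthy[i % len(healthy)]
                self.vms[target].append(vm)
                restarted.append((vm, dead, target))
            self.vms[dead] = []
        return restarted
```

In this sketch, a host that misses its heartbeat window has every VM it was running reassigned to a surviving host, mirroring the host-failure case in the paragraph above.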
The HA utility can also restart VMs if a host continues to run but loses its network connection to the rest of the cluster. To determine whether a network-isolated host is still running, the master host checks whether that host is still communicating with network-connected datastores, a mechanism known as datastore heartbeating. Shared storage, such as a storage area network, also enables any host in the cluster to access a VM's disk files and restart the VM, even if it was previously running on another server in the cluster.
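The master host's reasoning about the two heartbeat channels boils down to a three-way classification, which can be sketched as follows (a simplified model with illustrative names, not the FDM's actual decision logic):

```python
def classify_host(network_heartbeat_ok, datastore_heartbeat_ok):
    """Classify a monitored host the way the HA master reasons about it.

    A host that has gone silent on the network but still updates its
    heartbeat region on shared storage is isolated, not dead; only a host
    that is silent on both channels is treated as failed and has its VMs
    restarted elsewhere.
    """
    if network_heartbeat_ok:
        return "live"
    if datastore_heartbeat_ok:
        return "network-isolated"  # host still runs; apply the isolation response
    return "failed"                # restart its VMs on other hosts
```

Without the datastore check, a healthy host that merely lost its management network would be indistinguishable from a crashed one, and its VMs might be restarted while the originals were still running.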
VMware Distributed Resource Scheduler (DRS) is often used in conjunction with vSphere HA to rebalance workloads that must be restarted on alternate hosts. An organization that uses vSphere HA and DRS together can ensure that restarted VMs do not affect the performance of other VMs on the failover host.
The VMware Fault Tolerance feature can also ensure very high levels of availability. While vSphere HA restarts failed VMs after a short detection and boot-up time, Fault Tolerance maintains a redundant copy of the protected VM that can seamlessly take over the operations of the failed copy.
How to set up and use vSphere HA
The first step to set up vSphere HA is to create a cluster from the vSphere Web Client under Create a Cluster, and then select ESXi hosts and shared storage to participate in the cluster. HA clusters must contain at least two hosts, but many organizations maintain larger clusters that pool more resources and can accommodate multiple failures.
An admin can then turn on the vSphere HA feature from the Web Client under Manage > Settings > vSphere HA. Finally, a user can adjust vSphere HA configuration settings and preferences from the vSphere Web Client.
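The prerequisites from the steps above can be captured in a short sanity check. This is a hypothetical helper for illustration, not a VMware API; in practice the Web Client (or an SDK such as pyVmomi or PowerCLI) performs equivalent validation when HA is enabled:

```python
def validate_ha_cluster(hosts, shared_datastores):
    """Check the basic prerequisites for enabling vSphere HA on a cluster.

    hosts: names of the ESXi hosts in the proposed cluster
    shared_datastores: datastores reachable from every host in the cluster
    Returns a list of problems; an empty list means the cluster qualifies.
    """
    problems = []
    if len(hosts) < 2:
        problems.append("HA clusters must contain at least two hosts")
    if not shared_datastores:
        problems.append("HA requires shared storage for VM disk files "
                        "and datastore heartbeating")
    return problems
```

Both checks follow directly from how failover works: with fewer than two hosts there is no failover target, and without shared storage a surviving host cannot reach a failed VM's disk files.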
VMware vSphere HA requirements and best practices
Administrators can adjust many HA settings, including how long a VM or host is unavailable before vSphere HA attempts to restart it; the default value is 120 seconds. An admin can set VM restart preferences, selecting the order in which VMs restart in the cluster. This setting is useful if, for example, there is insufficient space on the cluster to restart all the failed VMs. In many cases, an administrator assigns a higher restart priority to VMs running mission-critical applications.
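The effect of restart priorities when the cluster cannot absorb every failed VM can be sketched as a simple greedy plan. The function and its capacity model are illustrative; real vSphere HA expresses capacity through admission control policies rather than abstract slots:

```python
def plan_restarts(vms, capacity):
    """Order failed VMs by restart priority and admit them until capacity runs out.

    vms: list of (name, priority, needed_capacity) tuples; a lower priority
         number means the VM restarts earlier (e.g., mission-critical = 1)
    capacity: total spare capacity on the surviving hosts
    Returns (started, deferred). Admission is greedy: a lower-priority VM
    may still start if a higher-priority one was too large to fit.
    """
    started, deferred = [], []
    for name, priority, needed in sorted(vms, key=lambda v: v[1]):
        if needed <= capacity:
            capacity -= needed
            started.append(name)
        else:
            deferred.append(name)
    return started, deferred
```

With this ordering, a mission-critical database restarts before lower-priority workloads, and whatever does not fit is deferred until capacity frees up.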
An organization can also define affinity and anti-affinity rules to restrict where certain VMs are placed. Affinity rules keep specified VMs together on the same host, while anti-affinity rules keep specified VMs apart, preventing them from restarting on selected servers or on servers that already run other specified VMs. These rules are useful to ensure that CPU-intensive VMs don’t restart on the same host after a disaster, or that two copies of a high-priority application don’t end up on the same host and become a potential single point of failure.
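An anti-affinity check during failover placement can be sketched as a filter over candidate hosts. The function and data shapes are illustrative, not the vSphere rule engine:

```python
def allowed_hosts(vm, candidate_hosts, placements, anti_affinity_rules):
    """Filter restart targets for a VM so that anti-affinity rules hold.

    vm: name of the VM being restarted
    candidate_hosts: hosts with enough capacity to take the VM
    placements: host name -> set of VM names already running there
    anti_affinity_rules: set of frozensets of VM names that must not
        share a host (e.g., two copies of a high-priority application)
    """
    ok = []
    for host in candidate_hosts:
        conflict = any(
            vm in rule and placements[host] & (rule - {vm})
            for rule in anti_affinity_rules
        )
        if not conflict:
            ok.append(host)
    return ok
```

For example, if one copy of an application already runs on a host, an anti-affinity rule covering both copies removes that host from the second copy's restart candidates, avoiding the single point of failure described above.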