VMware Fault Tolerance (FT)



Introduction to VMware Fault Tolerance (FT)

·         FT provides continuous availability for a VM
o   Zero down time
·         Takes VMware HA to the next level
·         Works for all applications and 99% of guest OS.
·         Does this by creating a "live shadow" copy of the running VM then keeping them in "lockstep" using VMware's vLockstep.
·         If an ESX server fails, the shadow will take over and a new shadow will be created in the cluster on another ESX server
·         The protected VM is called the "Primary" and the copied/lockstep VM is the "Secondary"
·         The virtual disk for the VM is on shared storage and never moves
·         "Continuous VMotion"
Requirements of FT
·         CPUs on all FT ESXi servers must match and be from a specific list of processors
·         Hardware Virtualization enabled in the BIOS
·         Recommended minimum number of 1 Gb NICs: 3
·         One NIC on each server must be enabled for FT logging and vMotion
·         ESXi servers must be running the same build
·         VMs on shared SAN, accessible by servers
·         Must be enabled in a HA cluster
·         vSphere Enterprise or Enterprise Plus
Cluster Requirements
·         Host Certificate checking must be enabled
·         At least 2 FT-certified hosts running the same FT version or host build number.
·         Hosts need access to the same storage
·         FT Logging and VMotion Networking need to be configured.
·         HA must be enabled on the cluster. If it isn't you will not be able to power on an FT machine or add a host running an FT machine already to the cluster.
Host Requirements
·         Must contain processors from the FT-compatible processor group. Highly recommended that CPUs are also compatible with one another.
·         Must be licensed for FT (Enterprise or Enterprise Plus)
·         Must be certified for FT (HCL).
·         BIOS must have Hardware Virtualization (HV) enabled.
VM Requirements
·         Virtual disks must either be in virtual RDM mode or VMDK files (no physical RDM). The disk must also be in thick format.
·         VM files must be stored on shared storage (FC, FCOE, iSCSI, NFS, NAS).
·         Cannot have more than one cpu.
·         Must be running a supported guest OS: Windows 7, Windows Server 2008, Vista, 2003, XP, 2000, NT 4, all Linux distributions supported by ESX, NetWare, Solaris 10, and FreeBSD (some guests have processor limitations, so check them).

Constraints of FT
·         Single vCPU in each VM only (no SMP)
·         Require Specific hardware
·         Recommended maximum of 4 VMs running FT on an ESX server
·         "Line of sight" between ESXi servers due to latency
·         Only thick disk is supported
·         Snapshots are not allowed (including via VADP backup products)
·         Cannot invoke a svMotion on a VM with FT enabled
·         Linked clones are not allowed on a VM with FT enabled
·         Some guests not supported and some guests require shutdown to enable
The following is not supported with FT
·         Snapshots
·         Storage vMotion
·         Linked Clones
·         Cannot back up an FT machine using the Storage API for Data Protection or VMware Data Recovery. Array-based snapshots, however, do not affect it.
·         Cannot use a floppy or CD-ROM backed by a physical or remote device (only img and ISO images on shared storage).
·         USB and sound devices
·         NPIV
·         NIC passthrough
·         vlance networking drivers
·         No hot-pluggable features (includes changing attached networks).
·         EPT/RVI
·         Serial or parallel ports
·         IPv6
·         3D enabled video drivers.
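
The constraints and unsupported features above lend themselves to a simple pre-check. The following is a minimal sketch, assuming a hypothetical `VmSpec` record and made-up feature labels (neither is a vSphere API object); it only illustrates how the lists could be turned into a checklist, and VMware Site Survey remains the tool the notes recommend for the real check.

```python
from dataclasses import dataclass, field

# Labels are hypothetical names for items in the "not supported" list above.
UNSUPPORTED_FEATURES = {
    "snapshot", "linked_clone", "usb", "sound", "npiv", "nic_passthrough",
    "vlance", "hot_plug", "ept_rvi", "serial_port", "parallel_port",
    "ipv6", "3d_video", "physical_rdm",
}


@dataclass
class VmSpec:
    """Hypothetical description of a VM, not a vSphere API object."""
    name: str
    num_vcpus: int
    disk_format: str            # "thick" or "thin"
    on_shared_storage: bool
    features: set = field(default_factory=set)


def ft_blockers(vm: VmSpec) -> list:
    """Return human-readable reasons why the VM would not qualify for FT."""
    reasons = []
    if vm.num_vcpus > 1:
        reasons.append("more than one vCPU")
    if vm.disk_format != "thick":
        reasons.append("disk is not thick provisioned")
    if not vm.on_shared_storage:
        reasons.append("VM files are not on shared storage")
    # Sorted so the report is stable from run to run.
    for feature in sorted(vm.features & UNSUPPORTED_FEATURES):
        reasons.append(f"unsupported feature: {feature}")
    return reasons


vm = VmSpec("web01", num_vcpus=2, disk_format="thin",
            on_shared_storage=True, features={"usb", "e1000"})
for reason in ft_blockers(vm):
    print(f"{vm.name}: {reason}")
```

An empty list means none of the blockers modeled here were found; it does not cover host, cluster, or licensing requirements.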
Testing to see if you can use FT with VMware Site Survey
·         Site Survey saves time by automating this check
·         Run Site Survey on your cluster to see if you can use FT
 
Enabling VMware FT
·         Once requirements have been met, enabling FT is easy
o   Right Click on a VM
o   Go to Fault Tolerance
o   Click Turn on Fault Tolerance
Configure VMware Fault Tolerance networking
Prerequisites
·         Multiple Gigabit NICs. Each host will need at least two, one for FT Logging and one for vMotion.

·         Configuring the networking is quite easy, essentially create two vmkernel ports, one for vMotion and one for FT Logging. *** NOTE *** The FT traffic is not encrypted, so secure this network as best you can, probably best to have a private network.

·         After you have created the vmkernel port for FT logging, your host's summary tab should show 'Configured for FT'. If there is an issue, the little blue comment box will display what it is as you hover over it.

Configure VMkernel NIC
 
Enable/Disable VMware Fault Tolerance on a virtual machine
Enable Fault Tolerance
This is actually quite easy. Right click a VM and select 'Fault Tolerance' -> 'Enable Fault Tolerance'

This option may be dimmed if

·         The VM is registered on a host that isn't licensed for FT
·         The VM is on a host that is in maintenance or standby
·         The VM is disconnected or orphaned
·         The user doesn't have the permission to do this.

After selecting Enable Fault Tolerance the following validation checks are performed
·         SSL certification checking is enabled
·         The host is in a vSphere HA cluster or mixed HA and DRS cluster
·         Host has ESX(i) 4.0 or greater installed
·         VM doesn't have multiple CPUs, snapshots, HA disabled, or a 3D video device.
·         Checks the BIOS for HV
·         Checks processors for primary and secondary
·         Checks processors in conjunction with the OS

The following occurs when enabling FT
·         A secondary VM is created. The placement and status of this VM will vary depending on the power state of the primary VM
o   If Primary is Powered ON
§  Entire state of primary VM is copied and the secondary is created, placed on a separate host and powered on (if it passes admission control).
§  FT status on the VMs summary tab will be 'Protected'
o   If Primary is powered off
§  Secondary is immediately created and registered to a host in the cluster ( could even be same host as primary but will be moved on power on ).
§  Secondary VM will not be powered on until the primary is powered on.
§  FT status will display 'Not Protected, VM not Running'
·         Once Fault tolerance is enabled, vCenter will remove the VMs memory limits and reservations and set a new memory reservation equal to the memory size of the VM. While FT is enabled on this VM you cannot change memory reservations, limits, size, or shares. If you disable FT, these values are not reverted back.
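
The memory behaviour described above can be modelled in a few lines. This is a sketch only, assuming a hypothetical `MemoryAllocation` record rather than any vCenter object: the limit is removed and the reservation becomes the full configured memory size.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class MemoryAllocation:
    """Hypothetical stand-in for a VM's memory settings (values in MB)."""
    size_mb: int
    reservation_mb: int = 0
    limit_mb: Optional[int] = None   # None means "no limit"


def apply_ft_memory_policy(mem: MemoryAllocation) -> MemoryAllocation:
    """Model what enabling FT does: drop the limit, reserve the full size."""
    return replace(mem, reservation_mb=mem.size_mb, limit_mb=None)


before = MemoryAllocation(size_mb=4096, reservation_mb=1024, limit_mb=3072)
after = apply_ft_memory_policy(before)
print("before:", before)
print("after: ", after)
```

Because the old values are not restored when FT is disabled, it may be worth recording the equivalent of `before` for each VM before turning FT on.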

Once enabled, the FT section in the summary tab will show you the following

·         FT Status
o   Protected – Primary and secondary are powered on and running as expected
o   Not Protected – Secondary VM is not running. It will also provide a reason
§  Starting – FT is in the process of starting the secondary.
§  Need Secondary VM – Primary VM is running without a secondary. Normally caused by the inability to create a secondary due to incompatible hosts. If there are compatible hosts, sometimes disabling FT and re-enabling it will fix this.
§  Disabled – FT is currently disabled (occurs when FT is disabled by the user, or when vCenter Server disables FT after being unable to power on the secondary).
§  VM Not Running – FT is enabled, but the primary is powered off.
·         Secondary Location – shows which host is running the secondary VM
·         Total Secondary CPU – shows the CPU usage of the secondary VM (MHz)
·         Total Secondary Memory – shows the total memory usage of the secondary (MB)
·         vLockstep Interval – The time interval in seconds needed for the secondary VM to match the current execution state of the Primary. Typically less than 1/2 a second. No state will be lost even if this interval is high.
·         Log Bandwidth – Amount of network capacity used to send FT log info from the host running the primary to the host running the secondary.

To disable, just right-click and choose 'Fault Tolerance' -> 'Turn off fault tolerance'

FT in the VMkernel
·         The FT vmkernel module is called vmklogger.
·         Log entries are put in the log buffer, which is flushed/filled asynchronously.
·         Log entries are sent/received through socket on VMkernel NIC.
·         There should be a dedicated VMkernel network for logging which has FT Logging enabled.
 
Test an FT configuration
VMware provides a couple of FT scenarios that can be tested
Testing FT Failover
·         The secondary machine will become the new primary, the old primary is then removed.
·         A new secondary machine will spawn up and sync up with the new primary.
Testing Restart Secondary
·         This will destroy the current secondary VM and restart another one.
·         The primary is unaffected during this test.
Determine use case for enabling VMware Fault Tolerance on a virtual machine
There are a number of use cases for Fault Tolerance. Keep in mind, however, that Fault Tolerance does not protect against an OS failure or an application failure; it only protects against a host failure. Some use cases for FT might include
·         Applications that need to be highly available (especially those with long lasting client connections) that you want to survive a hardware failure.
·         Custom built applications that have no other form of clustering available.
·         It’s a simple way to provide HA to an application and doesn't require difficult and complex setups like other clustering solutions.
·         If you want to protect a key VM during a critical time to ensure there would be no downtime if a host fails.

Viewing Information about Fault Tolerant VMs

·         Fault Tolerant VMs have an additional Fault Tolerance pane on their summary tab which provides information about the Fault Tolerance setup and performance.
·         Fault Tolerance Status - Indicates the status of fault tolerance - Protected or Not Protected/Disabled.
 


·         Secondary Location - Displays the ESX/ESXi host on which the secondary virtual machine is hosted.
·         Total Secondary CPU - Indicates all secondary CPU usage, displayed in MHz.
·         Total Secondary Memory - Indicates all secondary memory usage, displayed in MB.
·         Secondary VM Lag Time - Shows the current delay between the primary and secondary VM.
·         Log Bandwidth - Shows the consumed bandwidth on the link for Record/Replay operations between the primary and secondary VM.
o   This value is based on the FT operations only, and is not the bandwidth usage on the wire (i.e., with TCP/IP/Ethernet headers).
FT Virtual Machine files
 
Maps View of an FT VM

Troubleshooting Fault Tolerant Virtual Machines

·         To maintain a high level of performance and stability for your fault tolerant virtual machines and also to minimize failover rates, you should be aware of certain troubleshooting issues.
·         The troubleshooting topics discussed focus on problems that you might encounter when using the vSphere Fault Tolerance feature on your virtual machines.

1.       Hardware Virtualization Not Enabled
You must enable Hardware Virtualization (HV) before you use vSphere Fault Tolerance.
Problem:
When you attempt to power on a virtual machine with Fault Tolerance enabled, an error message might appear if you did not enable HV.
Cause:
This error is often the result of HV not being available on the ESXi server on which you are attempting to power on the virtual machine. HV might not be available either because it is not supported by the ESXi server hardware or because HV is not enabled in the BIOS.
Solution
If the ESXi server hardware supports HV, but HV is not currently enabled, enable HV in the BIOS on that server. The process for enabling HV varies among BIOSes. See the documentation for your hosts' BIOSes for details on how to enable HV.
If the ESXi server hardware does not support HV, switch to hardware that uses processors that support Fault Tolerance.
2.       Compatible Hosts Not Available for Secondary VM
If you power on a virtual machine with Fault Tolerance enabled and no compatible hosts are available for its Secondary VM, you might receive an error message.
Problem
The following error message might appear in the Recent Task Pane:
Secondary VM could not be powered on as there are no compatible hosts that can accommodate it.
Cause
This can occur for a variety of reasons including that
·         There are no other hosts in the cluster
·         There are no other hosts with HV enabled
·         Data stores are inaccessible
·         There is no available capacity
·         Hosts are in maintenance mode.
Solution
·         If there are insufficient hosts, add more hosts to the cluster.
·         If there are hosts in the cluster, ensure they support HV and that HV is enabled. The process for enabling HV varies among BIOSes. See the documentation for your hosts' BIOSes for details on how to enable HV.
·         Check that hosts have sufficient capacity.
·         Check that hosts are not in maintenance mode.
3.       Secondary VM on Overcommitted Host Degrades Performance of Primary VM
If a Primary VM appears to be executing slowly, even though its host is lightly loaded and retains idle CPU time, check the host where the Secondary VM is running to see if it is heavily loaded.
Problem
When a Secondary VM resides on a host that is heavily loaded, this can affect the performance of the Primary VM.
Evidence of this problem could be if the vLockstep Interval on the Primary VM's Fault Tolerance panel is yellow or red. This means that the Secondary VM is running several seconds behind the Primary VM. In such cases, Fault Tolerance slows down the Primary VM. If the vLockstep Interval remains yellow or red for an extended period of time, this is a strong indication that the Secondary VM is not getting enough CPU resources to keep up with the Primary VM.
Cause
A Secondary VM running on a host that is overcommitted for CPU resources might not get the same amount of CPU resources as the Primary VM. When this occurs, the Primary VM must slow down to allow the Secondary VM to keep up, effectively reducing its execution speed to the slower speed of the Secondary VM.
Solution
To resolve this problem, set an explicit CPU reservation for the Primary VM at a MHz value sufficient to run its workload at the desired performance level. This reservation is applied to both the Primary and Secondary VMs, ensuring that both are able to execute at a specified rate. For guidance on setting this reservation, view the performance graphs of the virtual machine (prior to Fault Tolerance being enabled) to see how much CPU it used under normal conditions.
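
One way to turn performance-graph data into a reservation value is sketched below. The percentile and headroom defaults are illustrative assumptions, not VMware guidance, and the sample list stands in for CPU usage (MHz) exported from the graphs before FT was enabled.

```python
import math


def suggest_cpu_reservation_mhz(samples_mhz, percentile=95.0, headroom=0.10):
    """Suggest a reservation that covers most observed CPU demand.

    percentile and headroom are hypothetical tuning knobs for this sketch.
    """
    if not samples_mhz:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples_mhz)
    # Nearest-rank percentile: the smallest sample with at least
    # `percentile` percent of all samples at or below it.
    rank = max(1, math.ceil(percentile / 100.0 * len(ordered)))
    baseline = ordered[rank - 1]
    return round(baseline * (1.0 + headroom))


# Hypothetical usage samples in MHz taken under normal load.
usage = [400, 450, 500, 520, 480, 900, 510, 470, 495, 505]
print(suggest_cpu_reservation_mhz(usage), "MHz")
```

A lower percentile ignores short spikes and yields a smaller reservation; a higher one protects peak performance at the cost of reserving more CPU on both hosts.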
4.       Virtual Machines with Large Memory Can Prevent Use of Fault Tolerance
You can only enable Fault Tolerance on a virtual machine with a maximum of 64GB of memory.
Problem
Enabling Fault Tolerance on a virtual machine with more than 64GB memory can fail. Migrating a running fault tolerant virtual machine using vMotion also can fail if its memory is greater than 15GB or if memory is changing at a rate faster than vMotion can copy over the network.
Cause
This occurs if, due to the virtual machine’s memory size, there is not enough bandwidth to complete the vMotion switchover operation within the default timeout window (8 seconds).
Solution
To resolve this problem, before you enable Fault Tolerance, power off the virtual machine and increase its timeout window by adding the following line to the vmx file of the virtual machine:
ft.maxSwitchoverSeconds = "30"
where 30 is the timeout window in seconds. Enable Fault Tolerance and power the virtual machine back on. This solution should work except under conditions of very high network activity.
NOTE:
If you increase the timeout to 30 seconds, the fault tolerant virtual machine might become unresponsive for a longer period of time (up to 30 seconds) when enabling FT or when a new Secondary VM is created after a failover.
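
If the vmx change has to be made on several virtual machines, it can be scripted. The sketch below works on the text of a vmx file only; `set_vmx_option` is a hypothetical helper name, and reading and writing the actual file (with the virtual machine powered off, ideally after taking a copy) is left to the reader.

```python
import re


def set_vmx_option(vmx_text: str, key: str, value: str) -> str:
    """Return vmx_text with `key = "value"` set, replacing an existing entry."""
    new_line = f'{key} = "{value}"'
    pattern = re.compile(rf'^\s*{re.escape(key)}\s*=')
    lines = vmx_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        if pattern.match(line):
            lines[i] = new_line      # overwrite the existing setting
            replaced = True
    if not replaced:
        lines.append(new_line)       # option was absent, so append it
    return "\n".join(lines) + "\n"


# Hypothetical vmx content for illustration.
sample = 'displayName = "app01"\nnumvcpus = "1"\n'
print(set_vmx_option(sample, "ft.maxSwitchoverSeconds", "30"))
```

Running the helper twice with different values leaves a single entry, so it is safe to re-apply when tuning the timeout.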
5.       Secondary VM CPU Usage Appears Excessive
In some cases, you might notice that the CPU usage for a Secondary VM is higher than for its associated Primary VM.
Problem
When the Primary VM is idle, the relative difference between the CPU usage of the Primary and Secondary VMs might seem large.
Cause
Replaying events (such as timer interrupts) on the Secondary VM can be slightly more expensive than recording them on the Primary VM. This additional overhead is small.
Solution
None needed. Examining the actual CPU usage shows that very little CPU resource is being consumed by the Primary VM or the Secondary VM.
6.       Primary VM Suffers Out of Space Error
If the storage system you are using has thin provisioning built in, a Primary VM can crash when it encounters an out of space error.
Problem
When used with a thin provisioned storage system, a Primary VM can crash. The Secondary VM replaces the Primary VM, but the error message "There is no more space for virtual disk <disk_name>" appears on the vSphere client.
Cause
If thin provisioning is built into the storage system, it is not possible for ESX/ESXi hosts to know if enough disk space has been allocated for a pair of fault tolerant virtual machines. If the Primary VM asks for extra disk space but there is no space left on the storage, the primary VM crashes.
Solution
The error message gives you the choice of continuing the session by clicking "Retry" or terminating it by clicking "Cancel". Ensure that there is sufficient disk space for the fault tolerant virtual machine pair and click "Retry".
7.       Fault Tolerant Virtual Machine Failovers
A Primary or Secondary VM can fail over even though its ESXi host has not crashed. In such cases, virtual machine execution is not interrupted, but redundancy is temporarily lost. To avoid this type of failover, be aware of some of the situations when it can occur and take steps to avoid them.
Partial Hardware Failure Related to Storage
This problem can arise when access to storage is slow or down for one of the hosts. When this occurs there are many storage errors listed in the VMkernel log. To resolve this problem you must address your storage-related problems.
Partial Hardware Failure Related to Network
If the logging NIC is not functioning or connections to other hosts through that NIC are down, this can trigger a fault tolerant virtual machine to be failed over so that redundancy can be reestablished. To avoid this problem, dedicate a separate NIC each for vMotion and FT logging traffic and perform vMotion migrations only when the virtual machines are less active.
Insufficient Bandwidth on the Logging NIC Network
This can happen because of too many fault tolerant virtual machines being on a host. To resolve this problem, more broadly distribute pairs of fault tolerant virtual machines across different hosts.
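
To judge how many FT pairs a logging NIC can carry, a rough estimate helps. The sketch below uses a rule of thumb that is not part of these notes and is recalled from VMware's FT sizing guidance, so treat it as an assumption: logging bandwidth is roughly (average disk reads in MB/s x 8 + average network input in Mbps) x 1.2. The workload figures are made up.

```python
def ft_logging_mbps(avg_disk_read_mb_per_s, avg_net_in_mbps, headroom=0.2):
    """Estimate FT logging traffic for one VM in Mbps.

    Assumed rule of thumb: disk reads (MB/s, converted to Mbps) plus
    network input, with 20% headroom. Not taken from this document.
    """
    return (avg_disk_read_mb_per_s * 8 + avg_net_in_mbps) * (1 + headroom)


def host_logging_load(vms, nic_capacity_mbps=1000.0):
    """Return (total Mbps, fraction of the logging NIC) for a list of VMs.

    vms is a list of (avg_disk_read_mb_per_s, avg_net_in_mbps) tuples.
    """
    total = sum(ft_logging_mbps(disk, net) for disk, net in vms)
    return total, total / nic_capacity_mbps


# Hypothetical workloads: (disk reads MB/s, network input Mbps).
pairs = [(10, 20), (25, 50), (5, 10)]
total, share = host_logging_load(pairs)
print(f"estimated logging traffic: {total:.0f} Mbps ({share:.0%} of a 1 Gb NIC)")
```

If the estimated share approaches the NIC capacity, spreading the FT pairs across more hosts, as suggested above, is the simplest remedy.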
vMotion Failures Due to Virtual Machine Activity Level
If the vMotion migration of a fault tolerant virtual machine fails, the virtual machine might need to be failed over. Usually, this occurs when the virtual machine is too active for the migration to be completed with only minimal disruption to the activity. To avoid this problem, perform vMotion migrations only when the virtual machines are less active.
Too Much Activity on VMFS Volume Can Lead to Virtual Machine Failovers
When a number of file system locking operations, virtual machine power ons, power offs, or vMotion migrations occur on a single VMFS volume, this can trigger fault tolerant virtual machines to be failed over. A symptom that this might be occurring is receiving many warnings about SCSI reservations in the VMkernel log. To resolve this problem, reduce the number of file system operations or ensure that the fault tolerant virtual machine is on a VMFS volume that does not have an abundance of other virtual machines that are regularly being powered on, powered off, or migrated using vMotion.
Lack of File System Space Prevents Secondary VM Startup
Check whether or not your /(root) or /vmfs/datasource file systems have available space. These file systems can become full for many reasons, and a lack of space might prevent you from being able to start a new Secondary VM.
8.       VMware Fault Tolerance fails to turn on in a two node cluster
Purpose
Running FT protected virtual machines in a two node cluster is supported. Problems can occur when there is a need to vMotion the primary virtual machine from one host to the other. As the primary and secondary virtual machines cannot reside on the same host, FT must be turned off so that the secondary virtual machine is destroyed. The primary virtual machine can then be migrated with vMotion to the other host.
Resolution
This issue occurs if the monitor mode changes during the vMotion process. To keep the monitor mode from changing during vMotion, FT requires it to be set to Use Intel VT-x/AMD-V for instruction set virtualization and software for MMU virtualization.

The default setting for the virtual machine monitor mode is Automatic and FT sets the monitor mode appropriately behind the scenes. If the hosts in the cluster support the Use Intel VT-x/AMD-V for instruction set virtualization and Intel EPT/AMD RVI for MMU virtualization option, the monitor mode is changed to this during the vMotion process.

To set the monitor mode explicitly to Use Intel VT-x/AMD-V for instruction set virtualization and software for MMU virtualization:

Note: In some instances, the virtual machine needs to be powered off in order to change the monitor mode.
·         Right-click the virtual machine in question and choose Edit Settings.
·         In the virtual machine Properties window, click Options and select the CPU/MMU Virtualization option under the Advanced heading.
·         Select the radio button next to Use Intel VT-x/AMD-V for instruction set virtualization and software for MMU virtualization.
·         Click OK.
·         For the setting to take effect, the virtual machine needs to be power cycled or migrated with vMotion to another host. When this is complete, FT can be turned on for the virtual machine.

9.       Processors and guest operating systems that support VMware Fault Tolerance
Details
VMware Fault Tolerance (FT) requires specific processors (CPUs) and guest operating systems.
Solution
Processors 
VMware collaborated with AMD and Intel in providing an efficient vSphere FT capability on modern x86 processors. The collaboration required changes in both the performance counter architecture and virtualization hardware assists of both Intel and AMD and these have been included in all processors launched since early 2008.

For vSphere FT to be supported, the ESXi servers that host the Primary VM and Secondary VM must both use compatible processors. Compatible processors share the same Fault Tolerant Compatible Set as shown in the VMware Compatibility Guide (See http://www.vmware.com/resources/compatibility). Processors in different Fault Tolerant Compatible Sets are not compatible.

In general, a Fault Tolerant Compatible Set comprises processors within the same CPU vendor generation (for example, Intel Nehalem). However, processors across different generations (for example, Intel Westmere with Intel Nehalem) are not FT compatible.  This also means that vSphere FT does not support cross compatibility between Intel and AMD processors. You cannot pair Intel and AMD processors for FT virtual machines.

Lastly, some processors are only FT compatible with themselves. Those are shown as belonging to the Only With Itself set.
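
The pairing rules above can be expressed as a small check. This is a sketch under stated assumptions: the `HostCpu` record and the model strings are hypothetical, and the real set membership for a processor has to be looked up in the VMware Compatibility Guide.

```python
from dataclasses import dataclass

ONLY_WITH_ITSELF = "Only With Itself"


@dataclass(frozen=True)
class HostCpu:
    """Hypothetical record; set names come from the Compatibility Guide."""
    host: str
    model: str
    ft_set: str


def ft_cpu_compatible(a: HostCpu, b: HostCpu) -> bool:
    """True if two hosts could run a Primary/Secondary pair, CPU-wise."""
    if ONLY_WITH_ITSELF in (a.ft_set, b.ft_set):
        # Such processors pair only with the identical processor model.
        return a.model == b.model
    # Otherwise both hosts must belong to the same compatible set.
    return a.ft_set == b.ft_set


esx1 = HostCpu("esx1", "model-A", "Intel Nehalem")
esx2 = HostCpu("esx2", "model-B", "Intel Nehalem")
esx3 = HostCpu("esx3", "model-C", "Intel Westmere")
print(ft_cpu_compatible(esx1, esx2))  # same set
print(ft_cpu_compatible(esx1, esx3))  # different generations
```

Because Intel and AMD processors never share a set, the same comparison also rejects cross-vendor pairs.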
Guest Operating Systems
All guest operating systems supported with ESXi are supported with vSphere FT unless noted below. For specific guest operating system version information 

Guest Operating System - Notes or Limitations
·         Windows 7 - Requires VMware vSphere 4.0 Update 1 or greater.
·         Windows Server 2003 (32 bit) - Requires Service Pack 2 or greater when AMD Opteron Barcelona processor type is used.
·         Windows XP (32 bit) - AMD Opteron Barcelona processor type is not supported.
·         Windows 2000 - AMD Opteron Barcelona processor type is not supported.
·         Windows NT 4.0 - AMD Opteron Barcelona processor type is not supported.
·         Solaris 10 (64-bit) - Requires Solaris U1 when AMD Barcelona processor type is used.
·         Solaris 10 (32-bit) - AMD Opteron Barcelona processor type is not supported.


Note: System vendors are certifying that their systems work with FT. You can find details on the FT-certified systems at http://www.vmware.com/resources/compatibility. More systems are being certified all the time, so check back if your platform is not currently listed.

10.   Backing up Fault Tolerance virtual machines

Purpose
Back up Fault Tolerance (FT) virtual machines.

Note: As taking snapshots of FT virtual machines is not supported, VMware Consolidated Backup is also not supported for FT virtual machines.
Resolution
VMware FT does not support virtual machine snapshots in vSphere 4.x and 5.x. However, to protect against storage failure or data corruption, you can back up FT virtual machines using templates or storage snapshots.
Backing up FT virtual machines using templates
To set up the virtual machine:
·         Before turning on FT for the virtual machine, clone a template of the virtual machine. For more information, see Working with Templates and Clones in the vSphere Basic Administration Guide.
·         Turn on FT for the virtual machine. For more information, see Turning on Fault Tolerance for Virtual Machines in the vSphere Availability Guide.
·         Turn on your in-guest backup application for the FT virtual machine. For more information, consult your vendor documentation.
On recovery:
·         Deploy the template that you created in step 1 of the previous section to a virtual machine. For more information, see Working with Templates and Clones in the vSphere Basic Administration Guide.
·         Use your in-guest backup application to recover the data to this new virtual machine. For more information, consult your vendor documentation.
·         Turn on FT for the virtual machine. For more information, see Turning on Fault Tolerance for Virtual Machines in the vSphere Availability Guide.
·         Turn on your in-guest backup application for the FT virtual machine.
For updates to guest operating systems or applications that can tolerate FT being temporarily turned off:
·         Update the guest operating system or applications in the virtual machine.
·         Turn off FT.
·         Follow steps 1-3 in To set up the virtual machine.
For updates to guest operating systems or applications that cannot tolerate FT being turned off:
·         Update the guest operating system or applications in the virtual machine.
·         Deploy the template that you created in To set up the virtual machine (let this be VM2).
·         Update the guest operating system or applications in VM2.
·         Convert VM2 back to template. For more information
Note: There is a possibility that the resulting virtual machine and template will not be in sync if you use different update steps. This means that the template will not be a true clone of the virtual machine.
Backing up FT virtual machines using storage snapshots
To back up a FT virtual machine using storage snapshots:
Note: Storage snapshotting is a feature provided by the backend storage array and is different from VMware ESX snapshots.
·         Using storage snapshotting, snapshot the virtual machine files. For more information, consult your storage vendor documentation.
·         Register the new virtual machine .vmx file to another ESX host, but do not turn on FT for this virtual machine.
·         Use VMware Consolidated Backup to back up the newly registered non-FT virtual machine.
11.   Testing a VMware Fault Tolerance configuration
Symptoms
·         Fault Tolerance failure testing provides inconsistent results
·         Fault Tolerance testing only functions with a full host failure
Purpose
For configuration and troubleshooting purposes it may be necessary to test the Fault Tolerance feature of vCenter Server.
Resolution
Overview
VMware Fault Tolerance provides continuous availability to virtual machines by keeping a secondary protected virtual machine up and running and in sync in case a complete ESX host failure occurs in the environment.

However, some ESX host component failures may not cause complete server failure. In these cases, Fault Tolerance may appear to behave inconsistently.

Note: VMware recommends that you configure the Fault Tolerance logging NIC to use its own dedicated 1GB+ NIC.
Fault Tolerance failure scenarios
Currently, Fault Tolerance failures are only triggered when there is no communication between the primary and secondary virtual machines.

These three scenarios may occur:
·         A deterministic scenario, where you can predict how a failover will occur

These events are deterministic:
o   An ESX host failure which causes complete host failure
o   The primary virtual machine process fails (or is non-responsive) on the ESX host
o   A Fault Tolerance test is initiated from vCenter Server
·         A reactionary scenario, where a failover may occur but you do not know the expected outcome ahead of time

These events are reactionary:
o   Fault Tolerance logging NIC communication is interrupted or fails
o   Fault Tolerance logging NIC communication is very slow
Reactionary events are not predictable because there is a race between the primary and secondary virtual machines to see which will go live. The virtual machine that wins the race stays alive and the other is terminated. The race prevents a split brain scenario that can cause data corruption. In these cases you may see inconsistent results depending on the host that wins the ownership of the virtual machine.
·         A no action taken scenario, where no failover occurs because Fault Tolerance does not monitor for this type of event.

Fault Tolerance does not currently detect or respond to events which are not directly involved with its operation. No action is taken for these events:
o   Management network interruption or failure
o   Virtual machine network interruption or failure
o   HBA failures that do not affect the entire host
o   Any combination of the above
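
The three scenario types can be captured in a lookup table, which is handy when planning manual tests. The event names below are hypothetical labels for the events listed above; the categories themselves come from the lists.

```python
from enum import Enum


class Outcome(Enum):
    DETERMINISTIC = "failover occurs and the result is predictable"
    REACTIONARY = "failover may occur; the winner of the race is not predictable"
    NO_ACTION = "no failover; FT does not monitor this event"


# Hypothetical event labels mapped to the categories described above.
EVENT_OUTCOMES = {
    "host_failure": Outcome.DETERMINISTIC,
    "primary_vm_process_failure": Outcome.DETERMINISTIC,
    "vcenter_ft_test": Outcome.DETERMINISTIC,
    "ft_logging_nic_down": Outcome.REACTIONARY,
    "ft_logging_nic_slow": Outcome.REACTIONARY,
    "management_network_failure": Outcome.NO_ACTION,
    "vm_network_failure": Outcome.NO_ACTION,
    "partial_hba_failure": Outcome.NO_ACTION,
}


def classify(event: str) -> Outcome:
    """Look up how FT is expected to react to a single event."""
    try:
        return EVENT_OUTCOMES[event]
    except KeyError:
        raise ValueError(f"unknown event: {event}") from None


def reliable_for_manual_test(event: str) -> bool:
    """Only deterministic events give repeatable manual test results."""
    return classify(event) is Outcome.DETERMINISTIC


for name in ("host_failure", "ft_logging_nic_down", "vm_network_failure"):
    print(f"{name}: {classify(name).value}")
```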
Testing Fault Tolerance
To test VMware Fault Tolerance properly, communication between the primary and secondary virtual machines must fail. VMware provides a Test Failover function from the virtual machine, which is the best option for testing VMware Fault Tolerance failover.  If you want to perform manual failover tests, only deterministic events produce reliable results. Reactionary or no action taken scenarios can produce unexpected results.

These are proper testing scenarios with their expected outcomes:

Note: These tests assume two hosts, Host A and Host B, with the primary fault tolerant virtual machine running on Host A, and the secondary virtual machine running on Host B.
·         Select the Test Failover Function from the Fault Tolerance menu on the virtual machine.

This tests the Fault Tolerance functionality in a fully-supported and non-invasive way. In this scenario, the virtual machine fails over from Host A to Host B, and a secondary virtual machine is started back up again. VMware HA failure does not occur in this case.
·         Host A complete failure

This scenario can be accomplished by pulling the host power cable, rebooting the host, or powering off the host from a remote KVM (such as iLO, DRAC, or RSA). The secondary virtual machine on Host B takes over immediately and continues to process information for the virtual machine. VMware HA failover occurs.
·         Virtual machine process on Host A fails

This scenario can be accomplished by terminating the active process for the virtual machine by logging into Host A. The secondary virtual machine takes over and no VMware HA failure occurs. VMware does not recommend testing in this way. For more information on terminating a virtual machine.

Fault Tolerance Error Messages
Configuration Error Messages
This table lists some of the error messages you can encounter if your host or cluster is not configured appropriately to support FT:

Configuration Errors
Error Message
Description and Solution
Host CPU is incompatible with the virtual machine's requirements. Mismatch detected for these features: CPU does not match
FT requires that the hosts for the Primary and Secondary virtual machines use the same type of CPU. Enable FT on a virtual machine registered to a host with a matching CPU model, family, and stepping within the cluster. If no such hosts exist, you must add one. This error also occurs when you attempt to migrate a fault tolerant virtual machine to a different host.
The Fault Tolerance configuration of the entity {entityName} has an issue: Fault Tolerance not supported by host hardware
FT is only supported on specific processors and BIOS settings with Hardware Virtualization (HV) enabled. To resolve this issue, use hosts with supported CPU models and BIOS settings.
Virtual Machine ROM is not supported
The virtual machine is running VMI kernel and is paravirtualized. VMI is not supported by FT and should be disabled for the virtual machine.
Host {hostName} has some Fault Tolerance issues for virtual machine {vmName}. Refer to the errors list for details
To troubleshoot this issue, in the vSphere Client select the failed FT operation in either the Recent Tasks pane or the Tasks & Events tab and click the View details link that appears in the Details column.
The Fault Tolerance configuration of the entity {entityName} has an issue: Check host certificates flag not set for vCenter Server
The "check host certificates" box is not checked in the SSL settings for vCenter Server. You must check that box.
The Fault Tolerance configuration of the entity {entityName} has an issue: HA is not enabled on the virtual machine
This virtual machine is on a host that is not in a vSphere HA cluster or it has had vSphere HA disabled. Fault Tolerance requires vSphere HA.
The Fault Tolerance configuration of the entity {entityName} has an issue: Host is inactive
You must enable FT on an active host. An inactive host is one that is disconnected, in maintenance mode, or in standby mode.
Fault Tolerance has not been licensed on host {hostName}.
Fault Tolerance is not licensed in all editions of VMware vSphere. Check the edition you are running and upgrade to an edition that includes Fault Tolerance.
The Fault Tolerance configuration of the entity {entityName} has an issue: No vMotion license or no virtual NIC configured for vMotion
Verify that you have correctly configured networking on the host. If you have, then you might need to acquire a vMotion license.
The Fault Tolerance configuration of the entity {entityName} has an issue: No virtual NIC configured for Fault Tolerance logging
An FT logging NIC has not been configured.
Host {hostName} does not support virtual machines with Fault Tolerance turned on. This VMware product does not support Fault Tolerance
The product you are using is not compatible with Fault Tolerance. To use the product you must turn Fault Tolerance off. This error message primarily appears when vCenter Server is managing a host with an earlier version of ESXi/ESX or if you are using VMware Server.
The Fault Tolerance configuration of the entity {entityName} has an issue: Fault Tolerance not supported by VMware Server 2.0
Upgrade to VMware ESXi/ESX 4.1 or later.
The build or Fault Tolerance feature version on the destination host is different from the current build or Fault Tolerance feature version: {build}.
FT feature versions must be the same on current and destination hosts. Choose a compatible host or upgrade incompatible hosts.
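Several of the configuration errors above come down to a handful of yes/no checks on the host and cluster. The following is a rough pre-flight sketch of those checks. `HostConfig`, its fields, and `host_ft_issues` are hypothetical names for illustration; in a real environment the values would have to be gathered from vCenter Server, which is not shown here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class HostConfig:
    # All field names are hypothetical; they mirror the checks in the table above.
    name: str
    active: bool = True              # not disconnected, in maintenance mode, or in standby
    ft_licensed: bool = True
    hv_enabled: bool = True          # Hardware Virtualization enabled in the BIOS
    has_vmotion_nic: bool = True
    has_ft_logging_nic: bool = True


def host_ft_issues(host: HostConfig, ha_enabled: bool, check_host_certs: bool) -> list[str]:
    """Return the configuration problems found for this host (empty list if none)."""
    checks = [
        (check_host_certs, "Check host certificates flag not set for vCenter Server"),
        (ha_enabled, "HA is not enabled on the virtual machine"),
        (host.active, "Host is inactive"),
        (host.ft_licensed, f"Fault Tolerance has not been licensed on host {host.name}"),
        (host.hv_enabled, "Fault Tolerance not supported by host hardware"),
        (host.has_vmotion_nic, "No vMotion license or no virtual NIC configured for vMotion"),
        (host.has_ft_logging_nic, "No virtual NIC configured for Fault Tolerance logging"),
    ]
    return [message for ok, message in checks if not ok]
```

An empty result means none of the listed checks failed; it does not cover CPU compatibility or build matching, which depend on comparing two hosts.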



Virtual Machine Configuration Errors

There are a number of virtual machine configuration issues that can generate error messages. These are two error messages you might see if the virtual machine configuration does not support FT:
  • The Fault Tolerance configuration of the entity {entityName} has an issue: The virtual machine's current configuration does not support Fault Tolerance
  • The Fault Tolerance configuration of the entity {entityName} has an issue: Record and replay functionality not supported by the virtual machine
FT only runs on a virtual machine with a single vCPU. You might encounter these errors when attempting to turn on FT on a multiple vCPU virtual machine:
  • The virtual machine has {numCpu} virtual CPUs and is not supported for reason: Fault Tolerance
  • The Fault Tolerance configuration of the entity {entityName} has an issue: Virtual machine with multiple virtual CPUs
Fault Tolerance does not inter-operate with some vSphere features. If you attempt to turn on FT on a virtual machine using a vSphere feature which FT does not support, you might see one of these error messages. To use FT, you must disable the vSphere feature on the relevant virtual machine or enable FT on a virtual machine not using these features.
  • The Fault Tolerance configuration of the entity {entityName} has an issue: The virtual machine has one or more snapshots
  • The Fault Tolerance configuration of the entity {entityName} has an issue: Template virtual machine
These error messages might occur if your virtual machine has an unsupported device. To enable FT on this virtual machine, remove the unsupported device(s), and turn on FT.
  • The file backing ({backingFilename}) for device Virtual disk is not supported for Fault Tolerance
  • The file backing ({backingFilename}) for device Virtual Floppy is not supported for Fault Tolerance
  • The file backing ({backingFilename}) for device Virtual CDROM is not supported for Fault Tolerance
  • The file backing ({backingFilename}) for device Virtual serial port is not supported for Fault Tolerance
  • The file backing ({backingFilename}) for device Virtual parallel port is not supported for Fault Tolerance
  • The Fault Tolerance configuration of the entity {entityName} has an issue: The virtual machine has a video device with 3D enabled
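The virtual machine checks listed above can be sketched the same way. This is a minimal illustration under stated assumptions: `VmConfig` and `vm_ft_blockers` are hypothetical names, and the single-vCPU rule is the one stated in this section.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class VmConfig:
    # Hypothetical description of a VM; field names are invented for this sketch.
    name: str
    num_cpus: int = 1
    snapshot_count: int = 0
    is_template: bool = False
    video_3d_enabled: bool = False
    # Devices whose file backing FT rejects, e.g. ["Virtual Floppy"].
    unsupported_backings: list[str] = field(default_factory=list)


def vm_ft_blockers(vm: VmConfig) -> list[str]:
    """Return the reasons FT could not be turned on for this VM (empty if none)."""
    blockers = []
    if vm.num_cpus > 1:
        blockers.append("Virtual machine with multiple virtual CPUs")
    if vm.snapshot_count > 0:
        blockers.append("The virtual machine has one or more snapshots")
    if vm.is_template:
        blockers.append("Template virtual machine")
    if vm.video_3d_enabled:
        blockers.append("The virtual machine has a video device with 3D enabled")
    for device in vm.unsupported_backings:
        blockers.append(
            f"The file backing for device {device} is not supported for Fault Tolerance"
        )
    return blockers
```

Each blocker maps to one remediation from the text: reduce to one vCPU, remove snapshots, convert the template, disable 3D, or remove the unsupported device.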

This table lists other virtual machine configuration errors:

Other Virtual Machine Configuration Issues
Error Message
Description and Solution
The specified host is not compatible with the Fault Tolerance Secondary VM.
Refer to vSphere Troubleshooting for possible causes of this error.
No compatible host for the Secondary VM {vm.name}
Refer to vSphere Troubleshooting for possible causes of this error.
The virtual machine's disk {device} is using the {mode} disk mode which is not supported.
The virtual machine has one or more hard disks configured to use Independent mode. Edit the settings of the virtual machine, select each hard disk, and deselect Independent mode. Verify with your system administrator that this is acceptable for the environment.
The unused disk blocks of the virtual machine's disks have not been scrubbed on the file system. This is needed to support features like Fault Tolerance
You have attempted to turn on FT for a powered-on virtual machine which has thick-formatted disks with the property of being lazy-zeroed. FT cannot be enabled on such a virtual machine while it is powered on. Power off the virtual machine, then turn on FT and power the virtual machine back on. This changes the disk format of the virtual machine when it is powered back on. Turning on FT could take some time to complete if the virtual disk is large.
The disk blocks of the virtual machine's disks have not been fully provisioned on the file system. This is needed to support features like Fault Tolerance
You have attempted to turn on FT for a powered-on virtual machine with thin-provisioned disks. FT cannot be enabled on such a virtual machine while it is powered on. Power off the virtual machine, then turn on FT and power the virtual machine back on. This changes the disk format of the virtual machine when it is powered back on. Turning on FT could take some time to complete if the virtual disk is large.
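The disk-related rows above imply an order of operations that depends on the disk format and power state. The sketch below only encodes that ordering as a list of steps; the function name and the format strings are assumptions made for illustration, and nothing here talks to a real host.

```python
from __future__ import annotations


def ft_disk_steps(disk_format: str, independent: bool, powered_on: bool) -> list[str]:
    """Return the manual steps needed before FT can be turned on for a disk.

    disk_format is one of "thin", "lazy-zeroed thick", or "eager-zeroed thick"
    (labels chosen for this sketch).
    """
    steps = []
    if independent:
        # Independent disk mode is not supported by FT.
        steps.append("deselect Independent mode on the disk")

    # Thin and lazy-zeroed thick disks must be converted, which FT cannot do
    # while the VM is powered on.
    needs_conversion = disk_format in ("thin", "lazy-zeroed thick")
    if needs_conversion and powered_on:
        steps += [
            "power off the VM",
            "turn on FT",
            "power on the VM (disk format is converted)",
        ]
    else:
        steps.append("turn on FT")
    return steps
```

For large disks the conversion can take some time, so the power-off window should be planned accordingly.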


Operational Errors
This table lists error messages you might encounter while using fault tolerant virtual machines:

Operational Errors
Error Message
Description and Solution
No suitable host can be found to place the Fault Tolerance Secondary VM for virtual machine {vmName}
FT requires that the hosts for the Primary and Secondary virtual machines use the same CPU model or family and have the same FT version number or host build number and patch level. Enable FT on a virtual machine registered to a host with a matching CPU model or family within the cluster. If no such hosts exist, you must add one.
The Fault Tolerance Secondary VM was not powered on because the Fault Tolerance Primary VM could not be powered on.
vCenter Server will report why the primary could not be powered on. Correct the conditions and then retry the operation.
Operation to power On the Fault Tolerance Secondary VM for {vmName} could not be completed within {timeout} seconds
Retry the Secondary virtual machine power on. The timeout can occur because of networking or other transient issues.
vCenter disabled Fault Tolerance on VM {vmName} because the Secondary VM could not be powered on
To diagnose why the Secondary virtual machine could not be powered on, see vSphere Troubleshooting.
Resynchronizing Primary and Secondary VMs
Fault Tolerance has detected a difference between the Primary and Secondary virtual machines. This can be caused by transient events which occur due to hardware or software differences between the two hosts. FT has automatically started a new Secondary virtual machine, and no action is required. If you see this message frequently, you should alert support to determine if there is an issue.
The Fault Tolerance configuration of the entity {entityName} has an issue: No configuration information for the virtual machine
vCenter Server has no information about the configuration of the virtual machine. Determine if it is misconfigured. You can try removing the virtual machine from the inventory and re-registering it.
Cannot change the vSphere HA settings for Fault Tolerance Secondary VM {vmName}
The vSphere HA settings for a Secondary virtual machine cannot be changed, because it has the same settings as its Primary virtual machine. Always change only the settings of the Primary virtual machine.
Cannot change the DRS behavior for Fault Tolerance Secondary VM {vmName}.
You cannot change the DRS behavior of a Secondary virtual machine. This configuration is inherited from the Primary virtual machine.
Virtual machines in the same Fault Tolerance pair cannot be on the same host
You have attempted to migrate a Secondary virtual machine to the same host a Primary virtual machine is on. A Primary virtual machine and its Secondary virtual machine cannot reside on the same host. Select a different destination host for the Secondary virtual machine.
Cannot add a host with virtual machines that have Fault Tolerance turned On to a non-HA enabled cluster
FT requires the cluster to be enabled for vSphere HA. Edit your cluster settings and turn on vSphere HA.
Cannot add a host with virtual machines that have Fault Tolerance turned On as a stand-alone host
Turn off Fault Tolerance before adding the host as a standalone host to vCenter Server. To turn off FT, right-click each virtual machine on the host and select Turn Off Fault Tolerance. Then you can add the host as a stand-alone host.
Cannot set the HA restart priority to 'Disabled' for the Fault Tolerance VM {vmName}.
This setting is not allowed for an FT virtual machine. You only see this error if you change the restart priority of an FT virtual machine to Disabled.
Host already has the recommended number of {maxNumFtVms} Fault Tolerance VMs running on it
To power on or migrate more FT virtual machines to this host, either move one of the existing Fault Tolerance virtual machines to another host or disable this restriction by setting the vSphere HA advanced option das.maxftvmsperhost to 0.
Operations to test Fault Tolerance by terminating the primary VM or secondary VM are not allowed for the Fault Tolerance VM {vmName} at this time, because it is not protected by vSphere HA yet and therefore no action will be taken to recover Fault Tolerance protection for this VM
You tried to test failover functionality or attempted the Restart Secondary task on a virtual machine that is not protected by vSphere HA. Do not attempt these tasks until the virtual machine is protected by vSphere HA.
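The placement rules behind several of these operational errors (different host from the Primary, matching CPU and build, and the per-host FT VM limit) can be illustrated with a small selection function. This is a simplified sketch, not how vCenter Server actually chooses a host: `CandidateHost` and `pick_secondary_host` are hypothetical, and the limit is passed in explicitly to stand in for the `das.maxftvmsperhost` option, where 0 disables the restriction.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class CandidateHost:
    # Hypothetical view of a host; field names are invented for this sketch.
    name: str
    cpu_family: str
    build: str
    ft_vm_count: int = 0   # FT VMs (Primary or Secondary) already on the host


def pick_secondary_host(
    primary: CandidateHost,
    hosts: list[CandidateHost],
    max_ft_vms_per_host: int,
) -> Optional[CandidateHost]:
    """Return the first host that may run the Secondary VM, or None."""
    for host in hosts:
        if host.name == primary.name:
            continue  # Primary and Secondary cannot share a host
        if host.cpu_family != primary.cpu_family or host.build != primary.build:
            continue  # CPU and FT version/build must match
        # A limit of 0 means the restriction is disabled.
        if max_ft_vms_per_host != 0 and host.ft_vm_count >= max_ft_vms_per_host:
            continue
        return host
    return None
```

A `None` result corresponds to the "No suitable host can be found" error: either add a compatible host, move an FT virtual machine off a full host, or relax the limit.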


SDK Operational Errors
This table lists error messages you might encounter while using the SDK to perform operations:

SDK Operational Errors
Error Message
Description and Solution
This operation is not supported on a Secondary VM of a Fault Tolerant pair
An unsupported operation was performed directly on the Secondary virtual machine using the API. FT does not allow direct interaction with the Secondary virtual machine (except for relocating or migrating it to a different host).
The Fault Tolerance configuration of the entity {entityName} has an issue: Secondary VM already exists
The Primary virtual machine already has a Secondary virtual machine. Do not attempt to create multiple Secondary virtual machines for the same Primary virtual machine.
The Secondary VM with instanceUuid '{instanceUuid}' has already been enabled
An attempt was made to enable FT for a virtual machine on which FT was already enabled. Typically, such an operation would come from an API.
The Secondary VM with instanceUuid '{instanceUuid}' has already been disabled
An attempt was made to disable FT for a Secondary VM on which FT was already disabled. Typically, such an operation would come from an API.
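Most of these SDK errors come from repeating an operation that has already been applied. A script can avoid them by checking state before acting. The sketch below models that idea with a toy state object; `FtPairState`, `FtOperationError`, and `ensure_disabled` are hypothetical and do not correspond to real vSphere API classes or calls.

```python
class FtOperationError(Exception):
    """Raised where the real API would return one of the errors listed above."""


class FtPairState:
    """Toy model of an FT pair, tracking only what the errors above depend on."""

    def __init__(self):
        self.has_secondary = False
        self.enabled = False

    def create_secondary(self):
        if self.has_secondary:
            raise FtOperationError("Secondary VM already exists")
        self.has_secondary = True
        self.enabled = True

    def enable(self):
        if self.enabled:
            raise FtOperationError("The Secondary VM has already been enabled")
        self.enabled = True

    def disable(self):
        if not self.enabled:
            raise FtOperationError("The Secondary VM has already been disabled")
        self.enabled = False


def ensure_disabled(pair: FtPairState) -> None:
    """Idempotent wrapper: disable only if currently enabled."""
    if pair.enabled:
        pair.disable()
```

Calling `ensure_disabled` twice is safe, whereas calling `disable` twice raises the "already been disabled" error.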









VMware Fault Tolerance FAQ

1.       What is VMware Fault Tolerance?
VMware Fault Tolerance is a vSphere feature that provides continuous availability for a virtual machine by keeping a live secondary copy running in lockstep on another host.
2.       How do I turn it on?
The feature is enabled on a per-virtual-machine basis, from the Fault Tolerance menu on the virtual machine.
3.       What happens when I turn on Fault Tolerance?
In very general terms, a second virtual machine is created to work in tandem with the virtual machine you have enabled Fault Tolerance on. This virtual machine resides on a different host in the cluster, and runs in virtual lockstep with the primary virtual machine. When a failure is detected, the second virtual machine takes the place of the first one with the least possible interruption of service. 
4.       Why can't I turn Fault Tolerance on?
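VMware Fault Tolerance can be enabled on any virtual machine that resides in a cluster that meets the necessary requirements. If you cannot turn it on, the cluster, host, or virtual machine does not meet one of those requirements; the error messages listed earlier usually indicate which one.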
5.       How do I turn Fault Tolerance off?
Instructions for disabling Fault Tolerance can be found in the VMware Knowledge Base article Disabling or Turning Off VMware FT (1008026).
6.       How do I tell if my environment is ready for Fault Tolerance?
The VMware SiteSurvey Tool is used to check your environment for compliance with VMware Fault Tolerance. It can be downloaded at http://www.vmware.com/download/shared_utilities.html.
7.       Where do I find the product's website?
VMware has a website for the Fault Tolerance product at http://www.vmware.com/products/fault-tolerance/.
8.       What happens during a failure?
When a host running the primary virtual machine fails, a transparent failover occurs to the corresponding secondary virtual machine. During this failover, there is no data loss or noticeable service interruption. In addition, VMware HA automatically restores redundancy by starting a new secondary virtual machine on another host. Similarly, if the host running the secondary virtual machine fails, VMware HA starts a new secondary virtual machine on a different host. In either case there is no noticeable outage for an end user.
9.       What is the logging time delay between the Primary and Secondary Fault Tolerance virtual machines?
The actual delay depends on the network latency between the Primary and Secondary. vLockstep executes the same instructions on both, but because they run on different hosts there can be a small lag, typically less than 1 ms, with no loss of state. Fault Tolerance includes synchronization to keep the Primary and Secondary in step.
10.   In a cluster with more than 3 hosts, can you tell Fault Tolerance where to put the Fault Tolerance virtual machine, or does it choose on its own?
You can place the original (Primary) virtual machine yourself: you have full control, using DRS or VMotion, to assign it to any node. The placement of the Secondary, when it is created, is automatic and based on the available hosts. Once the Secondary has been created and placed, you can VMotion it to a preferred host.
11.   What happens if the host containing the primary virtual machine comes back online (after a node failure)?
This node is put back in the pool of available hosts. There is no attempt to start or migrate the primary to that host.
12.   Is the failover from the primary virtual machine to the secondary virtual machine dynamic or does Fault Tolerance restart a virtual machine?
The failover from primary to secondary virtual machine is dynamic, with the secondary continuing execution from the exact point where the primary left off. It happens automatically with no data loss, no downtime, and little delay. Clients see no interruption. After the dynamic failover, the secondary virtual machine becomes the new primary virtual machine, and a new secondary virtual machine is spawned automatically.
13.   Where are Fault Tolerance failover events logged?
All failover events are logged by vCenter.
14.   Does Fault Tolerance support Intel Hyper-Threading Technology?
Yes, Fault Tolerance does support Intel Hyper-Threading Technology on systems that have it enabled. Enabling or disabling Hyper-Threading has no impact on Fault Tolerance.
15.   What happens if vCenter Server is offline when a failover event occurs?
Once Fault Tolerance is configured for a virtual machine, vCenter Server need not be online for FT to work. Even if vCenter Server is offline, failover will still occur from the primary to the secondary virtual machine. Additionally, the spawning of a new secondary virtual machine will also occur without vCenter Server.
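The failover behaviour described in questions 8, 11, and 12 can be traced with a very small model. This is a sketch for illustration only: hosts are plain strings, `fail_over` is a hypothetical function, and the new Secondary is simply placed on the first remaining host rather than by any real placement logic.

```python
from __future__ import annotations

from typing import Optional


def fail_over(
    primary_host: str,
    secondary_host: str,
    failed_host: str,
    cluster_hosts: list[str],
) -> tuple[str, Optional[str]]:
    """Return (primary_host, secondary_host) after failed_host goes down."""
    if failed_host not in (primary_host, secondary_host):
        # The failure does not involve this FT pair; nothing changes.
        return primary_host, secondary_host

    # If the Primary's host failed, the Secondary becomes the new Primary.
    new_primary = secondary_host if failed_host == primary_host else primary_host

    # A new Secondary is started on another surviving host, if one exists.
    candidates = [h for h in cluster_hosts if h not in (failed_host, new_primary)]
    new_secondary = candidates[0] if candidates else None
    return new_primary, new_secondary
```

When the failed host later comes back online it simply rejoins `cluster_hosts`; the function is not called again, which matches the statement that no attempt is made to move the primary back.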

By IT Operations-Wiki on Tuesday, 7 January 2014