VMware Fault Tolerance (FT)
Introduction to VMware Fault Tolerance (FT)
·
FT provides continuous availability for a VM
o
Zero down time
·
Takes VMware HA to the next level
·
Works for all applications and 99% of guest OS.
·
Does this by creating a "live shadow"
copy of the running VM then keeping them in "lockstep" using VMware's
vLockstep.
·
If an ESX server fails, the shadow will take
over and a new shadow will be created in the cluster on another ESX server
·
The protected VM is called the "Primary" and
the copied/lockstep VM is called the "Secondary"
·
The virtual disk for the VM is on shared storage
and never moves
·
"Continuous VMotion"
Requirements of FT
·
CPUs on all FT ESXi servers must match and be
from a specific list of processors
·
Hardware Virtualization enabled in the BIOS
·
Recommended minimum number of 1 Gb NICs: 3
·
One NIC on each server must be enabled for FT
logging and one for vMotion
·
ESXi servers must be running same build
·
VMs on shared SAN, accessible by all FT servers
·
Must be enabled in a HA cluster
·
vSphere Enterprise or Enterprise Plus
Cluster Requirements
·
Host Certificate checking must be enabled
·
At least 2 FT-certified hosts running the same
FT version or host build number.
·
Hosts need access to the same storage
·
FT Logging and VMotion Networking need to be
configured.
·
HA must be enabled on the cluster. If it isn't,
you will not be able to power on an FT machine or add a host that is already
running an FT machine to the cluster.
Host Requirements
·
Must contain processors from the FT-compatible
processor group. Highly recommended that CPUs are also compatible with one
another.
·
Must be licensed for FT (Enterprise or
Enterprise Plus)
·
Must be certified for FT (HCL).
·
BIOS must have Hardware Virtualization (HV)
enabled.
VM Requirements
·
Virtual disks must either be in virtual RDM mode
or VMDK files (no physical RDM). The disk must also be in thick format.
·
VM files must be stored on shared storage (FC,
FCOE, iSCSI, NFS, NAS).
·
Cannot have more than one vCPU.
·
Must be running a supported guest OS: Windows 7, Windows Server
2008, Vista, 2003, XP, 2000, NT 4, all Linux distributions supported by ESX,
NetWare, Solaris 10, or FreeBSD (some guests have limitations on certain
processors, so check the compatibility notes).
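The VM requirements above can be expressed as a simple pre-check. The following is a minimal sketch only: the `VmSpec` and `VirtualDisk` types and the `ft_vm_blockers` function are hypothetical names invented for illustration, not part of any VMware API, and the check covers just the vCPU, storage, and disk rules listed in these notes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VirtualDisk:
    # Hypothetical model: backing is "vmdk", "virtual_rdm", or "physical_rdm".
    backing: str
    thick: bool


@dataclass
class VmSpec:
    # Hypothetical description of a VM, filled in by hand or by an inventory script.
    name: str
    num_vcpus: int
    on_shared_storage: bool
    disks: List[VirtualDisk] = field(default_factory=list)


def ft_vm_blockers(vm: VmSpec) -> List[str]:
    """Return the reasons (from the notes above) why a VM cannot use FT."""
    problems = []
    if vm.num_vcpus != 1:
        problems.append("vCPU count is not 1")
    if not vm.on_shared_storage:
        problems.append("VM files are not on shared storage")
    for index, disk in enumerate(vm.disks):
        if disk.backing == "physical_rdm":
            problems.append(f"disk {index} is a physical-mode RDM")
        if not disk.thick:
            problems.append(f"disk {index} is not thick provisioned")
    return problems


if __name__ == "__main__":
    candidate = VmSpec("db01", 2, True, [VirtualDisk("vmdk", thick=False)])
    for reason in ft_vm_blockers(candidate):
        print(f"{candidate.name}: {reason}")
```

An empty list means none of these particular rules is violated; it does not replace the validation that vCenter performs when FT is turned on.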
Constraints of FT
·
Single vCPU in each VM only (no SMP)
·
Require Specific hardware
·
Recommended maximum of 4 VMs running FT on an ESX
server
·
"Line of sight" between ESXi servers
due to latency
·
Only thick disk is supported
·
Snapshots are not allowed (including those taken by VADP
backup products)
·
Cannot invoke a Storage vMotion on a VM with FT enabled
·
Linked clones are not allowed on a VM with FT
enabled
·
Some guests not supported and some guests
require shutdown to enable
The following is not supported with FT
·
Snapshots
·
Storage vMotion
·
Linked Clones
·
Cannot back up an FT machine using the Storage APIs
for Data Protection or VMware Data Recovery. Array-based snapshots, however, do
not affect it.
·
Cannot use a floppy or CD-ROM backed by a physical
or remote device (only image and ISO files on shared storage).
·
USB and sound devices
·
NPIV
·
NIC passthrough
·
vlance networking drivers
·
No hot-pluggable features (includes changing
attached networks).
·
EPT/RVI
·
Serial or parallel ports
·
IPv6
·
3D enabled video drivers.
Testing to see if you can use FT with VMware Site Survey
·
Site Survey saves time by automating this check
·
Run Site Survey on your cluster to see if you
can use FT
Enabling VMware FT
·
Once requirements have been met, enabling FT is
easy
o
Right Click on a VM
o
Go to Fault Tolerance
o
Click Turn on Fault Tolerance
Configure VMware Fault Tolerance networking
Prerequisites
·
Multiple Gigabit NICs. Each host will need at
least two, one for FT Logging and one for vMotion.
·
Configuring the networking is straightforward:
create two VMkernel ports, one for vMotion and one for FT Logging.
*** NOTE *** FT logging traffic is not encrypted, so secure this network as well
as you can, ideally by placing it on a private network.
·
After you have created the VMkernel port for FT
logging, your host's Summary tab should show 'Configured for FT'. If there is an
issue, the little blue comment box will display what it is as you hover over
it.
Configure VMkernel NIC
Enable/Disable VMware Fault Tolerance on a virtual machine
Enable Fault Tolerance
This is straightforward. Right-click a
VM and select 'Fault Tolerance' -> 'Turn On Fault Tolerance'.
This option may be dimmed if
·
The VM is registered on a host that isn't
licensed for FT
·
The VM is on a host that is in maintenance or
standby
·
The VM is disconnected or orphaned
·
The user doesn't have the permission to do this.
After selecting Turn On Fault Tolerance, the
following validation checks are performed
·
SSL certification checking is enabled
·
The host is in a vSphere HA cluster or mixed HA
and DRS cluster
·
The host has ESX(i) 4.0 or greater installed
·
The VM doesn't have multiple vCPUs, snapshots, HA
disabled, or a 3D video device.
·
Checks the BIOS for HV
·
Checks processors for primary and secondary
·
Checks processors in conjunction with the OS
The following occurs when enabling FT
·
A secondary VM is created. The placement and
status of this VM will vary depending on the power state of the primary VM
o
If Primary is Powered ON
§
Entire state of primary VM is copied and the
secondary is created, placed on a separate host and powered on (if it passes
admission control).
§
FT status on the VM's Summary tab will be
'Protected'
o
If Primary is powered off
§
Secondary is immediately created and registered
to a host in the cluster (this could even be the same host as the primary, but
it will be moved on power on).
§
Secondary VM will not be powered on until the
primary is powered on.
§
FT status will display 'Not Protected, VM not
Running'
·
Once Fault Tolerance is enabled, vCenter will
remove the VM's memory limits and reservations and set a new memory reservation
equal to the memory size of the VM. While FT is enabled on this VM you cannot
change memory reservations, limits, size, or shares. If you disable FT, these
values are not reverted.
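The effect on memory settings described above can be modelled in a few lines. This is an illustrative sketch, not vCenter code: `MemorySettings` and `after_enabling_ft` are hypothetical names, and the model only captures the rule stated in these notes (limit removed, reservation set to the configured memory size).

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class MemorySettings:
    # Hypothetical model of a VM's memory configuration, all values in MB.
    size_mb: int
    reservation_mb: int
    limit_mb: Optional[int]  # None means "unlimited"


def after_enabling_ft(mem: MemorySettings) -> MemorySettings:
    """Apply the rule from the notes: drop the limit, reserve the full memory size."""
    return replace(mem, reservation_mb=mem.size_mb, limit_mb=None)


if __name__ == "__main__":
    before = MemorySettings(size_mb=4096, reservation_mb=512, limit_mb=3072)
    print(after_enabling_ft(before))
```

Because the old values are not restored when FT is disabled, it is worth recording the original reservation and limit before turning FT on.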
Once enabled, the FT section in the summary
tab will show you the following
·
FT Status
o
Protected – Primary and secondary are powered on
and running as expected
o
Not Protected – Secondary VM is not running. It
will also provide a reason
·
Starting – FT is in the process of starting the
secondary.
·
Need Secondary VM – Primary VM is running
without a secondary. Normally caused by the
inability to create a secondary due to incompatible hosts. If there are
compatible hosts, disabling and re-enabling FT will sometimes fix this.
·
Disabled – FT is currently disabled (either by
the user, or by vCenter Server after it was unable to power on the
secondary).
·
VM Not Running – FT is enabled, but the primary is
powered off.
·
Secondary Location – shows which host is running
the secondary VM
·
Total Secondary CPU – shows the CPU usage of the
secondary VM (MHz)
·
Total Secondary Memory – shows the total memory
usage of the secondary (MB)
·
vLockstep Interval – The time interval in
seconds needed for the secondary VM to match the current execution state of the
Primary. Typically less than 1/2 a second. No state will be lost even if this
interval is high.
·
Log Bandwidth – Amount of network capacity used
to send FT log info from the host running the primary to the host running the
secondary.
To disable, right-click the VM and choose 'Fault Tolerance' -> 'Turn Off
Fault Tolerance'.
FT in the VMkernel
·
The FT vmkernel module is called vmklogger.
·
Log entries are put in the log buffer, which is
flushed/filled asynchronously.
·
Log entries are sent/received through socket on
VMkernel NIC.
·
There should be a dedicated VMkernel network for
logging which has FT Logging enabled.
Test an FT configuration
VMware provides a couple of FT scenarios that can be tested
Testing FT Failover
·
The secondary machine will become the new
primary, the old primary is then removed.
·
A new secondary machine will spawn up and sync
up with the new primary.
Testing Restart Secondary
·
This will destroy the current secondary VM and restart
another one.
·
The primary is unaffected during this test.
Determine use case for enabling VMware Fault Tolerance on a virtual machine
There are a number of use cases for Fault Tolerance. Keep in mind, however, that
Fault Tolerance does not protect against an OS failure or an
application failure; it only protects against a host failure. Some use cases
for FT might include
·
Applications that need to be highly available
(especially those with long lasting client connections) that you want to
survive a hardware failure.
·
Custom built applications that have no other
form of clustering available.
·
It’s a simple way to provide HA to an
application and doesn't require difficult and complex setups like other clustering
solutions.
·
If you want to protect a key VM during a
critical time to ensure there would be no downtime if a host fails.
Viewing Information about Fault Tolerant VMs
·
Fault Tolerant VMs have an additional Fault
Tolerance pane on their summary tab which provides information about the Fault
Tolerance setup and performance.
·
Fault Tolerance Status - Indicates the status of
fault tolerance - Protected or Not Protected/Disabled.
·
Secondary Location - Displays the ESX/ESXi host
on which the secondary virtual machine is hosted.
·
Total Secondary CPU - Indicates all secondary
CPU usage, displayed in MHz.
·
Total Secondary Memory - Indicates all secondary
memory usage, displayed in MB.
·
Secondary VM Lag Time shows the current delay
between the primary and secondary VM.
·
Log Bandwidth shows the consumed bandwidth on
the link for Record/Replay operations between the primary and secondary VM.
o
This value is based on the FT operations only;
it is not the bandwidth usage on the wire (i.e., including TCP/IP/Ethernet headers).
FT Virtual Machine files
Maps View of an FT VM
Troubleshooting Fault Tolerant Virtual Machines
·
To maintain a high level of performance
and stability for your fault tolerant virtual machines and also to minimize
failover rates, you should be aware of certain troubleshooting issues.
·
The troubleshooting topics discussed
focus on problems that you might encounter when using the vSphere Fault
Tolerance feature on your virtual machines.
1. Hardware Virtualization Not Enabled
You must enable
Hardware Virtualization (HV) before you use vSphere Fault Tolerance.
Problem:
When you attempt to
power on a virtual machine with Fault Tolerance enabled, an error message might
appear if you did not enable HV.
Cause:
This error is often
the result of HV not being available on the ESXi server on which you are
attempting to power on the virtual machine. HV might not be available either
because it is not supported by the ESXi server hardware or because HV is not
enabled in the BIOS.
Solution
If the ESXi server
hardware supports HV, but HV is not currently enabled, enable HV in the BIOS on
that server. The process for enabling HV varies among BIOSes. See the
documentation for your hosts' BIOSes for details on how to enable HV.
If the ESXi server
hardware does not support HV, switch to hardware that uses processors that
support Fault Tolerance
2. Compatible Hosts Not Available for Secondary VM
If you power on a
virtual machine with Fault Tolerance enabled and no compatible hosts are
available for its Secondary VM, you might receive an error message.
Problem
The following error
message might appear in the Recent Task Pane:
Secondary VM could
not be powered on as there are no compatible hosts that can accommodate it.
Cause
This can occur for a
variety of reasons including that
·
There are no other hosts in the cluster
·
There are no other hosts with HV
enabled
·
Data stores are inaccessible
·
There is no available capacity
·
Hosts are in maintenance mode.
Solution
·
If there are insufficient hosts, add
more hosts to the cluster.
·
If there are hosts in the cluster,
ensure they support HV and that HV is enabled. The process for enabling HV
varies among BIOSes. See the documentation for your hosts' BIOSes for details
on how to enable HV.
·
Check that hosts have sufficient
capacity.
·
Check that hosts are not in maintenance mode.
3. Secondary VM on Overcommitted Host Degrades Performance of Primary VM
If a Primary VM
appears to be executing slowly, even though its host is lightly loaded and
retains idle CPU
time, check the host
where the Secondary VM is running to see if it is heavily loaded.
Problem
When a Secondary VM
resides on a host that is heavily loaded, this can affect the performance of
the Primary VM.
Evidence of this
problem could be if the vLockstep Interval on the Primary VM's Fault Tolerance
panel is yellow or red. This means that the Secondary VM is running several
seconds behind the Primary VM. In such cases, Fault Tolerance slows down the
Primary VM. If the vLockstep Interval remains yellow or red for an extended
period of time, this is a strong indication that the Secondary VM is not
getting enough CPU resources to keep up with the Primary VM.
Cause
A Secondary VM
running on a host that is overcommitted for CPU resources might not get the
same amount of CPU resources as the Primary VM. When this occurs, the Primary
VM must slow down to allow the Secondary VM to keep up, effectively reducing
its execution speed to the slower speed of the Secondary VM.
Solution
To resolve this
problem, set an explicit CPU reservation for the Primary VM at a MHz value
sufficient to run its workload at the desired performance level. This
reservation is applied to both the Primary and Secondary VMs ensuring that both
are able to execute at a specified rate. For guidance setting this reservation,
view the performance graphs of the virtual machine (prior to Fault Tolerance
being enabled) to see how much CPU it used under normal conditions.
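One way to turn the performance graphs into a reservation value is to take a high percentile of the observed CPU usage. The sketch below is only an assumption about how you might do this: the function name and the choice of the 95th percentile are illustrative, not a VMware recommendation, and the samples are expected to be MHz readings exported from the VM's performance chart before FT was enabled.

```python
import math
from typing import Sequence


def suggest_cpu_reservation_mhz(samples_mhz: Sequence[float],
                                percentile: float = 95.0) -> int:
    """Return the nearest-rank percentile of the CPU usage samples, rounded up to whole MHz."""
    if not samples_mhz:
        raise ValueError("need at least one CPU usage sample")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(samples_mhz)
    # Nearest-rank method: smallest value with at least `percentile` % of samples at or below it.
    rank = max(1, math.ceil(percentile / 100.0 * len(ordered)))
    return math.ceil(ordered[rank - 1])


if __name__ == "__main__":
    usage = [400, 450, 500, 900, 1200, 600, 550, 480, 520, 610]
    print(suggest_cpu_reservation_mhz(usage))
```

A lower percentile gives a smaller reservation that covers typical load but not peaks; which trade-off is right depends on how much slowdown the workload can tolerate.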
4. Virtual Machines with Large Memory Can Prevent Use of Fault Tolerance
You can only enable
Fault Tolerance on a virtual machine with a maximum of 64GB of memory.
Problem
Enabling Fault
Tolerance on a virtual machine with more than 64GB memory can fail. Migrating a
running
fault tolerant
virtual machine using vMotion also can fail if its memory is greater than 15GB
or if memory is changing at a rate faster than vMotion can copy over the
network.
Cause
This occurs if, due
to the virtual machine’s memory size, there is not enough bandwidth to complete
the
vMotion switchover
operation within the default timeout window (8 seconds).
Solution
To resolve this
problem, before you enable Fault Tolerance, power off the virtual machine and
increase its
timeout window by
adding the following line to the vmx file of the virtual machine:
ft.maxSwitchoverSeconds = "30"
where 30 is the
timeout window in seconds. Enable Fault Tolerance and power the
virtual machine back on. This solution should work except under conditions of
very high network activity.
NOTE:
If you increase the
timeout to 30 seconds, the fault tolerant virtual machine might become
unresponsive
for a longer period
of time (up to 30 seconds) when enabling FT or when a new Secondary VM is
created after a failover.
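If you prefer to script the .vmx change rather than edit the file by hand, the edit amounts to setting one `key = "value"` line. The helper below is a minimal sketch that works on the file's text only; `set_vmx_option` is a hypothetical name, and the script assumes the virtual machine is powered off and that you copy the .vmx file back to the datastore yourself.

```python
import re


def set_vmx_option(vmx_text: str, key: str, value: str) -> str:
    """Return vmx_text with `key = "value"` set, replacing an existing entry if present."""
    new_line = f'{key} = "{value}"'
    pattern = re.compile(r"^\s*" + re.escape(key) + r"\s*=")
    lines = vmx_text.splitlines()
    replaced = False
    for index, line in enumerate(lines):
        if pattern.match(line):
            lines[index] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    original = 'displayName = "db01"\nmemsize = "4096"\n'
    print(set_vmx_option(original, "ft.maxSwitchoverSeconds", "30"), end="")
```

Running the helper twice with different values leaves a single entry, so it is safe to re-apply when tuning the timeout.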
5. Secondary VM CPU Usage Appears Excessive
In some cases, you
might notice that the CPU usage for a Secondary VM is higher than for its
associated Primary VM.
Problem
When the Primary VM
is idle, the relative difference between the CPU usage of the Primary and
Secondary
VMs might seem large.
Cause
Replaying events
(such as timer interrupts) on the Secondary VM can be slightly more expensive
than recording them on the Primary VM. This additional overhead is small.
Solution
None needed.
Examining the actual CPU usage shows that very little CPU resource is being
consumed by the Primary VM or the Secondary VM.
6. Primary VM Suffers Out of Space Error
If the storage system
you are using has thin provisioning built in, a Primary VM can crash when it
encounters an out of space error.
Problem
When used with a thin
provisioned storage system, a Primary VM can crash. The Secondary VM replaces
the Primary VM, but the error message "There is no more space for virtual
disk <disk_name>" appears in the vSphere Client.
Cause
If thin provisioning
is built into the storage system, it is not possible for ESX/ESXi hosts to know
if enough disk space has been allocated for a pair of fault tolerant virtual
machines. If the Primary VM asks for extra disk space but there is no space
left on the storage, the primary VM crashes.
Solution
The error message
gives you the choice of continuing the session by clicking "Retry" or
clicking "Cancel" to terminate the session. Ensure that there is
sufficient disk space for the fault tolerant virtual machine pair and click
"Retry"
7. Fault Tolerant Virtual Machine Failovers
A Primary or
Secondary VM can fail over even though its ESXi host has not crashed. In such
cases, virtual
machine execution is
not interrupted, but redundancy is temporarily lost. To avoid this type of
failover, be
aware of some of the
situations when it can occur and take steps to avoid them.
Partial Hardware Failure Related to Storage
This problem can
arise when access to storage is slow or down for one of the hosts. When this
occurs there are many storage errors listed in the VMkernel log. To resolve
this problem you must address your storage-related problems.
Partial Hardware Failure Related to Network
If the logging NIC is
not functioning or connections to other hosts through that NIC are down, this
can trigger a fault tolerant virtual machine to be failed over so that
redundancy can be reestablished. To avoid this problem, dedicate a separate NIC
each for vMotion and FT logging traffic and perform vMotion migrations only
when the virtual machines are less active.
Insufficient Bandwidth on the Logging NIC Network
This can happen
because of too many fault tolerant virtual machines being on a host. To resolve
this problem, more broadly distribute pairs of fault tolerant virtual machines
across different hosts.
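To judge whether a host carries too many fault tolerant virtual machines for its logging NIC, you can estimate the logging traffic per VM and add it up. The sketch below uses a rule of thumb I recall from VMware's FT performance guidance, (average disk reads in MB/s x 8 + average network input in Mbps) x 1.2; treat the formula, its 20% headroom, and the function names as assumptions to verify against current VMware documentation.

```python
from typing import Iterable, Tuple


def estimate_ft_logging_mbps(disk_read_mbytes_per_s: float,
                             net_in_mbps: float,
                             headroom: float = 0.2) -> float:
    """Rough FT logging bandwidth for one VM in Mbit/s (assumed rule of thumb)."""
    # Disk reads are converted from MB/s to Mbit/s, then headroom is added.
    return (disk_read_mbytes_per_s * 8 + net_in_mbps) * (1 + headroom)


def logging_nic_utilisation(vms: Iterable[Tuple[float, float]],
                            nic_capacity_mbps: float = 1000.0) -> float:
    """Fraction of the logging NIC used by all FT VMs on a host.

    Each entry in `vms` is (disk_read_mbytes_per_s, net_in_mbps).
    """
    total = sum(estimate_ft_logging_mbps(disk, net) for disk, net in vms)
    return total / nic_capacity_mbps


if __name__ == "__main__":
    host_vms = [(5.0, 20.0), (10.0, 40.0)]
    print(f"{logging_nic_utilisation(host_vms):.1%} of a 1 Gb logging NIC")
```

If the estimated utilisation approaches the NIC capacity, spreading the FT pairs across more hosts, as described above, is the first thing to try.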
vMotion Failures Due to Virtual Machine Activity Level
If the vMotion
migration of a fault tolerant virtual machine fails, the virtual machine might
need to be failed over. Usually, this occurs when the virtual machine is too
active for the migration to be completed with only minimal disruption to the
activity. To avoid this problem, perform vMotion migrations only when the virtual
machines are less active.
Too Much Activity on VMFS Volume Can Lead to Virtual Machine Failovers
When a number of file
system locking operations, virtual machine power ons, power offs, or vMotion
migrations occur on a
single VMFS volume, this can trigger fault tolerant virtual machines to be
failed over. A symptom that this might be occurring is receiving many warnings
about SCSI reservations in the VMkernel log. To resolve this problem, reduce
the number of file system operations or ensure that the fault tolerant virtual
machine is on a VMFS volume that does not have an abundance of other virtual
machines that are regularly being powered on, powered off, or migrated using
vMotion.
Lack of File System Space Prevents Secondary VM Startup
Check whether or not
your /(root) or /vmfs/datasource file systems have available space. These file
systems can become full for many reasons, and a lack of space might prevent you
from being able to start a new Secondary VM.
8. VMware Fault Tolerance fails to turn on in a two node cluster
Purpose
Running FT protected
virtual machines in a two node cluster is supported. Problems can
occur when there is a need to vMotion the primary virtual machine from one host
to the other. As the primary and secondary virtual machines cannot reside on
the same host, FT must be turned off so that the secondary virtual machine is
destroyed. The primary virtual machine can then be vMotioned to the other
host.
Resolution
This issue occurs
if the monitor mode changes during the vMotion process. To keep the monitor
mode from changing during vMotion, FT requires it to be set to Use Intel
VT-x/AMD-V for instruction set virtualization and software for MMU
virtualization.
The default setting
for the virtual machine monitor mode is Automatic and FT sets the monitor mode
appropriately behind the scenes. If the hosts in the cluster support the Use
Intel VT-x/AMD-V for instruction set virtualization and Intel EPT/AMD RVI for
MMU virtualization option, the monitor mode is changed to this during the
vMotion process.
To set the monitor
mode explicitly to Use Intel VT-x/AMD-V for instruction set virtualization
and software for MMU virtualization:
Note: In some
instances, the virtual machine needs to be powered off in order to change the
monitor mode.
·
Right-click the virtual machine in
question and choose Edit Settings.
·
In the virtual machine Properties
window, click Options and select the CPU/MMU Virtualization option
under the Advanced heading.
·
Select the radio button next to Use
Intel VT-x/AMD-V for instruction set virtualization and software for MMU
virtualization.
·
Click OK.
·
For the setting to take effect, the
virtual machine needs to be power cycled or migrated with vMotion to another
host. When this is complete, FT can be turned on for the virtual machine.
9. Processors and guest operating systems that support VMware Fault Tolerance
Details
VMware Fault
Tolerance (FT) requires specific processors (CPUs) and guest operating systems.
Solution
Processors
VMware collaborated
with AMD and Intel in providing an efficient vSphere FT capability on modern
x86 processors. The collaboration required changes in both the performance
counter architecture and virtualization hardware assists of both Intel and AMD
and these have been included in all processors launched since early 2008.
For vSphere FT to be
supported, the ESXi servers that host the Primary VM and Secondary VM must both
use compatible processors. Compatible processors share the same Fault Tolerant
Compatible Set as shown in the VMware Compatibility Guide (See http://www.vmware.com/resources/compatibility).
Processors in different Fault Tolerant Compatible Sets are not compatible.
In general, a Fault
Tolerant Compatible Set comprises processors within the same CPU vendor
generation (for example, Intel Nehalem). However, processors across different
generations (for example, Intel Westmere with Intel Nehalem) are not FT
compatible. This also means that vSphere FT does not support cross
compatibility between Intel and AMD processors. You cannot pair Intel and AMD
processors for FT virtual machines.
Lastly, some
processors are only FT compatible with themselves. Those are shown as belonging
to the Only With Itself set.
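The compatibility rules above reduce to a small comparison once you have looked up each host's processor in the VMware Compatibility Guide. This is a sketch under assumptions: the `HostCpu` type, the `ft_cpu_compatible` function, and the set labels used in the example are hypothetical placeholders, and the real set membership must come from the guide.

```python
from dataclasses import dataclass

# Label taken from the notes; processors in this set pair only with the same model.
ONLY_WITH_ITSELF = "Only With Itself"


@dataclass(frozen=True)
class HostCpu:
    vendor: str   # e.g. "Intel" or "AMD"
    model: str    # processor model string as listed in the guide
    ft_set: str   # Fault Tolerant Compatible Set label from the guide


def ft_cpu_compatible(a: HostCpu, b: HostCpu) -> bool:
    """True if two hosts' processors may hold a Primary/Secondary pair."""
    if a.vendor != b.vendor:
        return False  # Intel and AMD cannot be paired
    if a.ft_set != b.ft_set:
        return False  # different compatible sets are not compatible
    if a.ft_set == ONLY_WITH_ITSELF:
        return a.model == b.model
    return True


if __name__ == "__main__":
    host_a = HostCpu("Intel", "Model A", "example-set-1")  # placeholder values
    host_b = HostCpu("Intel", "Model B", "example-set-1")
    print(ft_cpu_compatible(host_a, host_b))
```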
Guest Operating Systems
All guest operating
systems supported with ESXi are supported with vSphere FT unless noted below.
For specific guest operating system version information
Guest Operating System | Notes or Limitations
Windows 7 | Requires VMware vSphere 4.0 Update 1 or greater.
Windows Server 2003 (32 bit) | Requires Service Pack 2 or greater when AMD Opteron Barcelona processor type is used.
Windows XP (32 bit) | AMD Opteron Barcelona processor type is not supported.
Windows 2000 | AMD Opteron Barcelona processor type is not supported.
Windows NT 4.0 | AMD Opteron Barcelona processor type is not supported.
Solaris 10 (64-bit) | Requires Solaris U1 when AMD Barcelona processor type is used.
Solaris 10 (32-bit) | AMD Opteron Barcelona processor type is not supported.
Note:
System vendors are certifying that their systems work with FT. You can find
details on the FT-certified systems at http://www.vmware.com/resources/compatibility.
More systems are being certified all the time, so check back if your platform
is not currently listed.
10. Backing up Fault Tolerance virtual machines
Purpose
Back up Fault
Tolerance (FT) virtual machines.
Note: As taking snapshots
of FT virtual machines is not supported, VMware Consolidated Backup is also not
supported for FT virtual machines.
Resolution
VMware FT does not
support virtual machine snapshots in vSphere 4.x and 5.x. However, to protect
against storage failure or data corruption, you can back up FT virtual machines
using templates or storage snapshots.
Backing up FT virtual machines using templates
To set up the virtual
machine:
·
Before turning on FT for the virtual
machine, clone a template of the virtual machine. For more information,
see Working with Templates and Clones in the vSphere Basic
Administration Guide.
·
Turn on FT for the virtual machine. For
more information, see Turning on Fault Tolerance for Virtual Machines in
the vSphere Availability Guide.
·
Turn on your in-guest backup
application for the FT virtual machine. For more information, consult your
vendor documentation.
On recovery:
·
Deploy the template that you created in
step 1 of the previous section to a virtual machine. For more information, see Working
with Templates and Clones in the vSphere Basic Administration Guide.
·
Use your in-guest backup application to
recover the data to this new virtual machine. For more information, consult
your vendor documentation.
·
Turn on FT for the virtual machine. For
more information, see Turning on Fault Tolerance for Virtual Machines
in the vSphere Availability Guide.
·
Turn on your in-guest backup
application for the FT virtual machine.
For updates to guest operating systems or applications that can tolerate FT being
temporarily turned off:
·
Update the guest operating system or
applications in the virtual machine.
·
Turn off FT.
·
Follow steps 1-3 in To set up the
virtual machine.
For updates to guest operating systems or applications that cannot tolerate FT
being turned off:
·
Update the guest operating system or
applications in the virtual machine.
·
Deploy the template that you created in
To set up the virtual machine (let this be VM2).
·
Update the guest operating system or
applications in VM2.
·
Convert VM2 back to template. For more
information
Note: There is a
possibility that the resulting virtual machine and template will not be in sync
if you use different update steps. This means that the template will not
be a true clone of the virtual machine.
Backing up FT virtual machines using storage snapshots
To back up an FT
virtual machine using storage snapshots:
Note:
Storage snapshotting is a feature provided by the backend storage array
and is different than VMware ESX snapshots.
·
Using storage snapshotting, snapshot
the virtual machine files. For more information, consult your storage vendor
documentation.
·
Register the new virtual machine .vmx
file to another ESX host, but do not turn on FT for this virtual machine.
·
Use VMware Consolidated Backup to back
up the newly registered non-FT virtual machine.
11. Testing a VMware Fault Tolerance configuration
Symptoms
·
Fault Tolerance failure testing
provides inconsistent results
·
Fault Tolerance testing only functions
with a full host failure
Purpose
For configuration and
troubleshooting purposes it may be necessary to test the Fault Tolerance
feature of vCenter Server.
Resolution
Overview
VMware Fault Tolerance
provides continuous availability to virtual machines by keeping a secondary
protected virtual machine up and running and in sync in case a complete ESX
host failure occurs in the environment.
However, some ESX host component failures may not cause complete server failure. In these cases, Fault Tolerance may appear to behave inconsistently.
Note: VMware recommends that you configure the Fault Tolerance logging NIC to use its own dedicated 1GB+ NIC.
Fault Tolerance failure scenarios
Currently, Fault
Tolerance failures are only triggered when there is no communication between
the primary and secondary virtual machines.
These three scenarios may occur:
·
A deterministic scenario, where you can
predict how a failover will occur
These events are deterministic:
o
An ESX host failure which causes
complete host failure
o
The primary virtual machine process
fails (or is non-responsive) on the ESX host
o
A Fault Tolerance test is initiated
from vCenter Server
·
A reactionary scenario, where a
failover may occur but you do not know the expected outcome ahead of time
These events are reactionary:
o
Fault Tolerance logging NIC
communication is interrupted or fails
o
Fault Tolerance logging NIC
communication is very slow
Reactionary events are not predictable
because there is a race between the primary and secondary virtual machines to
see which will go live. The virtual machine that wins the race stays alive and
the other is terminated. The race prevents a split brain scenario that can
cause data corruption. In these cases you may see inconsistent results
depending on the host that wins the ownership of the virtual machine.
·
A no action taken scenario, where no
failover occurs because Fault Tolerance does not monitor for this type of
event.
Fault Tolerance does not currently detect or respond to events which are not directly involved with its operation. No action is taken for these events:
o
Management network interruption or
failure
o
Virtual machine network interruption or
failure
o
HBA failures that do not affect the
entire host
o
Any combination of the above
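The three scenario classes above can be captured as a lookup table, which is handy when planning manual failover tests. This is an illustrative sketch only: the event keys and the `FtOutcome` and `classify_event` names are hypothetical labels for the events listed in these notes, not identifiers used by vCenter.

```python
from enum import Enum


class FtOutcome(Enum):
    DETERMINISTIC = "failover is predictable"
    REACTIONARY = "failover may occur, outcome not predictable"
    NO_ACTION = "FT does not react to this event"


# Hypothetical event labels mapped to the scenario classes described in the notes.
EVENT_OUTCOMES = {
    "host_failure": FtOutcome.DETERMINISTIC,
    "primary_vm_process_failure": FtOutcome.DETERMINISTIC,
    "test_failover_from_vcenter": FtOutcome.DETERMINISTIC,
    "ft_logging_nic_down": FtOutcome.REACTIONARY,
    "ft_logging_nic_slow": FtOutcome.REACTIONARY,
    "management_network_failure": FtOutcome.NO_ACTION,
    "vm_network_failure": FtOutcome.NO_ACTION,
    "partial_hba_failure": FtOutcome.NO_ACTION,
}


def classify_event(event: str) -> FtOutcome:
    """Look up how FT is expected to treat a given failure event."""
    try:
        return EVENT_OUTCOMES[event]
    except KeyError:
        raise ValueError(f"unknown event: {event}") from None


def is_reliable_manual_test(event: str) -> bool:
    """Only deterministic events are expected to give repeatable test results."""
    return classify_event(event) is FtOutcome.DETERMINISTIC


if __name__ == "__main__":
    for name in ("host_failure", "ft_logging_nic_slow", "vm_network_failure"):
        print(name, "->", classify_event(name).name)
```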
Testing Fault Tolerance
To test VMware Fault
Tolerance properly, communication between the primary and secondary virtual
machines must fail. VMware provides a Test Failover function from the virtual
machine, which is the best option for testing VMware Fault Tolerance
failover. If you want to perform manual failover tests, only
deterministic events produce reliable results. Reactionary or no action taken
scenarios can produce unexpected results.
These are proper testing scenarios with their expected outcomes:
Note: These tests assume two hosts, Host A and Host B, with the primary fault tolerant virtual machine running on Host A, and the secondary virtual machine running on Host B.
·
Select the Test Failover Function from
the Fault Tolerance menu on the virtual machine.
This tests the Fault Tolerance functionality in a fully supported and non-invasive way. In this scenario, the virtual machine fails over from Host A to Host B, and a secondary virtual machine is started back up again. VMware HA failure does not occur in this case.
·
Host A complete failure
This scenario can be accomplished by pulling the host power cable, rebooting the host, or powering off the host from a remote KVM (such as iLO, DRAC, or RSA). The secondary virtual machine on Host B takes over immediately and continues to process information for the virtual machine. VMware HA failover occurs.
·
Virtual machine process on Host A fails
This scenario can be accomplished by terminating the active process for the virtual machine by logging into Host A. The secondary virtual machine takes over and no VMware HA failure occurs. VMware does not recommend testing in this way. For more information on terminating a virtual machine.
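The three scenarios above can be recorded as a small lookup so that an observed test result can be compared with the expected outcome. This is only a sketch: the scenario keys, field names, and the `check_result` helper are hypothetical labels invented for this example, not part of any VMware tool.

```python
# Expected outcomes of the three FT test scenarios described above.
# Keys and field names are hypothetical labels used only in this sketch.
EXPECTED_OUTCOMES = {
    "test_failover_menu": {"secondary_takes_over": True, "ha_failover": False, "discouraged": False},
    "host_a_failure":     {"secondary_takes_over": True, "ha_failover": True,  "discouraged": False},
    "vm_process_killed":  {"secondary_takes_over": True, "ha_failover": False, "discouraged": True},
}


def check_result(scenario, secondary_took_over, ha_failover_seen):
    """Return True when the observed behaviour matches the expected outcome."""
    expected = EXPECTED_OUTCOMES[scenario]
    return (
        secondary_took_over == expected["secondary_takes_over"]
        and ha_failover_seen == expected["ha_failover"]
    )


# Example: a host power pull where the secondary took over and HA failover was seen.
print(check_result("host_a_failure", secondary_took_over=True, ha_failover_seen=True))
```

The `discouraged` flag marks the one scenario that VMware does not recommend as a test method.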
Fault Tolerance Error Messages
Configuration Error Messages
This table lists some of the error messages you can encounter if your host or cluster is not configured appropriately to support FT:
Configuration Errors
Error Message | Description and Solution
Host CPU is incompatible with the virtual machine's requirements. Mismatch detected for these features: CPU does not match | FT requires that the hosts for the Primary and Secondary virtual machines use the same type of CPU. Enable FT on a virtual machine registered to a host with a matching CPU model, family, and stepping within the cluster. If no such hosts exist, you must add one. This error also occurs when you attempt to migrate a fault tolerant virtual machine to a different host.
The Fault Tolerance configuration of the entity {entityName} has an issue: Fault Tolerance not supported by host hardware | FT is only supported on specific processors and BIOS settings with Hardware Virtualization (HV) enabled. To resolve this issue, use hosts with supported CPU models and BIOS settings.
Virtual Machine ROM is not supported | The virtual machine is running a VMI kernel and is paravirtualized. VMI is not supported by FT and should be disabled for the virtual machine.
Host {hostName} has some Fault Tolerance issues for virtual machine {vmName}. Refer to the errors list for details | To troubleshoot this issue, in the vSphere Client select the failed FT operation in either the Recent Tasks pane or the Tasks & Events tab and click the View details link that appears in the Details column.
The Fault Tolerance configuration of the entity {entityName} has an issue: Check host certificates flag not set for vCenter Server | The "check host certificates" box is not checked in the SSL settings for vCenter Server. You must check that box.
The Fault Tolerance configuration of the entity {entityName} has an issue: HA is not enabled on the virtual machine | This virtual machine is on a host that is not in a vSphere HA cluster or it has had vSphere HA disabled. Fault Tolerance requires vSphere HA.
The Fault Tolerance configuration of the entity {entityName} has an issue: Host is inactive | You must enable FT on an active host. An inactive host is one that is disconnected, in maintenance mode, or in standby mode.
Fault Tolerance has not been licensed on host {hostName}. | Fault Tolerance is not licensed in all editions of VMware vSphere. Check the edition you are running and upgrade to an edition that includes Fault Tolerance.
The Fault Tolerance configuration of the entity {entityName} has an issue: No vMotion license or no virtual NIC configured for vMotion | Verify that you have correctly configured networking on the host. If you have, then you might need to acquire a vMotion license.
The Fault Tolerance configuration of the entity {entityName} has an issue: No virtual NIC configured for Fault Tolerance logging | An FT logging NIC has not been configured.
Host {hostName} does not support virtual machines with Fault Tolerance turned on. This VMware product does not support Fault Tolerance | The product you are using is not compatible with Fault Tolerance. To use the product you must turn Fault Tolerance off. This error message primarily appears when vCenter Server is managing a host with an earlier version of ESXi/ESX or if you are using VMware Server.
The Fault Tolerance configuration of the entity {entityName} has an issue: Fault Tolerance not supported by VMware Server 2.0 | Upgrade to VMware ESXi/ESX 4.1 or later.
The build or Fault Tolerance feature version on the destination host is different from the current build or Fault Tolerance feature version: {build}. | FT feature versions must be the same on current and destination hosts. Choose a compatible host or upgrade incompatible hosts.
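Most of the configuration errors above correspond to a single prerequisite that is either met or not. A minimal sketch of a pre-flight check follows, assuming a hypothetical `HostState` record filled in by hand or by your own inventory script. It is not a vSphere API type, and the returned strings are simply the issue texts quoted in the table.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    """Hypothetical snapshot of the settings the errors above complain about."""
    hardware_virtualization: bool   # HV enabled in the BIOS
    check_host_certificates: bool   # vCenter Server SSL setting
    ha_enabled: bool                # host is in a vSphere HA cluster
    active: bool                    # not disconnected, in maintenance or standby mode
    ft_licensed: bool
    vmotion_nic: bool
    ft_logging_nic: bool


def configuration_issues(host: HostState) -> list[str]:
    """Return the issue text for every prerequisite that is not met."""
    checks = [
        (host.hardware_virtualization, "Fault Tolerance not supported by host hardware"),
        (host.check_host_certificates, "Check host certificates flag not set for vCenter Server"),
        (host.ha_enabled, "HA is not enabled on the virtual machine"),
        (host.active, "Host is inactive"),
        (host.ft_licensed, "Fault Tolerance has not been licensed on host"),
        (host.vmotion_nic, "No vMotion license or no virtual NIC configured for vMotion"),
        (host.ft_logging_nic, "No virtual NIC configured for Fault Tolerance logging"),
    ]
    return [message for ok, message in checks if not ok]


# Example: everything is in place except the FT logging NIC.
host = HostState(True, True, True, True, True, True, False)
for issue in configuration_issues(host):
    print(issue)
```

Returning every unmet prerequisite at once, rather than stopping at the first, mirrors the way vCenter Server lists several issues for one failed FT operation.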
Virtual Machine Configuration Errors
There are a number of virtual machine configuration issues that can generate error messages. These are two error messages you might see if the virtual machine configuration does not support FT:
The Fault Tolerance configuration of the entity {entityName} has an issue: The virtual machine's current configuration does not support Fault Tolerance
The Fault Tolerance configuration of the entity {entityName} has an issue: Record and replay functionality not supported by the virtual machine
FT only runs on a virtual machine with a single vCPU. You might encounter these errors when attempting to turn on FT on a multiple vCPU virtual machine:
The virtual machine has {numCpu} virtual CPUs and is not supported for reason: Fault Tolerance
The Fault Tolerance configuration of the entity {entityName} has an issue: Virtual machine with multiple virtual CPUs
Fault Tolerance does not interoperate with some vSphere features. If you attempt to turn on FT on a virtual machine that uses a vSphere feature FT does not support, you might see one of these error messages. To use FT, you must disable the vSphere feature on the relevant virtual machine or enable FT on a virtual machine that does not use these features.
The Fault Tolerance configuration of the entity {entityName} has an issue: The virtual machine has one or more snapshots
The Fault Tolerance configuration of the entity {entityName} has an issue: Template virtual machine
The following error messages might occur if your virtual machine has an unsupported device. To enable FT on this virtual machine, remove the unsupported device(s), and then turn on FT.
The file backing ({backingFilename}) for device Virtual disk is not supported for Fault Tolerance
The file backing ({backingFilename}) for device Virtual Floppy is not supported for Fault Tolerance
The file backing ({backingFilename}) for device Virtual CDROM is not supported for Fault Tolerance
The file backing ({backingFilename}) for device Virtual serial port is not supported for Fault Tolerance
The file backing ({backingFilename}) for device Virtual parallel port is not supported for Fault Tolerance
The Fault Tolerance configuration of the entity <VM Name> has an issue: The virtual machine has a video device with 3D enabled
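The virtual machine checks described above (single vCPU, no snapshots, not a template, no unsupported device backings, no 3D video) can be combined into one pre-flight function. The sketch below assumes a hypothetical `VmConfig` record that you populate yourself. It does not query vCenter Server, and the field names are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class VmConfig:
    """Hypothetical summary of a VM's configuration; not a vSphere API type."""
    num_cpu: int = 1
    has_snapshots: bool = False
    is_template: bool = False
    video_3d_enabled: bool = False
    # Device names whose file backing FT rejects, e.g. "Virtual Floppy".
    unsupported_backings: list[str] = field(default_factory=list)


def ft_blockers(vm: VmConfig) -> list[str]:
    """List the reasons FT cannot be turned on for this VM (empty if none)."""
    blockers = []
    if vm.num_cpu > 1:
        blockers.append("Virtual machine with multiple virtual CPUs")
    if vm.has_snapshots:
        blockers.append("The virtual machine has one or more snapshots")
    if vm.is_template:
        blockers.append("Template virtual machine")
    if vm.video_3d_enabled:
        blockers.append("The virtual machine has a video device with 3D enabled")
    for device in vm.unsupported_backings:
        blockers.append(f"The file backing for device {device} is not supported")
    return blockers


# Example: a two-vCPU VM with a snapshot and a mapped floppy image.
vm = VmConfig(num_cpu=2, has_snapshots=True, unsupported_backings=["Virtual Floppy"])
for reason in ft_blockers(vm):
    print(reason)
```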
This table lists other virtual machine configuration errors:
Other Virtual Machine Configuration Issues
Error Message | Description and Solution
The specified host is not compatible with the Fault Tolerance Secondary VM. | Refer to vSphere Troubleshooting for possible causes of this error.
No compatible host for the Secondary VM {vm.name} | Refer to vSphere Troubleshooting for possible causes of this error.
The virtual machine's disk {device} is using the {mode} disk mode which is not supported. | The virtual machine has one or more hard disks configured to use Independent mode. Edit the setting of the virtual machine, select each hard disk, and deselect Independent mode. Verify with your system administrator that this is acceptable for the environment.
The unused disk blocks of the virtual machine's disks have not been scrubbed on the file system. This is needed to support features like Fault Tolerance | You have attempted to turn on FT for a powered-on virtual machine which has thick-formatted disks with the property of being lazy-zeroed. FT cannot be enabled on such a virtual machine while it is powered on. Power off the virtual machine, then turn on FT and power the virtual machine back on. This changes the disk format of the virtual machine when it is powered back on. Turning on FT could take some time to complete if the virtual disk is large.
The disk blocks of the virtual machine's disks have not been fully provisioned on the file system. This is needed to support features like Fault Tolerance | You have attempted to turn on FT for a powered-on virtual machine with thin-provisioned disks. FT cannot be enabled on such a virtual machine while it is powered on. Power off the virtual machine, then turn on FT and power the virtual machine back on. This changes the disk format of the virtual machine when it is powered back on. Turning on FT could take some time to complete if the virtual disk is large.
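The disk-related rows above reduce to a short decision: Independent mode must be deselected first, and a thin or lazy-zeroed thick disk can only be converted while the virtual machine is powered off. The sketch below encodes that decision under stated assumptions. The `DiskFormat` enum and `next_step` function are hypothetical, and `EAGER_ZEROED_THICK` is this sketch's label for a thick disk whose blocks have already been scrubbed.

```python
from enum import Enum


class DiskFormat(Enum):
    THIN = "thin"
    LAZY_ZEROED_THICK = "lazy-zeroed thick"
    EAGER_ZEROED_THICK = "eager-zeroed thick"  # assumed label for an already-scrubbed disk


def next_step(disk_format: DiskFormat, powered_on: bool, independent_mode: bool = False) -> str:
    """Suggest what to do before FT can be turned on for a VM with this disk."""
    if independent_mode:
        # Independent disk mode is rejected regardless of format.
        return "deselect Independent mode on the disk"
    if disk_format is DiskFormat.EAGER_ZEROED_THICK:
        return "turn on FT"
    if powered_on:
        # Thin and lazy-zeroed disks cannot be converted while the VM runs.
        return "power off the VM, turn on FT, then power it back on"
    return "turn on FT (the disk is converted, which can take a while for large disks)"


# Example: a running VM with a thin-provisioned disk.
print(next_step(DiskFormat.THIN, powered_on=True))
```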
Operational Errors
This table lists error messages you might encounter while using fault tolerant virtual machines:
Operational Errors
Error Message | Description and Solution
No suitable host can be found to place the Fault Tolerance Secondary VM for virtual machine {vmName} | FT requires that the hosts for the Primary and Secondary virtual machines use the same CPU model or family and have the same FT version number or host build number and patch level. Enable FT on a virtual machine registered to a host with a matching CPU model or family within the cluster. If no such hosts exist, you must add one.
The Fault Tolerance Secondary VM was not powered on because the Fault Tolerance Primary VM could not be powered on. | vCenter Server will report why the primary could not be powered on. Correct the conditions and then retry the operation.
Operation to power On the Fault Tolerance Secondary VM for {vmName} could not be completed within {timeout} seconds | Retry the Secondary virtual machine power on. The timeout can occur because of networking or other transient issues.
vCenter disabled Fault Tolerance on VM {vmName} because the Secondary VM could not be powered on | To diagnose why the Secondary virtual machine could not be powered on, see vSphere Troubleshooting.
Resynchronizing Primary and Secondary VMs | Fault Tolerance has detected a difference between the Primary and Secondary virtual machines. This can be caused by transient events which occur due to hardware or software differences between the two hosts. FT has automatically started a new Secondary virtual machine, and no action is required. If you see this message frequently, you should alert support to determine if there is an issue.
The Fault Tolerance configuration of the entity {entityName} has an issue: No configuration information for the virtual machine | vCenter Server has no information about the configuration of the virtual machine. Determine if it is misconfigured. You can try removing the virtual machine from the inventory and re-registering it.
Cannot change the vSphere HA settings for Fault Tolerance Secondary VM {vmName} | The vSphere HA settings for a Secondary virtual machine cannot be changed, because it has the same settings as its Primary virtual machine. Always change only the settings of the Primary virtual machine.
Cannot change the DRS behavior for Fault Tolerance Secondary VM {vmName}. | You cannot change the DRS behavior of a Secondary virtual machine. This configuration is inherited from the Primary virtual machine.
Virtual machines in the same Fault Tolerance pair cannot be on the same host | You have attempted to migrate a Secondary virtual machine to the same host a Primary virtual machine is on. A Primary virtual machine and its Secondary virtual machine cannot reside on the same host. Select a different destination host for the Secondary virtual machine.
Cannot add a host with virtual machines that have Fault Tolerance turned On to a non-HA enabled cluster | FT requires the cluster to be enabled for vSphere HA. Edit your cluster settings and turn on vSphere HA.
Cannot add a host with virtual machines that have Fault Tolerance turned On as a stand-alone host | Turn off Fault Tolerance before adding the host as a standalone host to vCenter Server. To turn off FT, right-click each virtual machine on the host and select Turn Off Fault Tolerance. Then you can add the host as a stand-alone host.
Cannot set the HA restart priority to 'Disabled' for the Fault Tolerance VM {vmName}. | This setting is not allowed for an FT virtual machine. You only see this error if you change the restart priority of an FT virtual machine to Disabled.
Host already has the recommended number of {maxNumFtVms} Fault Tolerance VMs running on it | To power on or migrate more FT virtual machines to this host, either move one of the existing Fault Tolerance virtual machines to another host or disable this restriction by setting the vSphere HA advanced option das.maxftvmsperhost to 0.
Operations to test Fault Tolerance by terminating the primary VM or secondary VM are not allowed for the Fault Tolerance VM {vmName} at this time, because it is not protected by vSphere HA yet and therefore no action will be taken to recover Fault Tolerance protection for this VM | You tried to test failover functionality or attempted the Restart Secondary task on a virtual machine that is not protected by vSphere HA. Do not attempt these tasks until the virtual machine is protected by vSphere HA.
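Several of the operational errors above come down to where a Secondary virtual machine is allowed to run: not on the Primary's host, only on a host with a matching CPU family and FT version, and only on a host that is below the per-host FT limit (where a das.maxftvmsperhost value of 0 disables the limit). The filter below is a sketch of those rules only. The `Host` record and `secondary_candidates` function are hypothetical, and real placement is decided by vCenter Server, not by code like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    """Hypothetical host record used only in this sketch."""
    name: str
    cpu_family: str
    ft_version: str
    ft_vm_count: int  # FT virtual machines already running on the host


def secondary_candidates(primary_host: Host, hosts: list[Host], max_ft_vms_per_host: int) -> list[str]:
    """Return the names of hosts that could hold the Secondary VM."""
    candidates = []
    for host in hosts:
        if host.name == primary_host.name:
            continue  # an FT pair cannot share a host
        if host.cpu_family != primary_host.cpu_family:
            continue  # CPU model or family must match
        if host.ft_version != primary_host.ft_version:
            continue  # FT version or build must match
        if max_ft_vms_per_host != 0 and host.ft_vm_count >= max_ft_vms_per_host:
            continue  # host is already at the limit; 0 means no limit
        candidates.append(host.name)
    return candidates


hosts = [
    Host("esx-a", "family-1", "ft-2", 1),
    Host("esx-b", "family-1", "ft-2", 4),
    Host("esx-c", "family-2", "ft-2", 0),
    Host("esx-d", "family-1", "ft-2", 0),
]
print(secondary_candidates(hosts[0], hosts, max_ft_vms_per_host=4))
```

An empty result corresponds to the "No suitable host can be found" error in the first row of the table.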
SDK Operational Errors
This table lists error messages you might encounter while using the SDK to perform operations:
SDK Operational Errors
Error Message | Description and Solution
This operation is not supported on a Secondary VM of a Fault Tolerant pair | An unsupported operation was performed directly on the Secondary virtual machine using the API. FT does not allow direct interaction with the Secondary virtual machine (except for relocating or migrating it to a different host).
The Fault Tolerance configuration of the entity {entityName} has an issue: Secondary VM already exists | The Primary virtual machine already has a Secondary virtual machine. Do not attempt to create multiple Secondary virtual machines for the same Primary virtual machine.
The Secondary VM with instanceUuid '{instanceUuid}' has already been enabled | An attempt was made to enable FT for a virtual machine on which FT was already enabled. Typically, such an operation would come from an API.
The Secondary VM with instanceUuid '{instanceUuid}' has already been disabled | An attempt was made to disable FT for a Secondary VM on which FT was already disabled. Typically, such an operation would come from an API.
VMware Fault Tolerance FAQ
1. What is VMware Fault Tolerance?
VMware Fault Tolerance is a feature that provides a new level of guest redundancy: a running virtual machine is protected by a live secondary copy on another host.
2. How do I turn it on?
The feature is enabled on a per virtual machine basis.
3. What happens when I turn on Fault Tolerance?
In very general terms, a second virtual machine is created to work in tandem with the virtual machine you have enabled Fault Tolerance on. This virtual machine resides on a different host in the cluster, and runs in virtual lockstep with the primary virtual machine. When a failure is detected, the second virtual machine takes the place of the first one with the least possible interruption of service.
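4. Why can't I turn Fault Tolerance on?
VMware Fault Tolerance can be enabled on any virtual machine that resides in a cluster that meets the necessary requirements. If you cannot turn it on, the cluster, host, or virtual machine most likely does not meet one of those requirements.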
5. How do I turn Fault Tolerance off?
Instructions for disabling Fault Tolerance can be found in the article Disabling or Turning Off VMware FT (1008026).
6. How do I tell if my environment is ready for Fault Tolerance?
The VMware SiteSurvey Tool is used to check your environment for compliance with VMware Fault Tolerance. It can be downloaded at http://www.vmware.com/download/shared_utilities.html.
7. Where do I find the product's website?
VMware has a website for the Fault Tolerance product at http://www.vmware.com/products/fault-tolerance/.
8. What happens during a failure?
When a host running the primary virtual machine fails, a transparent failover occurs to the corresponding secondary virtual machine. During this failover, there is no data loss or noticeable service interruption. In addition, VMware HA automatically restores redundancy by restarting a new secondary virtual machine on another host. Similarly, if the host running the secondary virtual machine fails, VMware HA starts a new secondary virtual machine on a different host. In either case, end users see no noticeable outage.
9. What is the logging time delay between the Primary and Secondary Fault Tolerance virtual machines?
The actual delay is based on the network latency between the Primary and Secondary. vLockstep executes the same instructions on the Primary and Secondary, but because this happens on different hosts, there could be a small latency, with no loss of state. The latency is typically less than 1 ms. Fault Tolerance includes a synchronization mechanism that keeps the Primary and Secondary in step.
10. In a cluster with more than 3 hosts, can you tell Fault Tolerance where to put the Fault Tolerance virtual machine or does it choose on its own?
You can place the original (Primary) virtual machine yourself: with DRS or VMotion you have full control and can assign it to any node. The placement of the Secondary, when created, is automatic based on the available hosts. Once the Secondary has been created and placed, you can VMotion it to the preferred host.
11. What happens if the host containing the primary virtual machine comes back online (after a node failure)?
This node is put back in the pool of available hosts. There is no attempt to start or migrate the primary to that host.
12. Is the failover from the primary virtual machine to the secondary virtual machine dynamic or does Fault Tolerance restart a virtual machine?
The failover from primary to secondary virtual machine is dynamic, with the secondary continuing execution from the exact point where the primary left off. It happens automatically with no data loss, no downtime, and little delay. Clients see no interruption. After the dynamic failover, the secondary virtual machine becomes the new primary virtual machine, and a new secondary virtual machine is spawned automatically.
13. Where are Fault Tolerance failover events logged?
All failover events are logged by vCenter.
14. Does Fault Tolerance support Intel Hyper-Threading Technology?
Yes. Fault Tolerance supports Intel Hyper-Threading Technology on systems that have it enabled. Enabling or disabling Hyper-Threading has no impact on Fault Tolerance.
15. What happens if vCenter Server is offline when a failover event occurs?
Once Fault Tolerance is configured for a virtual machine, vCenter Server need not be online for FT to work. Even if vCenter Server is offline, failover will still occur from the primary to the secondary virtual machine. Additionally, the spawning of a new secondary virtual machine will also occur without vCenter Server.