VMworld 2012: Avoiding 19 Biggest HA & DRS Mistakes INF-VSP1232

Greg Shields, a partner at Concentrated Technology, presented this session on the main HA/DRS mistakes people make when virtualizing their infrastructure. Greg is a great speaker and has also presented sessions at TechEd and other conferences. HA/DRS settings are so easy to set and forget, or forget to set, that everyone should review all 19 mistakes and make sure they aren’t making them in their own environment.

HA/DRS solves two problems: protection from unplanned downtime, and load balancing and defragmentation of resources.
A large number of environments have HA/DRS settings configured incorrectly.
Mistake #1: Not planning for HW evolution.

vMotion requires similar processors.
Always set EVC mode

Mistake #2: Not planning for svMotion

VMs cannot have snapshots
VM disks must be in persistent mode or be RDMs
Host must have sufficient resources to support two instances of the VMs running concurrently.
Host must be licensed for vMotion and correctly configured for it
Host must have access to both source and target datastores

Mistake #3: Not Enough Cluster Hosts

HA failover requires additional “wasted” hardware resources
Must plan for cluster reserve
A fully prepared cluster must set aside one full server’s worth of resources in preparation for HA.
Enable admission control – Super important to enable. Will disallow starting VMs when resources are exhausted.
Set host failures cluster tolerates to 1 (or more). Ensures you always have at least one host’s worth of resources available.

Mistake #4: Setting Host Failures the Cluster Tolerates to 1

Not all your VMs are priority one
Some VMs can stay down if a host dies
Can set the % to less than one server’s worth of resources, since not all VMs need to restart if a host fails.
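
As a rough illustration of the sizing argument above, the sketch below estimates a failover percentage from only the VMs that must restart. The function name, the inputs, and the sample numbers are assumptions made for illustration and are not from the session; it looks at memory only, and real sizing would need to consider CPU as well.

```python
def failover_pct(must_restart_gb, cluster_gb):
    """Percentage of cluster memory to reserve so the must-restart VMs fit."""
    if cluster_gb <= 0:
        raise ValueError("cluster capacity must be positive")
    return 100.0 * must_restart_gb / cluster_gb


# Hypothetical cluster: 4 hosts x 256 GB = 1024 GB in total.
# Assumed worst case: the failed host carried 160 GB of VMs that must come back.
pct = failover_pct(must_restart_gb=160, cluster_gb=4 * 256)
one_host_pct = 100.0 / 4  # a full host's worth of this cluster

print(f"reserve {pct:.1f}% instead of a full host's {one_host_pct:.1f}%")
```

The gap between the two percentages is capacity that can run VMs instead of sitting idle in reserve.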

Mistake #5: Forgetting to Prioritize VM restart

VM restart priority is one of those oft-forgotten settings
Comes into play when the Percentage policy is enabled
Restart policy is per-host
Per-VM settings must be configured for each VM
This can create a problem down the road, as VMs may restart in the wrong order

Mistake #6: Disabling Admission Control

Many young admins may turn it off and forget about it
Never disable admission control!!!

Mistake #7: Not updating Percentage Policy

Needs to be adjusted as your cluster size changes
Host failures the cluster tolerates needs no adjusting
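
A minimal sketch of why a fixed percentage goes stale, assuming evenly sized hosts and a goal of covering one host failure. The helper name and the host counts are hypothetical.

```python
def pct_for_host_failures(num_hosts, failures=1):
    """Percentage equal to `failures` hosts' worth of an evenly sized cluster."""
    if not 0 < failures < num_hosts:
        raise ValueError("failures must be between 1 and num_hosts - 1")
    return 100.0 * failures / num_hosts


# The percentage was chosen when the cluster had 4 hosts and never revisited.
configured = pct_for_host_failures(4)

for hosts in (4, 6, 8):
    needed = pct_for_host_failures(hosts)
    extra = configured - needed
    print(f"{hosts} hosts: need {needed:.1f}%, stale setting over-reserves {extra:.1f}%")
```

Under these assumptions, a setting that was right for 4 hosts reserves twice as much as needed once the cluster reaches 8 hosts. Shrinking the cluster has the opposite effect and leaves it under-protected.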

Mistake #8: Buying (the occasional) Big Server

Host failures the cluster tolerates sets aside the amount of resources needed to protect every server.
It must set aside resources equal to your biggest server in the cluster
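
To make the cost of one oversized host concrete, here is a simplified sketch that treats the reserve as the memory of the largest host. This is an illustrative model only: the host sizes are made up, and the real policy works from slot sizes, not raw host memory.

```python
def reserve_for_host_failures(host_gb, failures=1):
    """Memory set aside if the largest `failures` hosts must be covered."""
    return sum(sorted(host_gb, reverse=True)[:failures])


uniform = [256, 256, 256, 256]
mixed = [256, 256, 256, 1024]  # the same cluster with one big server swapped in

for name, hosts in (("uniform", uniform), ("mixed", mixed)):
    reserve = reserve_for_host_failures(hosts)
    usable = sum(hosts) - reserve
    print(f"{name}: reserve {reserve} GB, usable {usable} GB of {sum(hosts)} GB")
```

In this model the big server adds 768 GB of raw memory but no usable capacity, because the whole difference is absorbed by the larger reserve.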

Mistake #9: Neglecting Host Isolation Response

Current recommendation is to leave powered on
On converged networks you may not want to leave VMs powered on
Heartbeat datastores – Adds redundancy

Mistake #10: Assuming that Datastore heartbeats Prevent isolation Events

Master determines the state of the unresponsive host
Isolation response is triggered by the slave

Mistake #11: Confusing your APD with your PDL

An All Paths Down (APD) scenario exists when all communication is severed between host and device
I/O is then queued until a SCSI response code officially reports the link is down
This can lead to infinite queuing of device I/O
A Permanent Device Loss (PDL) scenario exists when the host can see the device target but the target isn’t listening

Lets the host recognize the I/O

APD is a more common scenario and APD will not trigger vSphere HA
Look at new settings in 5.0 U1 and 5.1 – Most handy for metro clusters

Mistake #12: Overdoing Reservations, limits, and Affinities

HA may not consider these “soft affinities” at failover
Consider using shares over reservations and limits

Less impact on DRS and thus HA

Mistake #13: Considering Using Shares without Considering using Shares

Shares are only considered during periods of contention
But setting shares on resource pools can have unexpected results
Don’t treat resource pools like folders
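
One way to see the "unexpected results" is to work out what each VM ends up with when sibling pools compete. The sketch below assumes two sibling pools under contention and VMs with equal shares inside each pool. The pool names, share values, and VM counts are hypothetical.

```python
def per_vm_fraction(pools):
    """pools maps name -> (shares, vm_count).

    Returns the slice of contended resources each VM in a pool receives,
    assuming the pool's allocation is split evenly among its VMs.
    """
    total_shares = sum(shares for shares, _ in pools.values())
    return {
        name: (shares / total_shares) / vm_count
        for name, (shares, vm_count) in pools.items()
    }


# 8000 and 4000 stand in for a "High" and a "Normal" pool-level share setting.
pools = {"Production": (8000, 20), "Test": (4000, 2)}

for name, fraction in per_vm_fraction(pools).items():
    print(f"{name}: {fraction:.1%} of contended resources per VM")
```

Even though the Production pool has twice the shares, each of its 20 VMs gets a far smaller slice than a Test VM. Dropping VMs into a pool as if it were a folder changes that ratio without anyone touching a share setting.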

Mistake #14: Doing memory limits at all

Don’t assign memory limits. Ever.
Limit memory as close to the application as possible (such as in the SQL app)

Mistake #15: Thinking you are smarter than DRS

Not using fully automated mode

Mistake #16: Not understanding DRS’ Equations

Every 5 minutes a DRS interval is invoked
Takes into account VM entitlements, host capacity
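
A simplified sketch of the kind of calculation involved: normalize each host's load as the sum of its VM entitlements divided by its capacity, then measure how far the hosts deviate from one another. The real DRS algorithm is considerably more involved; the function names, the use of a population standard deviation, and all numbers here are assumptions for illustration.

```python
from statistics import pstdev


def host_loads(hosts):
    """hosts maps name -> (capacity_mhz, [vm_entitlement_mhz, ...])."""
    return {name: sum(vms) / cap for name, (cap, vms) in hosts.items()}


def imbalance(hosts):
    """Standard deviation of normalized host load; 0 means perfectly even."""
    return pstdev(list(host_loads(hosts).values()))


before = {
    "esx1": (20000, [6000, 6000, 4000]),
    "esx2": (20000, [2000, 2000]),
}
# Same VMs after moving one 6000 MHz VM from esx1 to esx2.
after = {
    "esx1": (20000, [6000, 4000]),
    "esx2": (20000, [2000, 2000, 6000]),
}

print(f"imbalance before: {imbalance(before):.2f}")
print(f"imbalance after:  {imbalance(after):.2f}")
```

A migration is worth recommending when it lowers this figure by enough to justify the cost of the move, which is where the migration threshold comes in.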

Mistake #17: Being too liberal with your migration threshold

Priority 1 recommendations are mandatory

Mistake #18: Combining VDI and Server Workloads in the same cluster

ESXi hosts running VDI workloads tend to experience more load than those running server workloads.
VDI forces DRS to work harder and more often
Create separate clusters for VDI and everything else