This session was presented by Greg Shields, a partner at Concentrated Technology, and focused on the main HA/DRS mistakes that people make when virtualizing their infrastructure. Greg is a great speaker and has also presented sessions at TechEd and other conferences. HA/DRS settings are so easy to set and forget, or forget to set, that everyone should review all 19 mistakes and make sure they aren’t making them in their own environment.
- HA/DRS solves two problems: Protection from unplanned downtime; Load Balancing and defragmentation of resources.
- A large number of environments have HA/DRS settings configured incorrectly.
- Mistake #1: Not planning for HW evolution.
- vMotion requires similar processors.
- Always set EVC mode
- Mistake #2: Not planning for svMotion
- VMs cannot have snapshots
- VM disks must be persistent mode or RDMs
- Host must have sufficient resources to support two instances of the VMs running concurrently.
- Must be licensed and correctly configured with vMotion
- Host must have access to both source and target datastores
- Mistake #3: Not Enough Cluster Hosts
- HA failover requires additional “wasted” hardware resources
- Must plan for cluster reserve
- A fully prepared cluster must set aside one full server’s worth of resources in preparation for HA.
- Enable admission control – Super important to enable. Will disallow starting VMs when resources are exhausted.
- Set host failures cluster tolerates to 1 (or more). Ensures you always have at least one host’s worth of resources available.
- Mistake #4: Setting Host Failures the Cluster Tolerates to 1
- Not all your VMs are priority one
- Some VMs can stay down if a host dies
- With the percentage policy, you can reserve less than one server’s worth of resources, since not all VMs need to restart if a host fails.
- Mistake #5: Forgetting to Prioritize VM restart
- VM restart priority is one of those oft-forgotten settings
- Comes into play when the Percentage policy is enabled
- Restart policy is per-host
- Per-VM settings must be configured for each VM
- This can create a problem down the road, as VMs may restart in the wrong order
- Mistake #6: Disabling Admission Control
- Many young admins may turn it off and forget about it
- Never disable admission control!!!
- Mistake #7: Not updating Percentage Policy
- Needs to be adjusted as your cluster size changes
- Host failures the cluster tolerates needs no adjusting
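To see why the percentage policy needs revisiting whenever hosts are added or removed, here is a minimal sketch of the arithmetic. It assumes all hosts are the same size and that you want to reserve exactly enough capacity for a given number of host failures; the function name `failover_percentage` is hypothetical and not part of any VMware tool.

```python
def failover_percentage(num_hosts: int, host_failures: int = 1) -> float:
    """Percent of cluster capacity to reserve so that `host_failures` hosts can fail.

    Assumption: every host in the cluster has identical capacity.
    """
    if num_hosts <= host_failures:
        raise ValueError("cluster needs more hosts than tolerated failures")
    return 100.0 * host_failures / num_hosts


# The right percentage shrinks as the cluster grows, so a value chosen
# for a 4-host cluster over-reserves once the cluster reaches 10 hosts.
for hosts in (4, 5, 8, 10):
    print(f"{hosts} hosts: reserve {failover_percentage(hosts):.1f}%")
```

Under these assumptions, a 25% setting that was correct for four hosts reserves two and a half hosts' worth of capacity in a ten-host cluster.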
- Mistake #8: Buying (the occasional) Big Server
- Host failures the cluster tolerates sets aside the amount of resources needed to protect every server.
- It must set aside resources equal to your biggest server in the cluster
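The cost of one oversized host can be sketched with simple arithmetic. This is a simplification that takes the statement above at face value (the reserve equals the largest host) and ignores how HA actually computes its reserve internally; the helper name and the memory figures are made up for illustration.

```python
from typing import Sequence


def usable_memory_gb(host_memory_gb: Sequence[int]) -> int:
    """Memory left for VMs when the largest host's worth is held in reserve.

    Assumption: the reserve equals the capacity of the biggest host.
    """
    return sum(host_memory_gb) - max(host_memory_gb)


uniform = [256, 256, 256, 256]   # four identical hosts
mixed = [256, 256, 256, 512]     # one host replaced by a bigger one

for name, hosts in (("uniform", uniform), ("mixed", mixed)):
    usable = usable_memory_gb(hosts)
    print(f"{name}: {usable} GB usable of {sum(hosts)} GB")
```

In this example the bigger server adds 256 GB of raw capacity, but all of it goes to the reserve, so usable memory stays at 768 GB.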
- Mistake #9: Neglecting Host Isolation Response
- Current recommendation is to leave powered on
- On converged networks you may not want to leave VMs powered on
- Heartbeat datastores – Adds redundancy
- Mistake #10: Assuming that Datastore heartbeats Prevent isolation Events
- Master determines the state of the unresponsive host
- Isolation response is triggered by the slave
- Mistake #11: Confusing your APD with your PDL
- An All Paths Down (APD) scenario exists when all communication is severed between host and device
- I/O is then queued until a SCSI response code officially reports the link is down
- This can lead to infinite queuing of device I/O
- A Permanent Device Loss (PDL) scenario exists when the host can see the device target but the target isn’t listening
- Lets the host recognize the I/O
- APD is a more common scenario and APD will not trigger vSphere HA
- Look at new settings in 5.0 U1 and 5.1 – Most handy for metro clusters
- Mistake #12: Overdoing Reservations, limits, and Affinities
- HA may not consider these “soft affinities” at failover
- Consider using shares over reservations and limits
- Less impact on DRS and thus HA
- Mistake #13: Considering Using Shares without Considering using Shares
- Shares are only considered during periods of contention
- But setting shares on resource pools can have unexpected results
- Don’t treat resource pools like folders
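One way to see the unexpected result is to work out how much each VM receives under contention when two sibling resource pools have equal shares but very different VM counts. The sketch below assumes every VM is demanding resources, that VMs inside a pool split the pool's allocation evenly, and uses made-up pool names and share values.

```python
from typing import Dict


def per_vm_fraction(pool_shares: Dict[str, int],
                    vm_counts: Dict[str, int]) -> Dict[str, float]:
    """Fraction of the contended resource each VM gets, by pool.

    Assumptions: sibling pools divide resources in proportion to their
    shares, and VMs within a pool divide the pool's slice equally.
    """
    total = sum(pool_shares.values())
    return {
        pool: (shares / total) / vm_counts[pool]
        for pool, shares in pool_shares.items()
    }


# Two pools used as "folders" with default, equal shares.
shares = {"production": 4000, "test": 4000}
vms = {"production": 20, "test": 2}

for pool, fraction in per_vm_fraction(shares, vms).items():
    print(f"{pool}: {fraction:.1%} of the cluster per VM")
```

Under these assumptions each test VM ends up with ten times the entitlement of a production VM, simply because the test pool holds fewer VMs.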
- Mistake #14: Doing memory limits at all
- Don’t assign memory limits. Ever.
- Limit memory as close to the application as possible (such as in the SQL app)
- Mistake #15: Thinking you are smarter than DRS
- Not using fully automated mode
- Mistake #16: Not understanding DRS’ Equations
- Every 5 minutes a DRS interval is invoked
- Takes into account VM entitlements, host capacity
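As a rough illustration of how entitlements and host capacity could combine into a single imbalance number, the sketch below normalizes each host's load and takes the standard deviation across hosts. This is an assumed model based on how DRS is commonly described (a host load standard deviation compared against a target); it is not taken from the session, and the function names and MHz values are hypothetical.

```python
from statistics import pstdev
from typing import List, Sequence, Tuple

Host = Tuple[Sequence[int], int]  # (VM entitlements in MHz, host capacity in MHz)


def host_load(vm_entitlements_mhz: Sequence[int], capacity_mhz: int) -> float:
    """Normalized load: total VM entitlement divided by host capacity."""
    return sum(vm_entitlements_mhz) / capacity_mhz


def cluster_imbalance(hosts: List[Host]) -> float:
    """Standard deviation of normalized host loads (assumed imbalance metric)."""
    loads = [host_load(entitlements, capacity) for entitlements, capacity in hosts]
    return pstdev(loads)


hosts = [
    ([4000, 3000], 10000),  # load 0.7
    ([1000], 10000),        # load 0.1
    ([2000, 2000], 10000),  # load 0.4
]
print(f"imbalance: {cluster_imbalance(hosts):.3f}")
```

In this model, moving a VM from the busiest host to the least busy one lowers the deviation, which is the kind of move a recommendation would aim for.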
- Mistake #17: Being too liberal with your migration threshold
- Pri 1 recommendations are mandatory
- Mistake #18: Combining VDI and Server Workloads in the same cluster
- ESXi hosts running VDI workloads tend to experience more load than those running server workloads.
- VDI forces DRS to work harder and more often
- Create separate clusters for VDI and everything else
- Mistake #19: Planning on Overcommit
- Overcommit creates extra work for the hypervisor
- Assign the right amount of memory to your VMs