VMworld 2015: vCenter Server HA

Session INF4945

Why is vCenter HA important?

  • Primary administrative console
  • Critical component in end-to-end cloud provisioning
  • Foundation for VDI
  • Backup and DR solutions rely on vCenter
  • vCenter availability target is 99.99% from VMware’s design perspective (about 4 to 5 minutes of downtime per month)
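The "four nines" target above implies a concrete downtime budget; a quick back-of-the-envelope check:

```python
# Downtime budget implied by an availability target.
# 99.99% ("four nines") allows ~52.6 minutes of downtime per year,
# i.e. roughly 4.4 minutes per month -- close to the "5 min a month"
# figure quoted in the session.

MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_budget_minutes(availability: float) -> float:
    """Minutes of allowed downtime per year for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR

per_year = downtime_budget_minutes(0.9999)
per_month = per_year / 12
print(f"{per_year:.1f} min/year, {per_month:.1f} min/month")
# 52.6 min/year, 4.4 min/month
```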


Make every layer of the vCenter stack HA

  • vCenter DB
  • Host
  • SAN
  • Network
  • DC power and cooling

Reduce dependencies to improve nines

  • Moving from 5.1/5.5 to 6.0 consolidates vCenter services into fewer VMs (e.g. just the PSC and vCenter in 6.0)
  • vCenter 5.5 U3 supports SQL AlwaysOn Availability Groups (AAGs)
  • vCenter 6.0 U1 supports SQL AAGs

Hardware/Host Failure protection: vSphere HA

  • Tried-and-tested solution
  • Protects against hardware failures
  • Some downtime for failover
  • Easy to set up and manage
  • DRS rules can be leveraged
  • High restart priority for vCenter components

Hardware/host failure protection: vSphere FT

  • Continuous availability with zero downtime and zero data loss
  • vCenter tested with FT for 4 vCPUs or less (only the ‘tiny’ and ‘small’ deployments fit)
  • About 20% overhead
  • Downtime during guest OS patching

Application failure protection: Watchdog

  • Watchdog monitors and protects vCenter applications
  • Automatically enabled at install time on both the VCSA and Windows
  • On failure, the watchdog attempts to restart processes; if the restart fails, the VM is rebooted
  • Separate watchdog per vCenter server component
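The restart-then-reboot escalation above can be sketched as a toy loop. The retry count, the check/restart callables, and the simulated service are illustrative only, not the actual watchdog implementation:

```python
# Toy sketch of watchdog escalation: try to restart a failed service a
# few times; if restarts keep failing, escalate to a VM reboot. The
# retry limit and the callables here are illustrative, not vCenter's.

from typing import Callable

MAX_RESTARTS = 3

def watchdog_step(is_healthy: Callable[[], bool],
                  restart: Callable[[], bool]) -> str:
    """One monitoring pass: returns the action the watchdog took."""
    if is_healthy():
        return "ok"
    for _ in range(MAX_RESTARTS):
        if restart() and is_healthy():
            return "restarted"
    return "reboot-vm"   # escalation: restarts failed, reboot the guest

# Simulated service that is down once, then restarts cleanly.
state = {"up": False}
action = watchdog_step(lambda: state["up"],
                       lambda: state.update(up=True) or True)
print(action)  # restarted
```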

Application failure protection: Windows Server Failover clustering

  • Provides protection against OS level and application downtime
  • Provides protection for database
  • Some downtime during failure
  • Reduces downtime during OS patching
  • Tested with vCenter 5.5 and 6.0

Platform Services Controller HA

  • Two models: Embedded PSC or external PSC
  • PSC high availability in 6.0 requires a third party load balancer (removed in future vSphere versions)
  • Multiple PSC nodes in same site

vCenter Backup

  • Backup both embedded PSC and external PSC configurations
  • Recover from failures to vCenter node, PSC node or both
  • When vCenter node restored, it connects to PSC and reconciles the differences
  • When PSC node restored, it replicates from the other nodes
  • Uses VADP
  • Out-of-the box integration with VMware VDP

Tech Preview (vSphere 6.1?): Native HA

  • Native active-passive HA
  • Uses witness
  • No third party technology needed
  • Recover in minutes (target is 15 minutes), not hours
  • Protects against hardware, host and application failures
  • No shared storage required
  • 1-click automated HA setup
  • Fully integrated into the product
  • Out of box for the VCSA
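The role of the witness in the active-passive design can be sketched as a simple two-of-three vote. This is a conceptual model only; the actual tech-preview algorithm was not detailed in the session:

```python
# Minimal sketch of witness-based failover: the passive node promotes
# itself only when both it AND the witness have lost contact with the
# active node (two of three votes). This avoids split-brain when the
# active-passive link alone is partitioned. Conceptual model only.

def should_promote(passive_sees_active: bool,
                   witness_sees_active: bool) -> bool:
    """Promote the passive node only with witness agreement."""
    return (not passive_sees_active) and (not witness_sees_active)

print(should_promote(False, False))  # True: active is really down
print(should_promote(False, True))   # False: partition, not a failure
```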

VMworld 2015: 5 Functions of SW Defined Availability

Session: INF4535

Duncan Epping, Frank Denneman

Introduction to SDA (Software defined availability): VM, server, storage, data center, networking, management. Business only cares about the application, not the underlying infrastructure.

vSphere HA

  • Configured through vCenter but not dependent on it
  • An agent (FDM) is installed on each host to monitor state
  • HA restarts VMs when failure impacts those VMs
  • Heartbeats via network and storage to communicate availability
  • Can use management network or VSAN network if VSAN is enabled
  • Need spare resources
  • Admission control – Allows you to reserve resources in case of a host failure
  • Admission control guarantees VM receives their reserved resources after a restart, but does not guarantee that VMs perform well after a restart.
  • Best practices: Select policy that best meets your needs, enable DRS, simulate failures to test performance
  • Percentage based is by far the most used policy and is Duncan’s recommendation
  • Duncan went through various failure scenarios (host failure, host isolation, storage failure) and how HA restarts the VMs.
  • Use VMCP (new in 6.0) [VM component protection]. Helps protects against storage connectivity loss.
  • Generic recommendations: disable “host monitoring” during network maintenance; make sure you have a redundant management network; enable PortFast; use admission control
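The percentage-based admission control policy mentioned above can be sketched as a simple capacity check. All numbers and the single-resource model are illustrative; real admission control evaluates CPU and memory reservations separately:

```python
# Sketch of percentage-based admission control: a slice of cluster
# capacity is held back so failed-over VMs can still claim their
# reservations. Single-resource toy model; numbers are illustrative.

def can_power_on(total_capacity: float, reserved: float,
                 vm_reservation: float, failover_pct: float) -> bool:
    """Admit a VM only if its reservation fits outside the failover buffer."""
    usable = total_capacity * (1 - failover_pct / 100)
    return reserved + vm_reservation <= usable

# 4-host cluster, 100 GHz total, 25% reserved for failover (~1 host).
print(can_power_on(100.0, 70.0, 4.0, 25))  # True: 74 <= 75
print(can_power_on(100.0, 70.0, 6.0, 25))  # False: 76 > 75
```

This also shows why the percentage must be revisited when hosts are added or removed: 25% approximates one host in a four-host cluster, but not in an eight-host one.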


  • DRS provides load balancing and initial placement
  • DRS is the broker of resources between producers and consumers
  • DRS goal is to provide the resources the VM demands
  • DRS provides cluster management (maintenance mode, affinity/anti-affinity rules)
  • DRS keeps VMs happy; it doesn’t perfectly balance each host
  • DRS affinity rules: Control the placement of VMs on hosts within a cluster.
  • DRS highest priority is to solve any violation of affinity rules.
  • VM-Host groups are configurable with mandatory affinity/anti-affinity rules (must rules) or preferential rules (should rules)
  • A mandatory (must) rule limits HA, DRS and the user
  • Why use resource pools? Powerful abstraction for managing a group of VMs. Set business requirements on a resource pool.
  • Bottom line is resource pools are complex, and VMs may not get the resources you think they should. Only use them when needed.
  • Keep the number of affinity rules as low as possible. Prefer preferential (should) rules.
  • Tweak aggressiveness slider if cluster is unbalanced.
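The must-vs-should distinction above boils down to hard filtering versus soft preference. A minimal sketch, with illustrative host names and a simplified rule model:

```python
# Sketch of how mandatory (must) vs preferential (should) VM-Host rules
# constrain placement: a must rule hard-filters candidate hosts (even
# HA honors it), while a should rule only expresses preference and can
# be violated when necessary. Host names are illustrative.

def candidate_hosts(hosts, allowed, mandatory):
    """Return the hosts a VM may be placed on under a VM-Host group rule."""
    preferred = [h for h in hosts if h in allowed]
    if mandatory:
        return preferred          # must rule: no fallback at all
    return preferred or hosts     # should rule: fall back if none fit

hosts = ["esx01", "esx02", "esx03"]
print(candidate_hosts(hosts, {"esx03"}, mandatory=True))   # ['esx03']
print(candidate_hosts(hosts, {"esx99"}, mandatory=True))   # [] -- blocked
print(candidate_hosts(hosts, {"esx99"}, mandatory=False))  # all hosts
```

The empty list in the second call is exactly why a must rule "limits HA, DRS and the user": if no allowed host is available, the VM simply cannot be placed.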


  • Storage I/O Control is not cluster aware; it is focused on storage
  • Enabled at the datastore level
  • Detects congestion and monitors average IO latency for a datastore
  • Latency above a particular threshold indicates congestion
  • SIOC throttles IOs once congestion is detected
  • Controls IOs issued per host
  • Based on VMs shares, reservations, and limits
  • SDRS runs every 8 hours and checks balance, and looks at previous 16 hours for 90th percentile
  • Capacity threshold per datastore
  • I/O metric threshold per datastore
  • Affinity rules are available
  • SDRS is now aware of storage capabilities through VASA 2.0 (array thin provisioning, dedupe, auto-tiering, snapshot)
  • SDRS integrated with SRM
  • Full support for vSphere Replication
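The SIOC throttling described above can be sketched in a few lines: detect congestion from average latency, then size each host's device queue in proportion to its VMs' shares. The threshold, share values, and queue depths are illustrative only:

```python
# Sketch of the SIOC idea: once average datastore latency crosses a
# congestion threshold, per-host queue depth is throttled in proportion
# to the shares of that host's VMs. All numbers are illustrative.

CONGESTION_THRESHOLD_MS = 30.0   # latency above this => congestion

def host_queue_depth(avg_latency_ms: float, host_shares: int,
                     total_shares: int, max_queue: int = 64) -> int:
    """Per-host device queue depth under SIOC-style throttling."""
    if avg_latency_ms <= CONGESTION_THRESHOLD_MS:
        return max_queue                     # no congestion: no throttle
    return max(1, int(max_queue * host_shares / total_shares))

print(host_queue_depth(12.0, 1000, 4000))  # 64: under threshold
print(host_queue_depth(45.0, 1000, 4000))  # 16: throttled by share ratio
```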


  • Migrate live VM to a new compute resource
  • vSphere 6.0: cross vCenter vMotion, long-distance vMotion, vMotion to cloud
  • You may not realize it, but there has been a lot of innovation and many new features here since vMotion’s introduction in 2003
  • Long-distance vMotion supports up to 150 ms RTT. No WAN acceleration needed.
  • vMotion anywhere: vMotion cross-vCenters, vMotion across hosts without shared storage, easily move VMs across DVS, folders and datacenters.

vSphere Network IO Control

  • Outbound QoS
  • Allows you to partition network resources
  • Uses resource pools to differentiate between traffic types (VM, NFS, vMotion, etc.)
  • Bandwidth allocation: Shares and reservations. NIOC v3 allows configuration of bandwidth requirements for individual VMs
  • DRS is aware of network reservations as well.
  • Bandwidth admission control in HA
  • Set reservations to guarantee minimum amount of bandwidth for performance of critical network traffic. Sparingly use VM level reservations.
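The shares-plus-reservations allocation above can be sketched for a single uplink: each traffic type first receives its reservation, then the leftover bandwidth is split by shares. Traffic types and numbers are illustrative:

```python
# Sketch of NIOC-style bandwidth allocation on one uplink: reservations
# are satisfied first, then remaining bandwidth is divided by shares.
# Pool names, reservations, and share values are illustrative only.

def allocate(link_gbps: float, pools: dict) -> dict:
    """pools: name -> (reservation_gbps, shares). Returns name -> gbps."""
    leftover = link_gbps - sum(r for r, _ in pools.values())
    total_shares = sum(s for _, s in pools.values())
    return {name: r + leftover * s / total_shares
            for name, (r, s) in pools.items()}

alloc = allocate(10.0, {"vm":      (2.0, 100),
                        "vmotion": (1.0, 50),
                        "nfs":     (1.0, 50)})
print(alloc)  # vm: 2 + 6*0.5 = 5.0; vmotion and nfs: 1 + 6*0.25 = 2.5 each
```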


VMworld 2014: vSphere HA Best Practices and FT Preview

Session BCO2701. This was very fast paced, and I missed jotting down a lot of the slide content. If you attended VMworld then I recommend you listen to the recording to get all of the good bits of information.

vSphere HA – what’s new in 5.5

  • VSAN Support
  • AppHA integration

What is HA? Protects against 3 failures:

  • Host failures and VM crashes
  • Host network isolation and datastore PDL
  • Guest OS hangs/crashes and application hangs/crashes

Best Practices for Networking and Storage

  • Redundant HA network
  • Fewest possible hops
  • Consistent portgroup names and network labels
  • Route based on originating port ID
  • Failback policy = no
  • Enable PortFast
  • MTU size the same

Networking Recommendations

  • Disable host monitoring if network maintenance is going on
  • vmknics for vSphere HA on separate subnets
  • Specify additional network isolation addresses
  • Each host can communicate with all other hosts

Storage Recommendations

  • Storage Heartbeats – All hosts in the cluster should see the same datastores

Best Practices for HA and VSAN

  • Heartbeat datastores are not necessary in a VSAN cluster
  • Add a non-VSAN datastore to cluster hosts if VM MAC address collisions on the VM network are a significant concern
  • Choose a datastore that is fault isolated from VSAN network
  • Isolation address – use the default gateways for the VSAN networks
  • Each VSAN network should be on a unique subnet

vSphere HA Admission Control

  • Select the appropriate admission control policy
  • Enable DRS to maximize likelihood that VM resource demands are met
  • Simulate failures to test and assess performance
  • Use the impact assessment fling
  • Percentage based is often the best choice but need to recalculate when hosts are added/removed

Tech Preview of FT

  • FT will support up to 4 vCPUs and 64GB of RAM per VM
  • FT now uses separate storage for the primary and secondary VMs
  • New FT method does not keep CPUs in lock step, but relies on super fast check pointing
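The checkpointing approach above can be modeled with a toy output-buffering class: the primary holds outbound packets until the secondary acknowledges the checkpoint that produced them, so a failover never exposes state the secondary lacks. This is a conceptual (Remus-style) model, not the actual vSphere FT protocol:

```python
# Toy model of checkpoint-based FT (as opposed to CPU lockstep): the
# primary buffers its network output and releases it only after the
# secondary acks the corresponding checkpoint. Conceptual sketch only.

class CheckpointFT:
    def __init__(self):
        self.buffered = []       # output held until the checkpoint ack
        self.released = []       # output safe to put on the wire

    def emit(self, packet: str):
        self.buffered.append(packet)

    def checkpoint_acked(self):
        """Secondary confirmed this epoch's state: release held output."""
        self.released.extend(self.buffered)
        self.buffered.clear()

ft = CheckpointFT()
ft.emit("pkt1"); ft.emit("pkt2")
print(ft.released)        # [] -- still held, checkpoint not acked yet
ft.checkpoint_acked()
print(ft.released)        # ['pkt1', 'pkt2']
```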

Tech Preview HA

  • VM Component protection for storage is a new feature
  • Problem: APD and PDL situations
  • Solution: Detects them and restarts affected VMs on unaffected hosts
  • Shows a GUI with options for what you want to protect against

Tech Preview of Admission control fling

  • Assesses the impact of losing a host
  • Provides sample scenarios to simulate