VMworld 2017: vSphere 6.5 Host Resources Deep Dive Pt. 2

Session: SER1872BU Frank Denneman, Niels Hagoort

Note: This was a highly technical session with lots of diagrams. Best bet is to get Frank and Niels' book for all the details.

Compute Architecture: Shows a picture of a two-NUMA-node server. Prior to Skylake, two DIMMs per memory channel were optimal. Skylake increased the number of memory channels per socket from four to six and supports a maximum of two DIMMs per channel.

QPI memory performance: 75ns latency for local memory access, but 132ns when accessing memory on the other NUMA node.

Quad-channel local memory access: 76GB/s of bandwidth. Remote access is noticeably slower.

vNUMA exposes the physical NUMA topology to a VM. vNUMA ‘kicks in’ when a VM has more than 8 vCPUs and the vCPU count exceeds the core count of a physical CPU package. ESXi will then split the vCPUs evenly across the two physical CPU packages.
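A toy Python sketch of that activation rule (the threshold logic only, not ESXi's actual scheduler code; the 8-vCPU threshold corresponds to the numa.vcpu.min default of 9):

```python
# Toy model of the vNUMA activation rule described above; not ESXi code.
# Assumes a two-package host, as in the session's example.
def vnuma_exposed(vcpus: int, cores_per_package: int) -> bool:
    """vNUMA kicks in above 8 vCPUs when the VM can't fit in one package."""
    return vcpus > 8 and vcpus > cores_per_package

def vcpu_layout(vcpus: int, cores_per_package: int) -> list:
    """ESXi splits the vCPUs evenly across the two packages."""
    if not vnuma_exposed(vcpus, cores_per_package):
        return [vcpus]  # guest sees a single NUMA node
    return [vcpus // 2, vcpus - vcpus // 2]

print(vcpu_layout(16, 10))  # 16 vCPUs, 10-core packages -> [8, 8]
```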

If you use virtual sockets, mimic the physical CPU package layout as much as possible. This allows the guest OS to optimally manage memory and cache.

“PreferHT” (numa.vcpu.preferHT) can be useful; see KB 2003582. It forces the NUMA scheduler to count hyperthreads as cores, keeping the VM within a single NUMA node. Use this setting when a VM is more memory-intensive than CPU-intensive.

What if the vCPUs fit in a single socket, but the VM's memory cannot? numa.consolidate=FALSE can be useful; it tells the NUMA scheduler to spread the VM across NUMA nodes instead of consolidating it (applied in the sketch below).
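A minimal pyVmomi sketch of applying these two per-VM advanced settings; the vCenter address, credentials, and VM name are placeholders, and the keys/values should be verified against KB 2003582 for your environment:

```python
# Sketch: set numa.vcpu.preferHT and numa.consolidate on a VM via pyVmomi.
# Hypothetical vCenter, credentials, and VM name; the settings take effect
# at the next power-on.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab use only
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ctx)

content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "hana-vm01")  # hypothetical VM

spec = vim.vm.ConfigSpec(extraConfig=[
    # Count hyperthreads as cores so the VM stays inside one NUMA node.
    vim.option.OptionValue(key="numa.vcpu.preferHT", value="TRUE"),
    # Spread the VM across NUMA nodes when its memory exceeds one node.
    vim.option.OptionValue(key="numa.consolidate", value="FALSE"),
])
vm.ReconfigVM_Task(spec=spec)
Disconnect(si)
```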

One AHCI storage I/O needs 27K CPU cycles. If you want a VM to do 1M IOPS, you need 27GHz of CPU power.

One NVMe I/O needs 9.1K CPU cycles, vastly less than AHCI.
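A quick back-of-the-envelope check of those numbers, using the per-I/O cycle counts quoted in the session:

```python
# Cycles per I/O times IOPS gives the CPU frequency budget consumed.
def ghz_needed(cycles_per_io: float, iops: float) -> float:
    return cycles_per_io * iops / 1e9  # cycles/sec -> GHz

print(ghz_needed(27_000, 1_000_000))  # AHCI: 27.0 GHz
print(ghz_needed(9_100, 1_000_000))   # NVMe:  9.1 GHz
```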

3D XPoint can reach maximum I/O performance at a very low queue depth, which makes it quite useful as a caching tier in vSAN.

CPU Utilization vs. Latency

Is the workload latency sensitive? If no, tune the CPU for power savings. If yes, tune for the lowest latency. SAP HANA, for example, could benefit from low latency.
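The host power policy is one of the first knobs for latency-sensitive workloads. A sketch, assuming the Power.CpuPolicy host advanced option and placeholder credentials, of setting it to High Performance with pyVmomi:

```python
# Sketch: set an ESXi host's power policy to High Performance via the
# Power.CpuPolicy advanced option. Host/credentials are placeholders;
# verify the option key and accepted values on your ESXi build first.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab use only
si = SmartConnect(host="esxi01.lab.local", user="root", pwd="secret",
                  sslContext=ctx)

content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.HostSystem], True)
host = view.view[0]  # the single host when connected directly to ESXi

host.configManager.advancedOption.UpdateOptions(changedValue=[
    vim.option.OptionValue(key="Power.CpuPolicy", value="High Performance")
])
Disconnect(si)
```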

Interrupt coalescing is enabled by default on all modern NICs; it saves CPU but can increase packet latency. You can increase the ring buffer sizes (see KB 2039495), which can help with dropped packets.

Polling vs. interrupts

A poll-mode driver (as used by DPDK) can optimize network I/O performance.

Low CPU utilization = higher latency
Higher CPU utilization = lower latency (the poll-mode tradeoff; see the sketch below)
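A conceptual Python sketch of the two receive models (a toy producer/consumer contrast, not DPDK code; rx_queue and handle_packet are hypothetical placeholders):

```python
# Interrupt-driven vs. poll-mode packet handling, structurally.
import queue

rx_queue: queue.Queue = queue.Queue()  # stand-in for a NIC receive queue

def handle_packet(pkt) -> None:
    pass  # placeholder for real packet processing

def interrupt_driven() -> None:
    # Block until "the NIC" signals work: near-zero CPU while idle, but
    # every wakeup (interrupt + context switch) adds per-packet latency.
    while True:
        handle_packet(rx_queue.get())  # sleeps until an item arrives

def poll_mode() -> None:
    # DPDK-style polling: spin on the queue, pinning a core at 100%,
    # but each packet is picked up with no wakeup delay.
    while True:
        try:
            handle_packet(rx_queue.get_nowait())
        except queue.Empty:
            continue  # keep spinning; CPU stays busy even when idle
```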

vSphere 6.5 has vRDMA (paravirtual RDMA), which can significantly boost network throughput.
