VMworld 2017: What’s new in storage

Session: SER1317BU

Faster storage needs faster networks
-10/25/40Gbps NICs are now the norm
-The protocol stack needs to be upgraded with new storage protocols
-All-flash array (AFA) performance depends on a low-latency network connection
-32Gbps FC is shipping, 64Gbps is coming

NVMe – A logical device interface to NVM devices
-PCIe is the physical interface and NVMe is the protocol
-up to 64K queues and 64K queue depth
-All major OSes support NVMe

NVMe over Fabrics
-Allows attaching large numbers of NVMe drives as external storage
-Aims for no more than 10 microseconds of added latency compared to local NVMe
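To put the queue numbers above in perspective, here is a back-of-the-envelope comparison (illustrative only) of NVMe's command parallelism against legacy AHCI/SATA, which has a single queue of 32 commands:

```python
# Illustrative arithmetic only. Limits are from the NVMe spec
# (up to 64K queues x 64K commands each) and the AHCI spec
# (1 queue, 32 outstanding commands).
nvme_queues, nvme_depth = 64 * 1024, 64 * 1024
ahci_queues, ahci_depth = 1, 32

nvme_outstanding = nvme_queues * nvme_depth
ahci_outstanding = ahci_queues * ahci_depth

print(f"NVMe max outstanding commands: {nvme_outstanding:,}")
print(f"AHCI max outstanding commands: {ahci_outstanding:,}")
```

The gap in available parallelism is why the protocol stack, not just the media, matters for all-flash performance.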

vSphere 6.5 Features
-VMFS 6.0
*meta data is 4K aligned
*supports 64-bit addressing
*NO in-place upgrade from VMFS 5.0. See KB 2147824

Automatic Unmap in 6.5
-Automatic unmap support when VM is deleted/migrated
-Space reclamation requests from guest OSes that support UNMAP
-Only automatic unmap on arrays with UNMAP granularity LESS than 1MB
-Background impact is minimal: set to 25MB/sec max
-Future: Possibly throttle/accelerate UNMAP rate based on array load
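The 25MB/sec cap above implies background reclamation is deliberately slow. A rough sketch (my arithmetic, not a VMware tool) of how long reclaiming dead space takes at that rate:

```python
# Back-of-the-envelope: hours to reclaim dead space at the session's
# stated 25MB/sec background UNMAP throttle. Illustrative only.
def unmap_hours(space_gb: float, rate_mb_s: float = 25.0) -> float:
    """Hours to reclaim `space_gb` of dead space at `rate_mb_s`."""
    return (space_gb * 1024) / rate_mb_s / 3600

# 1 TB of dead space takes roughly half a day at the default cap
print(f"1 TB of dead space: ~{unmap_hours(1024):.1f} hours")
```

This is why the minimal background impact claim holds, and also why a future load-based throttle/accelerate mechanism would be useful.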

High Capacity Drives in 6.5
-Supports 512e drives
-Requires VMFS 6
-vSphere 6.0 supports physical mode RDMs mapped to 512e drives
-FAQ: KB2091600

New Scale Limits
-512 LUNs & 2048 paths
-If using 8 paths per LUN, you can now have 256 LUNs
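The interaction between the two limits is worth spelling out: the usable LUN count is bounded by whichever limit you hit first. A quick sanity check of the numbers above:

```python
# Sanity check on the vSphere 6.5 limits quoted above:
# 512 LUNs and 2048 paths per host.
MAX_LUNS, MAX_PATHS = 512, 2048

def usable_luns(paths_per_lun: int) -> int:
    """LUNs you can actually present, given a uniform path count per LUN."""
    return min(MAX_LUNS, MAX_PATHS // paths_per_lun)

print(usable_luns(8))  # 8 paths/LUN -> path-limited to 256 LUNs
print(usable_luns(4))  # 4 paths/LUN -> the full 512 LUNs
```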

NFS 4.1 Plug-in and strong crypto & HW acceleration support
-NFS 4.1 supported since 6.0
-HW acceleration (VAAI) now supported
-Stronger crypto with AES
-Supports IPv6
-Better security with NFS 4.1

Virtual NVMe in 6.5
-NVMe 1.0 device emulation
-Hot add/remove
-Multi-Q support – 16 queues with 4K depth

VMworld 2017: Advanced ESXi troubleshooting

Session: SER2965BU

Note: This session had a number of log examples and what to look for. Review the session slides for all the details. EXCELLENT session!

Goal: 7 log files, 7 ESXi commands, 7 config files for enhanced troubleshooting

7 Important Log Files

-Host abruptly rebooted – vmksummary.log
-Slow boot issues – /var/log/boot.gz. You can also enable serial logging (Shift+O)
-ESXi not responding – hostd.log & hostd-probe.log
-VM issues – vmware.log
-Storage issues – vmkernel.log
-Network and storage issues – vobd.log
-HA issues – fdm.log   /opt/vmware/fdm/prettyprint.sh hostlist | less
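When working through the logs above, it often helps to pull out just the warning-level entries. A hypothetical sketch (not from the session; the sample log lines below are made up for illustration):

```python
# Hypothetical helper, not a VMware tool: filter warning-level entries
# out of a vmkernel.log-style text. The sample lines are invented.
import re

sample_log = """\
2017-08-29T10:15:02Z cpu4:33289)ScsiDeviceIO: WARNING: slow I/O on naa.600508...
2017-08-29T10:15:03Z cpu1:32801)NMP: nmp_ThrottleLogForDevice: status ok
2017-08-29T10:15:04Z cpu4:33289)ScsiDeviceIO: WARNING: slow I/O on naa.600508...
"""

warnings = [line for line in sample_log.splitlines() if re.search(r"WARNING", line)]
print(f"{len(warnings)} warning lines found")
```

The same filter-first approach applies to hostd.log, vobd.log, and fdm.log before diving into individual entries.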

7 ESXi Commands
-Monitor & configure ESXi – esxcli
-VMkernel sysinfo shell command – vsish get /bios; /hardwareinfo;

-Manage ESXi & VM config – vim-cmd

-VMFS volumes & virtual disks – vmkfstools
-Detailed memory stats – memstats
-network packet capture – pktcap-uw
-monitoring – esxtop

7 Configuration Files
-/etc/vmware/esx.conf – storage, networking, HW info
-/etc/vmware/hostd/vminventory.xml – VM inventory
-/etc/vmware/hostd/authorization.xml – vCenter to ESXi host connection
-/etc/vmware/vpxa/vpxa.cfg – vCenter and ESXi connectivity
-/etc/vmware/vmkiscsid/iscsi.conf – iSCSI configuration
-/etc/vmware/fdm – HA config
-/etc/vmware/license.cfg – license configuration

VMworld 2017: DR with VMware on AWS

Session: MMC2455BU, GS Khalsa

Legacy (physical) DR solutions are not adequate – Long RTOs, lots of surprises, unreliable
vSphere is an enabler for DR – consolidation, hardware independence, encapsulation (VM is a file)

Long distance DR solutions with async replication
-bi-directional failover
-Shared site recovery

Metro DR Solutions with sync replication
-Availability – Zero RPO/RTO
-Mobility – active/active datacenters
-Disaster avoidance

DR to the cloud with AWS
-Co-located DR costs are high
-DR to the cloud is less expensive

VMware Cloud on AWS
-Managed SDDC stack running on AWS
-Consistent operational model enables hybrid cloud
-Leverage cloud economics
-Goals of DR: Deliver as a service, build on VMware (SRM, vSphere replication, etc.)
-Working on flexible SRM pairing – Decouple on-site upgrade from VMC/AWS
-Loosening version dependencies across vCenter, SRM & vSphere Replication releases
-Working on major UI improvements – HTML5 and “clarity” UI standard
NEW: SRM appliance based on Photon OS

GS then shows a number of video demos showing the full SRM configuration, setup, and failover process. Anyone familiar with SRM will be accustomed to the same workflow, but with a nice new coat of paint on the GUI.




VMworld 2017: PowerCLI What’s New

Session: SER2529BU Alan Renouf

PowerCLI Overview
-623 cmdlets and counting
-PowerCLI is 10 years old
-Name change – VMware PowerCLI
-Move-VM now includes cross vCenter vMotion
-Automate everything with VSAN
-Independent disk management cmdlets – new-vdisk, get-vdisk, copy-vdisk, move-vdisk
-VVOL replication cmdlets
-New Horizon View module
-SPBM cmdlets
-More inventory parameters
-DRS cluster groups and VM/host rule cmdlets

Install: Install-Module VMware.PowerCLI

Release Frequency
-Fewer features per release, but shipped more often
-Less waiting on bug fixes
-Focused on your input

PowerCLI 6.5.2
-New ‘InventoryLocation’ parameter – Move-VM, Import-VApp, New-VApp
-Mount a content library ISO with New-CDDrive
-Fixes and enhancements

Multiplatform Fling
-Photon OS, Mac OS, Linux, Docker

VMware Cloud on AWS?
-Works exactly the same as on-site vCenter

Endless Possibilities
-Content library – more cmdlets to come
-Parameter auto-complete
-vSphere REST API high-level cmdlets
-Powershell DSC (desired state config) – Chef, Puppet, Ansible, Saltstack
-New vSphere Client and Rest API support for Onyx (automated code generator)
-PowerCLI multiplatform 6.0

Community Projects
(FREE) OpBot – Connects vCenter to Slack. Download: http://try.opvizor.com/opbot
(NEW!!) PowerCLI Feature request page: https://vmwa.re/powercli


VMworld 2017: Architecting Horizon 7 & Apps

Session: ADV1588BU

Note: This session had a multitude of complex architecture diagrams which I did not capture. See the session slide deck, after VMworld, for all the details.

Why? – Business objectives/drivers
How? – Meet requirements
What? – Design and build
Deliver – Build and integrate
Validate – Were requirements met?

Design Steps

  1. Business drivers & use case definition
  2. Services definition
  3. Architecture principles and concept
  4. Horizon 7 component design
  5. vSphere 6 design
  6. Physical environment design
  7. Services integration
  8. User experience design

Use a repeatable model when scaling up.


Physical Environment Considerations

Identity Management

Profiles and User Data
-Folder redirection
-Mandatory profile
-User environment manager

App Volumes
-AppStack replication
-Single site or multiple sites
-Use writable volumes very sparingly

VMware Horizon Apps

Speaker goes over a highly detailed reference architecture with lots of complex slides. And he goes over the LoginVSI setup, both hardware and software.

VMworld 2017: vSphere 6.5 Host Resources Deep Dive Pt. 2

Session: SER1872BU Frank Denneman, Niels Hagoort

Note: This was a highly technical session with lots of diagrams. Best bet is to get Frank and Niels’ book for all the details.

Compute Architecture: Shows a picture of a two NUMA node server. Prior to Skylake processors, two DIMMs per memory channel were optimal. Skylake processors increased the number of memory channels and have a maximum of 2 DIMMs per channel.

QPI Memory performance: 75ns local latency, but 132ns latency to other NUMA node

Quad channel local memory access: 76GB/s. Remote access will be noticeably slower.
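The latency figures above translate into a substantial penalty for crossing the interconnect. Spelling out the arithmetic (illustrative only, using the session's numbers):

```python
# Remote NUMA access penalty, using the session's QPI latency figures:
# 75ns local vs. 132ns to the other NUMA node.
local_ns, remote_ns = 75, 132
penalty = (remote_ns - local_ns) / local_ns * 100
print(f"Remote NUMA access penalty: {penalty:.0f}%")
```

Roughly three-quarters more latency per access is why NUMA locality matters so much for memory-bound VMs.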

vNUMA exposes the physical NUMA architecture to a VM. vNUMA ‘kicks in’ when a VM has more than 8 vCPUs and its vCPU count exceeds the core count of a single physical CPU package. ESXi will then evenly split the vCPUs across the two physical CPU packages.

If you use virtual sockets, mimic the physical CPU package layout as much as possible. This allows the OS to optimally manage memory and the cache.
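The vNUMA rule described above can be sketched as follows. This is a deliberate simplification (the real ESXi scheduler considers far more factors), assuming a two-socket host:

```python
# Simplified sketch of the vNUMA activation rule described above.
# Assumes a two-socket host; the real ESXi NUMA scheduler is far
# more sophisticated.
def vnuma_layout(vcpus: int, cores_per_package: int) -> list:
    """Return vCPUs per virtual NUMA node, or a single node if vNUMA stays off."""
    if vcpus > 8 and vcpus > cores_per_package:
        # Split evenly across the two physical CPU packages
        return [vcpus // 2, vcpus - vcpus // 2]
    return [vcpus]

print(vnuma_layout(16, 10))  # exceeds one 10-core package -> split [8, 8]
print(vnuma_layout(8, 10))   # 8 vCPUs or fewer -> vNUMA stays off: [8]
```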

“PreferHT” can be useful, see KB 2003582. This forces the NUMA scheduler to count hyperthreads as cores. Use this setting when a VM is more memory intensive vs. CPU intensive.

What if the vCPUs can fit in a socket, but VM memory cannot? numa.consolidate=FALSE can be useful.

One AHCI storage I/O needs 27K CPU cycles. If you want a VM to do 1M IOPS, you need 27GHz of CPU power.

One NVMe I/O needs 9.1K CPU cycles, vastly less than AHCI storage.
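The session's arithmetic, spelled out: cycles-per-IO translates directly into the CPU clock budget needed for a target IOPS rate.

```python
# Cycles-per-IO -> GHz of CPU needed for a target IOPS rate,
# using the cycle counts quoted in the session.
def ghz_needed(iops: int, cycles_per_io: float) -> float:
    return iops * cycles_per_io / 1e9

print(f"AHCI, 1M IOPS: {ghz_needed(1_000_000, 27_000):.1f} GHz")
print(f"NVMe, 1M IOPS: {ghz_needed(1_000_000, 9_100):.1f} GHz")
```

The roughly 3x reduction in CPU cost per I/O is the practical argument for moving the protocol stack to NVMe, not just the media.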

3D XPoint can reach maximum I/O performance at a very low queue depth. This makes it quite useful as a caching tier in vSAN.

CPU Utilization vs. Latency

Workload latency sensitive? No, then tune CPU for power savings. Yes, then tune for lowest latency. SAP HANA, for example, could benefit from low latency.

Interrupt coalescing is enabled by default on all modern NICs. This can increase packet latency. You can increase ring buffers per KB 2039495, which can help with dropped packets.

Polling vs. interrupts

Pollmode driver (DPDK) can optimize network I/O performance.

Low CPU utilization = higher latency
Higher CPU utilization = lower latency

vSphere 6.5 has vRDMA, which can significantly boost network throughput.

VMworld 2017: Virtualizing AD

Session: VIRT1374BU: Matt Liebowitz

AD Replication
-Update sequence numbers (USNs) track updates on each DC
-InvocationID – identifies a DC’s instance of the AD database
-USN + InvocationID = a globally unique, replicable transaction

Why Virtualize AD?
-Fully supported by Microsoft
-AD is friendly towards virtualization (low I/O, low resource)
-Physical DCs waste resources

Common objections to virtualizing DCs
-Fear of stolen vmdk
-Privilege escalation – VC admins do not need to be domain admins and vice versa
-Must keep xx role physical – no technical or support reason. Myth
-Timekeeping is hard in VMs

Time Sync
-VM guests will get their time reset on vMotion and when resuming from suspend. If an ESXi host has a bad time/date, it can cause weird “random” problems when DRS moves DCs around.
-There’s a set of ~8 advanced VMX settings to totally disable time sync between the guest and the ESXi host. Recommended for AD servers.

Virtual machine security and Encryption
-vSphere supports VMDK encryption
-Virtualization based security – WS2016 feature – supported in future vSphere version

Best Practices

Domain Controller Sizing

USN Rollback
-Happens when a DC is sent back in time (e.g. snapshot rollback)
-DCs can get orphaned if this happens since replication is broken
-If this happens, it’s a support call to MS and a very long, long process to fix it

VM Generation ID
-A way for the hypervisor to expose a 128-bit generation ID to the VM guest
-Need vSphere 5.0 U2 or later
-Active Directory tracks this number and prevents USN rollback
-Can be used for safety and VM cloning
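A simplified sketch (not Microsoft's actual implementation) of how VM Generation ID prevents USN rollback: the DC stores the last generation ID it saw, and if the hypervisor now reports a different one (e.g. after a snapshot revert), the DC takes a fresh InvocationID so stale USNs are never replicated:

```python
# Simplified sketch, not Microsoft's implementation: VM Generation ID
# comparison as a guard against USN rollback.
import uuid

def check_generation(stored_id, current_id, invocation_id):
    """Return (no_rollback_detected, invocation_id_to_use)."""
    if stored_id != current_id:
        # Snapshot revert detected: take a new InvocationID so old
        # USNs from the pre-revert state are never reused
        return False, uuid.uuid4()
    return True, invocation_id

gen = uuid.uuid4()
inv = uuid.uuid4()

ok, same_inv = check_generation(gen, gen, inv)             # IDs match: no rollback
bad, fresh_inv = check_generation(gen, uuid.uuid4(), inv)  # revert detected
print(ok, bad, same_inv == inv, fresh_inv != inv)
```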

Domain Controller Cloning
-Microsoft has an established process to do this, using hypervisor snapshots.
-Do NOT hot clone your DCs! Totally unsupported and will cause a huge mess.

VMworld 2017: Extreme Performance

Session: SER2724BU

The Performance Best Practices Guide for vSphere 6.5 is now out. Download it now!

Baseline best practices
-Use the most current release
-HW selection makes a difference
-Refer to best practice guides
-Evaluate power management
-Rightsize your workloads
-Keep hyperthreading enabled
-Use DRS to manage contention
-Do NOT use resource pools – more harm than good
-Monitor oversubscription
-Use paravirtualized drivers

-Compute: Contention – CPU ready, co-stop
-Memory: Oversubscription – balloon, swap
-Storage: Service time – device and kernel latency

-Poor NUMA locality (N%L)
-pNUMA does not match vNUMA
-VM config should match physical topology (don’t make wide VMs)
-Don’t create a VM with more vCPUs than the host has physical cores

Keep things up to date
-Virtual hardware can make a performance difference
-38 changes were made in vHW 11 alone
-Use latest vHW

Power Management
-New in 6.5: %A/MPERF in esxtop to see power management. Over 100% means turbo mode.
-“Balanced” mode allows turbo mode
-Always set BIOS to “OS controlled”
-High Performance caps turbo opportunity – good for large VMs – required for latency-sensitive workloads
-“High performance” mode should be used for benchmarking since it produces the most stable results

-Approximately 25% more performance
-The latest processors may deliver higher performance
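The %A/MPERF metric mentioned above is the ratio of actual to nominal clock cycles, so values over 100% indicate turbo mode. An illustrative computation (made-up cycle counts):

```python
# Illustration of the %A/MPERF metric: APERF counts actual cycles,
# MPERF counts cycles at the nominal frequency. The sample values
# below are invented for illustration.
def aperf_mperf_pct(aperf_cycles: int, mperf_cycles: int) -> float:
    return aperf_cycles / mperf_cycles * 100

print(f"{aperf_mperf_pct(2_600, 2_000):.0f}%")  # above nominal: turbo mode
print(f"{aperf_mperf_pct(1_200, 2_000):.0f}%")  # below nominal: power saving
```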

VMworld 2017: vSphere SSO Architecture

Session: SER2940BU. Speakers: Emad Younis, Adam Eckerle

Embedded PSC: Totally supported for production usage. It’s not just test/dev. Use this model if you don’t need enhanced linked mode. This is a simple model, and use it if it supports your needs.

External PSC: Allows linking of vCenters via linked mode. Tags, roles, global permissions, licensing all replicate throughout the entire SSO domain. Up to 15 vCenters can point to a single PSC in 6.5 U1. Not recommended, but you can do it.

In vSphere 5.5 you can consolidate SSO domains, so consolidate BEFORE you deploy any 6.x versions. After you deploy any 6.x component, you are locked into your SSO domains. If doing this merge, make sure you uninstall/remove the embedded SSO component before you upgrade to vSphere 6.x.

Within an SSO domain, you can’t mix versions of products. So if you have islands of vCenters, you may NOT want them linked together. This will require that you upgrade everything together. Very applicable to vBlock environments and their islands of vCenters.

A site is a logical grouping of PSCs. PSCs are multi-master and replicate every 30 seconds.

Recommendation: If you have multiple PSCs spread across multiple sites, you can optionally use “vdcrepadmin” to add more replication agreements. Do NOT add just for the sake of adding. Only add agreements if absolutely needed.

In vSphere 6.5 you can only repoint a vCenter intra-site to another PSC (not across sites); refer to “cmsso-util”. Cross-site repointing is not allowed due to the added latency and resulting performance issues.

VMware recommends a max of 100ms between PSCs in the “same” logical site. VMware will support all PSCs in the same site, but it’s not recommended. VMware does not want vCenters talking to remote PSCs.

There’s no current method to migrate from a Windows vCenter with an external PSC to the VCSA with an embedded PSC. VMware said this scenario may become possible in the future.

You can NOT move a vCenter from one SSO domain to another (today).

Built-in SSO load balancing may come in a future vSphere release, with no third-party LB needed, such as F5 or NetScaler.

If you globally want to deploy multiple vCenters, don’t do a global SSO domain. It can be a disaster. Setup regional SSO domains for best performance.

VMworld 2017: Predictive DRS Best Practices

Session: SER2849BU

Case 1: VM performance can suffer due to resource constraints/surges

Case 2: Inefficient usage of resources due to reserving capacity for peak loads.

-Reactive: Move VMs after contention occurs
-Static: Reserve more resources up front for peak loads
-Predictive: Learn the workload pattern and move VMs before they spike

What is the best solution? Predictive DRS

What is Predictive DRS?
-DRS enabled with predictions
-DRS scheduling + vROPs analytics

How does it work?
-Resource usage from vCenter
-vROPs consumes the data
-Predictions are made
-DRS invoked to perform optimizations

vROps Dynamic Thresholds (DT)
-Sophisticated analytics – 10 algorithms
-Learns normal behavior
-Detects hourly, daily, monthly patterns
-Generates upper and lower dynamic thresholds
-Predictions are then sent to vCenter
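A toy sketch of the dynamic-threshold idea above. vROps' real analytics combine roughly 10 algorithms; this just uses mean ± 2 standard deviations per hour-of-day bucket on made-up data:

```python
# Toy sketch only: vROps' actual dynamic thresholds combine ~10
# algorithms. This illustrates the basic idea with mean +/- 2 stddev
# for a single hour-of-day bucket. Sample data is invented.
from statistics import mean, stdev

def dynamic_thresholds(samples):
    """Return (lower, upper) thresholds for one hour-of-day bucket."""
    m, s = mean(samples), stdev(samples)
    return m - 2 * s, m + 2 * s

cpu_9am = [40.0, 42.0, 41.0, 43.0, 39.0]  # % CPU seen at 9am over past days
lo, hi = dynamic_thresholds(cpu_9am)
print(f"expected 9am range: {lo:.1f}%..{hi:.1f}%")
```

A reading outside the learned band is what signals "abnormal" behavior; the predictions sent to vCenter extrapolate the learned pattern forward.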

Software Requirements
-vSphere 6.5 Enterprise Plus
-vROps 6.4 or 6.5
-Clock skew between vCenter and vROps must be less than 5 minutes

Speaker shows a demo of a ‘follow the sun’ scenario with workloads spiking at different times in a regular pattern. pDRS learned the pattern and vMotioned VMs to make sure they had enough resources. He shows a performance graph where pDRS headed off performance issues, resulting in consistent VM performance.

DPM with Predictions
Speaker asks audience to raise hands if anyone is using DPM. Two people raise their hands.
-Predictions can proactively power up ESXi hosts to absorb the workload demand

-Workloads it can predict: Periodic usage pattern
-Short spikes of a few minutes will not be predicted
-The more consistent the workload, the more accurate it will be

Learning Period
-Set to 14 days by default
-The longer the period, the better the accuracy
-Predictions only happen after 14 days

-Compute dynamic thresholds – Calculated once a day, or push a button to force a new calculation.
-Lookahead interval – Amount of time DRS looks ahead while accounting for predictions – default is 1 hour

Identifying vMotions due to Predictions
-Not a clear answer, as there can be a mix of VMs with predictions and those without
-pDRS moves appear only in the logs