vSphere 5.0 Storage Improvements

If you are a regular follower of my blog, you have probably noticed I’m a bit of a storage geek. VAAI, FCoE, WWNs, WWPNs, VMFS, VASA and iSCSI are all music to my ears. So what’s new in vSphere 5.0 storage technologies? A LOT. That team must have been working overtime to come up with all these great new features. Here’s a list of the high level new features, gleaned from a great VMware whitepaper that I link to at the end of this post.

VMFS 5.0

  • 64TB LUN support (with NO extents), great for arrays that support large LUNs like 3PAR.
  • Partition table automatically migrated from MBR to GPT, non-disruptively when grown above 2TB.
  • Unified block size of 1MB. No more wondering what block size to use. Note that upgraded volumes retain their previous block size, so you may want to reformat old LUNs that don’t use 1MB blocks. I use 8MB blocks, so I’ll need to reformat all of my volumes.
  • Non-disruptive upgrade from VMFS-3 to VMFS-5
  • Up to 30,000 8K sub-blocks for files such as VMX and logs
  • New partitions will be aligned on sector 2048
  • Passthru RDMs can be expanded to more than 60TB
  • Non-passthru RDMs are still limited to 2TB – 512 bytes

There are some legacy hold-overs if you upgrade a VMFS-3 volume to VMFS 5.0, so if at all possible I would create fresh VMFS-5 volumes so you get all of the benefits and optimizations. This can be done non-disruptively with storage vMotion, of course. VMDK files still have a maximum size of 2TB minus 512 bytes. And you are still limited to 256 LUNs per ESXi 5.0 host.
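
If you want to check what you’ve already got before deciding, here’s a rough sketch from the ESXi shell (the datastore name and device path are just placeholders for your own):

vmkfstools -Ph /vmfs/volumes/Datastore01     # reports the VMFS version and file block size
vmkfstools -C vmfs5 -S Datastore01 /vmfs/devices/disks/naa.xxxxxxxx:1     # reformat as a fresh VMFS-5 volume (destroys existing data, so storage vMotion the VMs off first)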

Storage DRS

  • Provides smart placement of VMs based on I/O and space capacity.
  • A new concept of a datastore cluster in vCenter aggregates datastores into a single unit of consumption for the administrator.
  • Storage DRS makes initial placement recommendations and ongoing balancing recommendations, just like it does for compute and memory resources.
  • You can configure storage DRS thresholds for utilized space, I/O latency and I/O imbalances.
  • I/O loads are evaluated every 8 hours by default.
  • You can put a datastore in maintenance mode, which evacuates all VMs from that datastore to the remaining datastores in the datastore cluster.
  • Storage DRS works on VMFS and NFS datastores, but they must be in separate clusters.
  • Affinity rules can be created for VMDK affinity, VMDK anti-affinity and VM anti-affinity.

Profile-Driven Storage

  • Allows you to match storage SLA requirements of VMs to the right datastore, based on discovered properties of the storage array LUNs via Storage APIs.
  • You define storage tiers that can be requested as part of a VM profile. So during the VM provisioning process you are only presented with storage options that match the defined profile requirements.
  • Supports NFS, iSCSI, and FC
  • You can tag storage with a description (e.g. RAID-5 SAS, remote replication)
  • Use storage characteristics or admin-defined descriptions to set up VM placement rules
  • Compliance checking

Fibre Channel over Ethernet Software Initiator

  • Requires a network adapter that supports FCoE offload (currently only the Intel X520)
  • Otherwise very similar to the iSCSI software initiator in concept

iSCSI Initiator Enhancements

  • Properly configuring iSCSI in vSphere 4.0 was not as simple as a few clicks in the GUI. You had to resort to the command line to properly bind the VMkernel NICs for multi-pathing (the old commands are sketched below for comparison). No more! vSphere 5.0 provides full GUI configuration of iSCSI network parameters and port bindings.
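
For the curious, here’s a rough sketch of what the old 4.x binding dance looked like from the CLI (vmk1 and vmhba33 are placeholders; substitute your own VMkernel port and software iSCSI adapter):

esxcli swiscsi nic add -n vmk1 -d vmhba33     # bind a VMkernel port to the software iSCSI adapter
esxcli swiscsi nic list -d vmhba33     # verify the bindings took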

Storage I/O Control

  • Extended to NFS datastores (VMFS only in 4.x).
  • Complete coverage of all datastore types, for higher assurance that VMs won’t hog storage resources

VAAI “v2”

  • Thin provisioning dead space reclamation. Informs the array when a file is deleted or moved, so the array can free the associated blocks. Complements storage DRS and storage vMotion.
  • Thin provisioning out-of-space handling. Monitors space usage and raises an alarm if physical disk space is running low. If space does run out, a VM can be stunned, migrated to another datastore, and then resumed without a VM failure. Note: This was supposed to be in vSphere 4.1 but was ditched because not all array vendors implemented it.
  • Full file clone for NFS, enabling the NAS device to perform the disk copy internally.
  • Enables the creation of thick disk on NFS datastores. Previously they were always thin.
  • VAAI vendor-specific plug-ins are no longer needed, since VMware enhanced its T10 standards support.
  • More use of the vSphere 4.1 VAAI “ATS” (atomic test and set) command throughout the VMFS filesystem for improved performance.

I’m excited about the dead space reclamation feature, however, there’s no mention of a tie-in with the guest operating system. So if Windows deletes a 100GB file, the VMFS datastore doesn’t know it, and the storage array won’t know it either so the blocks remain allocated. You still need to use a program like sdelete to zeroize the blocks so the array knows they are no longer needed. You can check out even more geeky details at Chad Sakac’s blog here.
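
If you do go the sdelete route, it runs inside the Windows guest against the volume where you deleted the data. A quick sketch (the drive letter is a placeholder, and the zero-free-space switch has moved between sdelete versions, so check sdelete’s help output first):

sdelete.exe -z d:     # zero free space on D: (older sdelete builds use -c for the same thing)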

Hopefully VMware can work with Microsoft and other OS vendors to add that final missing piece of the puzzle for complete end-to-end thin disk awareness. Basically the SATA “TRIM” command for the enterprise. Maybe Windows Server 2012 will have such a feature that VMware can leverage.

Storage vMotion

  • Supports the migration of VMs with snapshots and linked clones.
  • A new ‘mirror mode’ enables a one-pass block copy of the VM. Writes that occur during the migration are mirrored to both datastores before being acknowledged to the guest OS.

If you want to read more in-depth explanations of these new features, you can read the excellent “What’s New in VMware vSphere 5.0 – Storage” by Duncan Epping here.

VMware vSphere 5.0 Announced!

In case you were living under a rock today, or don’t have lots of RSS subscriptions for virtualization blogs, you may not have heard that VMware announced their vSphere 5.0 product today. Although not shipping until late in Q3 of 2011, the cat is now out of the bag and technical details are abundant. This is a huge release with hundreds of new features and tweaks, so I’m sure the blogosphere will be crammed with great details over the coming months.

VMware had an online virtual product release with several webinars, live Twitter feeds and live Q&A. So in a series of posts I’ll cover some of the very high level new features, so you get a feel for the magnitude of the updates and hopefully get interested in reading more on your own.

A few of the major feature enhancements include:

  • Exclusive use of the ESXi hypervisor. No more ESX.
  • Auto Deploy. Uses host profiles to provide stateless hosts with no local storage. Enables you to rapidly provision new servers and centralize patch management. You no longer really patch servers; you reboot the server and it downloads a whole new image.
  • Storage DRS. Tiered storage based on performance characteristics. Load balance VMs based on I/O profile and align with SLAs. You can put a datastore in maintenance mode and all VMs will be vMotioned to other datastores.
  • Added support for NFS storage I/O control (previously limited to block storage)
  • Per-VM network I/O controls, to help eliminate noisy neighbors.
  • VMs can now support 3D graphics
  • Supports client-connected USB devices
  • Support for USB 3.0
  • Supports smartcard readers
  • Mac OS X server support
  • Virtual hardware version has been increased to v8, which adds an EFI virtual BIOS
  • VM limits increased to 32 vCPUs, 1TB RAM, support 1,000,000 IOPS, >36Gb/s network throughput
  • Brand new HA architecture that supports larger clusters, is simpler to set up, and is more reliable.
  • vCenter appliance running on Linux. Only supports Oracle DBs. Didn’t VMware learn from vCloud? Not as full featured as the Windows version.
  • Brand new web client to manage vSphere from anywhere.
  • Networking now supports NetFlow, SPAN and LLDP
  • ESXi now has a built-in firewall
  • VMFS version increased to 5.0 (online non-disruptive update from prior versions)
  • VMFS support for datastores up to 64TB without using extents
  • VAAI v2
  • Software FCoE initiator
  • vMotion support for higher latency links (up to 10ms)
  • Dropped the “Advanced” licensing SKU
  • Licensing is now based on CPU sockets AND vRAM (see my licensing post here). No more core/memory limitations.
  • vCenter Heartbeat 6.4 supports SQL Server 2008 R2 and adds a vCenter plug-in for monitoring.
  • New vSphere Storage Appliance

Nearly every feature could have a dedicated blog post about it, so this is just a small snapshot of some features. Other products like SRM and vShield have also undergone major updates. Stay tuned for a lot more posts about new features.

3PAR vSphere VAAI "XCOPY" Test Results: More efficient but not faster

In my previous blog I discussed how the VMware 4.1 VAAI ‘write same’ implementation in a 3PAR T400 showed a dramatic 20x increase in performance, creating an eager zeroed thick VMDK at 10GB/sec (yes, GigaBYTES a second). The other major SCSI primitive that VAAI 4.1 leverages is XCOPY (SCSI opcode 0x83). What this does is basically offload the copy process of a VMDK to the array, so all of the data does not need to traverse your SAN or bog down your ESX host.

In this test I used the same configuration as described in my previous blog entry. I decided to perform a storage vMotion of a large VM. This VM had three VMDKs attached, to simulate real world usage. The first VMDK was 60GB and had about 5GB of operating system data on it. The next two VMDKs were eager zeroed thick disks, 70GB and 240GB, and had no user data written to them. Total VMDK size was 370GB. I initiated a storage vMotion process from vCenter 4.1 to start the copy process.

“XCOPY” without VAAI:
Host CPU Utilization: ~3916 MHz
Read/write latency: 3-4ms
3PAR host facing port aggregate throughput: 616MB/sec
3PAR back-end disk port aggregate throughput: ~0MB/sec
Time to complete: 20 minutes

These results are very reasonable, and quite expected. Since VAAI was not used, the ESXi host has to read 370GB of data, then turn it right around and write 370GB of data to the disk. So in reality over 740GB of data traversed the SAN during the 20 minute storage vMotion process. Since the VMDKs only contained 1% written data, back-end disk throughput was nearly zero because of the ASIC zero detection feature. If the VMDKs were fully populated then the back-end ports would be going crazy and the copy would be slower since all I/Os would be hitting physical disks.

“XCOPY” with VAAI:
Host CPU Utilization: ~3674 MHz
Read/write latency: 3-4ms
3PAR host facing port aggregate throughput: ~0MB/sec
3PAR back-end disk port aggregate throughput: ~0MB/sec
Time to complete: 20 minutes

Now I’m pretty surprised at these results, and not in a positive fashion. First, it’s good to see nearly zero disk I/O on the host facing ports and the back-end ports. This means VAAI commands were in fact being used, and that the VMDKs were nearly all zeros. However, what has me very puzzled is that the copy process took exactly the same amount of time to complete, and used nearly the same amount of host CPU. I repeated the tests several times, and each time I got the exact same results…20 minutes.
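
If you want to watch the offload happen in real time rather than inferring it from the array ports, esxtop on the ESXi host exposes VAAI counters. From memory (keystrokes and field names may vary slightly by build):

esxtop     # press u for the disk device view, then f to add the VAAI stats fields
# watch columns such as CLONE_RD, CLONE_WR and CLONE_F while the storage vMotion runs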

Since there’s virtually no physical disk I/O going on here, I would expect a dramatic increase in storage vMotion performance. Because these results are very surprising and unexpected, I contacted 3PAR and I will see if engineering can shed some light on this situation. Other vendors claim a 25% increase in storage vMotion performance when using VAAI. Clearly 0% is less than 25%. When I get clarification on what’s going on here, I will be sure to follow up.

Update: 3PAR got back to me about my observations, and confirmed what I’m seeing is correct. With firmware 2.3.1 MU2, XCOPY doesn’t reduce the observed wall clock time to “copy” empty space in a thinly provisioned volume. But as I noted, XCOPY does leverage the zero detection feature of their ASIC, so there’s very little back-end I/O occurring for non-allocated chunklets.

So yes, the current VAAI implementation reduces the I/O strain on the SAN and disk array, but doesn’t reduce the observed time to move the empty chunklets. In my environment the I/O loads are pretty darn low, so I’d prefer the best of both worlds…efficient copies and reduced observed copy times. If 3PAR could deliver the same dramatic performance gains for XCOPY that they did for the ‘write same’ command, that would really be a big win for customers.

3PAR vSphere VAAI "Write Same" Test Results: 20x performance boost

So in my previous blog entry I wrote about how I upgraded a 3PAR T400 to support the new VMware vSphere 4.1 VAAI extensions. I did some quick tests just to confirm the array was responding to the three new SCSI primitives, and all was a go. But to better quantify the effects of VAAI I wanted to perform more controlled tests and share the results.

Environment
First let me give you a top level view of the test environment. The host is an 8 core HP ProLiant blade server with a dual port 8Gb HBA, dual 8Gb SAN switches, and two quad port 4Gb FC host facing cards in the 3PAR (one per controller). The ESXi server was only zoned to two ports on each of the 4Gb 3PAR cards, for a total of four paths. The ESXi 4.1 Build 320092 server was configured with native round robin multi-pathing. The presented LUNs were 2TB in size, zero detect enabled, and formatted with VMFS 3.46 and using an 8MB block size.
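
For reference, switching a LUN to round robin in ESXi 4.1 can be done from the CLI as well as the GUI; a rough sketch (the naa ID is a placeholder for your own device):

esxcli nmp device list     # list devices and their current path selection policy
esxcli nmp device setpolicy --device naa.xxxxxxxx --psp VMW_PSP_RR     # set round robin on a specific LUN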

Testing Methodology
My testing goal was to exercise the XCOPY (SCSI opcode 0x83) and write same (SCSI opcode 0x93) primitives. To test the write same extension I created large eager zeroed thick disks, which forces ESXi to write zeros across the entire VMDK. Normally this would take a lot of SAN bandwidth and time to transfer all of those zeros. Unfortunately I can’t provide screen shots because the system is in production, so you will have to take my word for the results.
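
Creating the test disks from the ESXi shell is a one-liner with vmkfstools; a sketch of an equivalent command (the datastore path is a placeholder):

vmkfstools -c 240G -d eagerzeroedthick /vmfs/volumes/Datastore01/vaai-test/test-240g.vmdk     # forces zeros to be written across the full 240GB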

“Write Same” Without VAAI:
70GB VMDK 2 minutes 20 seconds (500MB/sec)
240GB VMDK 8 minutes 1 second (498MB/sec)
1TB VMDK 33 minutes 10 seconds (502MB/sec)

Without VAAI the ESXi 4.1 host is sending a total 500MB/sec of data through the SAN and into the 4 ports on the 3PAR. Because the T400 is an active/active concurrent controller design, both controllers can own the same LUN and distribute the I/O load. In the 3PAR IMC (InForm Management console) I monitored the host ports and all four were equally loaded around 125MB/sec.

This shows that round-robin was functioning, and highlights the very well balanced design of the T400. But this configuration is what everyone has been using for the last 10 years…nothing exciting here, unless you want to weigh down your SAN and disk array with processing zeros. Boorrrringgg!!

Now what is interesting, and very few arrays support, is a ‘zero detect’ feature where the array is smart enough on thin provisioned LUNs to not write data if the entire block is all zeros. So in the 3PAR IMC I was monitoring the back-end disk facing ports and sure enough, virtually zero I/O. This means the controllers were accepting 500MB/sec of incoming zeros, and writing practically nothing to disk. Pretty cool!
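
I was watching these ports in the IMC GUI, but the same numbers are available from the InForm CLI if you prefer; roughly (option names from memory, so double-check against the CLI reference):

statport -host     # throughput on the host facing ports
statport -disk     # throughput on the back-end disk ports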

“Write Same” With VAAI: 20x Improvement
70GB VMDK 7 seconds (10GB/sec) 
240GB VMDK 24 seconds (10GB/sec)
1TB VMDK 1 minute 23 seconds (12GB/sec)

Now here’s where your juices might start flowing if you are a storage and VMware geek at heart. When performing the exact same VMDK create functions on the same host using the same LUNs, performance was increased 20x!! Again I monitored the host facing ports on the 3PAR, and this time I/O was virtually zero, and thanks to zero detection within the array, almost zero disk I/O. Talk about a major performance increase. Instead of waiting over 30 minutes to create a 1TB VMDK, you can create one in less than 90 seconds and place no load on your SAN or disk array. Most other vendors are only claiming up to 10x boost, so I was pretty shocked to see a consistent 20x increase in performance.

In conclusion I satisfied myself that 3PAR’s implementation of the “write same” command coupled with their ASIC based zero detection feature drastically increases creation performance of eager zeroed VMDK files. Next up will be my analysis of the XCOPY command, which has some interesting results that surprised me.

Update: I saw on the vStorage blog that they did a similar comparison on the HP P4000 G2 iSCSI array. Of course the array configuration can dramatically affect performance, so this is not an apples to apples comparison. But nevertheless, I think the raw data is interesting to look at. For the P4000 the VAAI performance increase was only 4.4x, not the 20x of the 3PAR. In addition, the VMDK creation throughput is drastically slower on the P4000.

Without VAAI:
T400 500MB/sec vs P4000 104MB/sec (T400 4.8x faster)

With VAAI:
T400 10GB/sec vs P4000 458MB/sec (T400 22x faster)

3PAR VAAI Upgrade is a cakewalk

For those of you using vSphere 4.1, one of the cool new features is VAAI support. What is VAAI? VAAI is a deep level of integration between select storage arrays and the ESX kernel. The three VAAI functions released in 4.1 are:

  • Atomic Test & Set (ATS), which is used during creation of files on the VMFS volume
  • Clone Blocks/Full Copy/XCOPY, which is used to copy data
  • Zero Blocks/Write Same, which is used to zero-out disk regions

Arrays need firmware updates to support these enhanced SCSI commands. Since vSphere 4.1 was released, storage vendors have been shipping firmware updates for their arrays. Today I upgraded our 3PAR T400 to their 2.3.1 MU2 code base, which has VAAI support. Like I blogged about back in February, the 3PAR upgrades are fully non-disruptive, fairly straightforward, and not so complicated that they require professional services.

I found a script which makes the verification, enabling, and disabling of the features a simple one-liner, and it can be found here. For a little trivia, there was supposed to be a fourth VAAI SCSI primitive, ‘thin provision stun’. I bet a Star Trek fan came up with that feature name. Basically this feature enables the array to tell ESX that a thin LUN has run out of physical disk space, and ESX will ‘stun’ the affected VMs so they don’t crash or corrupt data. But as the rumor goes, there was some miscommunication between VMware and various partners, so not all partners implemented or certified the stun primitive. To put everyone on a level playing field the fourth primitive was dropped. I would expect it to make an appearance in a future release.
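
If you’d rather check or toggle the primitives by hand instead of using the script, they map to three host advanced settings. A sketch using esxcfg-advcfg (setting names as I recall them for 4.1, so verify on your own host):

esxcfg-advcfg -g /DataMover/HardwareAcceleratedMove     # XCOPY / full copy
esxcfg-advcfg -g /DataMover/HardwareAcceleratedInit     # write same / block zero
esxcfg-advcfg -g /VMFS3/HardwareAcceleratedLocking     # ATS
esxcfg-advcfg -s 0 /DataMover/HardwareAcceleratedMove     # -s 0 disables, -s 1 enables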

Due to time constraints and the approaching weekend, I didn’t have time to run any vSphere tests and look at SCSI stats to verify the VAAI commands are working. That will come over the next week or two, and I plan to blog on the results.

For those of you looking at buying new storage arrays and using them with VMware, one of the basic checklist features you should use as screening criteria is VAAI support. Finally, NetApp has a great PDF that goes into good details on how VAAI works and the use cases. While it contains some NetApp specific information, the majority of the document is a good read for anyone interested in VAAI.