3PAR vSphere VAAI "Write Same" Test Results: 20x performance boost

So in my previous blog entry I wrote about how I upgraded a 3PAR T400 to support the new VMware vSphere 4.1 VAAI extensions. I did some quick tests just to confirm the array was responding to the three new SCSI primitives, and all was a go. But to better quantify the effects of VAAI I wanted to perform more controlled tests and share the results.

Environment
First let me give you a top-level view of the test environment. The host is an 8-core HP ProLiant blade server with a dual-port 8Gb HBA, dual 8Gb SAN switches, and two quad-port 4Gb FC host-facing cards in the 3PAR (one per controller). The ESXi server was only zoned to two ports on each of the 4Gb 3PAR cards, for a total of four paths. The ESXi 4.1 Build 320092 server was configured with native round robin multipathing. The presented LUNs were 2TB in size, zero detect enabled, and formatted with VMFS 3.46 using an 8MB block size.
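
For what it’s worth, you can sanity check this sort of layout from the ESXi console. On 4.1 the following shows the path selection policy per device and enumerates the individual FC paths (a generic sketch, nothing array-specific):

    esxcli nmp device list   # each 3PAR LUN should show VMW_PSP_RR as its PSP
    esxcfg-mpath -l          # lists every path; four per LUN in this setup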

Testing Methodology
My testing goal was to exercise the XCOPY (SCSI opcode 0x83) and write same (SCSI opcode 0x93) primitives. To test the write same extension, I wanted to create large eager zeroed disks, which forces ESXi to write zeros across the entire VMDK. Normally this takes a lot of SAN bandwidth and time to transfer all of those zeros. Unfortunately I can’t provide screenshots because the system is in production, so you will have to take my word for the results.
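
If you want to reproduce this kind of test, one way is straight from the ESXi console with vmkfstools, which can create eager zeroed thick disks of arbitrary size (a sketch; the datastore path and file names below are made up, and the shell’s time builtin captures the elapsed time):

    time vmkfstools -c 70G   -d eagerzeroedthick /vmfs/volumes/datastore1/test/ezt-70g.vmdk
    time vmkfstools -c 240G  -d eagerzeroedthick /vmfs/volumes/datastore1/test/ezt-240g.vmdk
    time vmkfstools -c 1024G -d eagerzeroedthick /vmfs/volumes/datastore1/test/ezt-1t.vmdk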

“Write Same” Without VAAI:
70GB VMDK: 2 minutes 20 seconds (500MB/sec)
240GB VMDK: 8 minutes 1 second (498MB/sec)
1TB VMDK: 33 minutes 10 seconds (502MB/sec)
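
As an aside, ESXi 4.1 exposes per-host advanced settings for each VAAI primitive, which is one way to capture a non-VAAI baseline like this on a VAAI-capable array (a sketch using the stock setting name for block zeroing; I’m not claiming this exact sequence):

    esxcfg-advcfg -s 0 /DataMover/HardwareAcceleratedInit   # disable the write same primitive
    esxcfg-advcfg -s 1 /DataMover/HardwareAcceleratedInit   # re-enable it afterwards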

Without VAAI the ESXi 4.1 host is sending a total of 500MB/sec of data through the SAN and into the four ports on the 3PAR. Because the T400 has an active/active concurrent controller design, both controllers can own the same LUN and distribute the I/O load. In the 3PAR IMC (InForm Management Console) I monitored the host ports, and all four were equally loaded at around 125MB/sec.

This shows that round robin was functioning, and highlights the very well balanced design of the T400. But this configuration is what everyone has been using for the last 10 years… nothing exciting here, unless you want to weigh down your SAN and disk array with processing zeros. Boorrrringgg!!
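
If you need to flip a LUN to round robin yourself on ESXi 4.1, it’s one command per device, with an optional tweak to switch paths every I/O instead of the default 1,000 (a sketch; the naa ID is a placeholder, and the iops=1 setting is a common recommendation for active/active arrays rather than something I tested here):

    esxcli nmp device setpolicy --device naa.50002ac0001234567 --psp VMW_PSP_RR
    esxcli nmp roundrobin setconfig --device naa.50002ac0001234567 --type iops --iops 1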

Now what is interesting, and what very few arrays support, is a ‘zero detect’ feature, where the array is smart enough not to write data to a thin provisioned LUN when an entire block is all zeros. So in the 3PAR IMC I was monitoring the back-end disk-facing ports and, sure enough, there was virtually zero I/O. This means the controllers were accepting 500MB/sec of incoming zeros and writing practically nothing to disk. Pretty cool!
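
If you prefer the InForm CLI to the IMC, the same before/after observation can be made with the stat commands (quoted from memory, so treat the exact options as an assumption):

    statport -host   # host-facing ports: roughly 125MB/sec each during the non-VAAI runs
    statport -disk   # back-end disk-facing ports: nearly idle, courtesy of zero detect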

“Write Same” With VAAI: 20x Improvement
70GB VMDK: 7 seconds (10GB/sec)
240GB VMDK: 24 seconds (10GB/sec)
1TB VMDK: 1 minute 23 seconds (12GB/sec)
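
Those throughput figures are nothing more exotic than size divided by elapsed time. Quick shell arithmetic if you want to check my math (using MB = 1024^2 bytes):

    echo $(( 70 * 1024 / 7 ))       # 10240 MB/sec, ~10GB/sec
    echo $(( 240 * 1024 / 24 ))     # 10240 MB/sec, ~10GB/sec
    echo $(( 1024 * 1024 / 83 ))    # 12633 MB/sec, ~12GB/sec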

Now here’s where your juices might start flowing if you are a storage and VMware geek at heart. When performing the exact same VMDK create operations on the same host using the same LUNs, performance increased 20x!! Again I monitored the host-facing ports on the 3PAR, and this time I/O was virtually zero, and thanks to zero detection within the array, there was almost zero disk I/O as well. Talk about a major performance increase. Instead of waiting over 30 minutes to create a 1TB VMDK, you can create one in less than 90 seconds and place no load on your SAN or disk array. Most other vendors are only claiming up to a 10x boost, so I was pretty shocked to see a consistent 20x increase in performance.

In conclusion, I satisfied myself that 3PAR’s implementation of the “write same” command, coupled with their ASIC-based zero detection feature, drastically increases the creation performance of eager zeroed VMDK files. Next up will be my analysis of the XCOPY command, which produced some interesting results that surprised me.

Update: I saw on the vStorage blog that they did a similar comparison on the HP P4000 G2 iSCSI array. Of course the array configuration can dramatically affect performance, so this is not an apples-to-apples comparison. But nevertheless, I think the raw data is interesting to look at. For the P4000 the VAAI performance increase was only 4.4x, not the 20x of the 3PAR. In addition, VMDK creation throughput is drastically slower on the P4000.

Without VAAI:
T400 500MB/sec vs P4000 104MB/sec (T400 4.8x faster)

With VAAI:
T400 10GB/sec vs P4000 458MB/sec (T400 22x faster)


Comments

Anonymous
August 25, 2011 6:30 pm

Hi Derek, I have a semi-related question for you. I have a large VM environment attached to a T800. In the past our 3PAR SE has repeatedly told us that partition alignment isn’t a concern considering that the datastores are formatted with VMFS3. We had a meeting with our VMware reps today and they said we absolutely need to be concerned with alignment on the guest OS volumes. We are prepping a test for tomorrow, but I am curious if you align all host partitions? We will concentrate on the heavy hitting, I/O intensive VMs first, but I don’t…

Derek
August 25, 2011 6:35 pm

Your 3PAR SE rep is wrong, sorry to say. VMFS3 has no relationship to whether the guest OS has aligned I/O or not. Server 2003 and earlier DO NOT properly align volumes, whereas Server 2008 and later DO. Citrix has a good whitepaper on VDI I/O here: http://support.citrix.com/article/CTX130632 You can check out another blog I wrote about disk alignment here: https://www.derekseaman.com/2011/06/align-your-partitions-with-vmware.html Where they state: “In order to minimize the utilization of the storage sub-systems, it is best practice to fully align the file systems at all layers (i.e. VM, Hypervisor, Storage).” “For optimal performance, the starting offset of a file…
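
As a generic illustration (mine, not from either link above): on a Windows guest you can check partition alignment with wmic; a StartingOffset that is a multiple of 1MB (1048576 bytes) is safely aligned, while the classic misaligned Server 2003 value is 32256:

    wmic partition get Name,StartingOffset

On misaligned Server 2003-era guests the usual fix is recreating the partition with diskpart’s align parameter (e.g. create partition primary align=1024) and restoring the data.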

Derek
October 14, 2011 3:03 pm

By observation I confirmed that the commands were working. For example, when creating an EZT VMDK I monitored the SAN switch ports and 3PAR ports for I/O activity, and there was practically none. Same thing when doing a Storage vMotion: no fabric traffic to speak of. ESXTOP can also list the number of hardware locks per second, so I was able to confirm locking worked as well.

Anonymous
November 7, 2011 5:12 am

Can anyone comment on what parameters they use for sector aligning a Windows partition sitting on a 3PAR F400?

AJ
April 18, 2012 8:30 am

Pity though, that the UNMAP command has to be turned off in vSphere 5 due to performance issues caused by it. When creating/deleting VMDKs from the vSphere client, the response time of our F400 array goes up to 300+ms, depending on the size of the disk created. We have UNMAP turned off at the host side now, which sucks because now I have to manually zero out the free disk space of the datastores when VMs are deleted or Storage vMotioned. Even more strange is the fact that HP support wasn’t able to figure this out; I ended up resolving the issue…
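
For reference, the host-side UNMAP switch in vSphere 5 is the VMFS3.EnableBlockDelete advanced setting; assuming that is the knob AJ means, disabling it looks like this:

    esxcli system settings advanced set --int-value 0 --option /VMFS3/EnableBlockDelete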

Thomas
February 16, 2015 3:26 pm

Hi Derek

Just to check, what is the block size of 3PAR storage (7400)? We have quite a number of Windows 2003 servers; will OS partition misalignment be an issue on this storage?

thanks
Thomas