So in my previous blog entry I wrote about how I upgraded a 3PAR T400 to support the new VMware vSphere 4.1 VAAI extensions. I did some quick tests just to confirm the array was responding to the three new SCSI primitives, and all was a go. But to better quantify the effects of VAAI I wanted to perform more controlled tests and share the results.
First let me give you a top level view of the test environment. The host is an 8 core HP ProLiant blade server with a dual port 8Gb HBA, dual 8Gb SAN switches, and two quad port 4Gb FC host facing cards in the 3PAR (one per controller). The ESXi server was only zoned to two ports on each of the 4Gb 3PAR cards, for a total of four paths. The ESXi 4.1 Build 320092 server was configured with native round robin multi-pathing. The presented LUNs were 2TB in size, zero detect enabled, and formatted with VMFS 3.46 and using an 8MB block size.
My testing goal was to exercise the XCOPY (SCSI opcode 0x83) and write same (SCSI opcode 0x93). To test the write same extension, I wanted to create large eager zeroed disks, which forces ESXi to write all zeros to the entire VMDK. Normally this would take a lot of SAN bandwidth and time to transfer all of those zeros. Unfortunately I can’t provide screen shots because the system is in production, so you will have to take my word for the results.
“Write Same” Without VAAI:
70GB VMDK 2 minutes 20 seconds (500MB/sec)
240GB VMDK 8 minutes 1 second (498MB/sec)
1TB VMDK 33 minutes 10 seconds (502MB/sec)
Without VAAI the ESXi 4.1 host is sending a total 500MB/sec of data through the SAN and into the 4 ports on the 3PAR. Because the T400 is an active/active concurrent controller design, both controllers can own the same LUN and distribute the I/O load. In the 3PAR IMC (InForm Management console) I monitored the host ports and all four were equally loaded around 125MB/sec.
This shows that round-robin was functioning, and highlights the very well balanced design of the T400. But this configuration is what everyone has been using the last 10 years..nothing exciting here except if you want to weight down your SAN and disk array with processing zeros. Boorrrringgg!!
Now what is interesting, and very few arrays support, is a ‘zero detect’ feature where the array is smart enough on thin provisioned LUNs to not write data if the entire block is all zeros. So in the 3PAR IMC I was monitoring the back-end disk facing ports and sure enough, virtually zero I/O. This means the controllers were accepting 500MB/sec of incoming zeros, and writing practically nothing to disk. Pretty cool!
“Write Same” With VAAI: 20x Improvement
70GB VMDK 7 seconds (10GB/sec)
240GB VMDK 24 seconds (10GB/sec)
1TB VMDK 1 minute 23 seconds (12GB/sec)
Now here’s where your juices might start flowing if you are a storage and VMware geek at heart. When performing the exact same VMDK create functions on the same host using the same LUNs, performance was increased 20x!! Again I monitored the host facing ports on the 3PAR, and this time I/O was virtually zero, and thanks to zero detection within the array, almost zero disk I/O. Talk about a major performance increase. Instead of waiting over 30 minutes to create a 1TB VMDK, you can create one in less than 90 seconds and place no load on your SAN or disk array. Most other vendors are only claiming up to 10x boost, so I was pretty shocked to see a consistent 20x increase in performance.
In conclusion I satisfied myself that 3PAR’s implementation of the “write same” command coupled with their ASIC based zero detection feature drastically increases creation performance of eager zeroed VMDK files. Next up will be my analysis of the XCOPY command, which has some interesting results that surprised me.
Update: I saw on the vStorage blog they did a similar comparison on the HP P4000 G2 iSCSI array. Of course the array configuration can dramatically affect performance, so this is not an apples to apples comparison. But nevertheless, I think the raw data is interesting to look at. For the P4000 the VAAI performance increase was only 4.4x better, not the 20x of the 3PAR. In addition, the VDMK creation throughput is drastically slower on the P4000.
T400 500MB/sec vs P4000 104MB/sec (T400 4.8x faster)
T400 10GB/sec vs P4000 458MB/sec (T400 22x faster)