In my previous blog I discussed how the VMware 4.1 VAAI ‘write same’ implementation in a 3PAR T400 showed a dramatic 20x increase in performance, creating an eager zeroed thick VMDK at 10GB/sec (yes, GigaBYTES a second). The other major SCSI primitive that VAAI 4.1 leverages is XCOPY (SCSI opcode 0x83). What this does is basically offload the copy process of a VMDK to the array, so all of the data does not need to traverse your SAN or bog down your ESX host.
In this test I used the same configuration as described in my previous blog entry. I decided to perform a storage vMotion of a large VM. This VM had three VMDKs attached, to simulate real world usage. The first VMDK was 60GB and had about 5GB of operating system data on it. The next two VMDKs were eager zerod thick disks, 70GB and 240GB, and had no user data written to them. Total VMDK size is 370GB. I initiated a storage vMotion process from vCenter 4.1 to start the copy process.
“XCOPY” without VAAI:
Host CPU Utilization: ~3916 MHz
Read/write latency: 3-4ms
3PAR host facing port aggregrate throughput: 616MB/sec
3PAR back-end disk port aggregrate throughput: ~0MB/sec
Time to complete: 20 minutes
These results are very reasonable, and quite expected. Since VAAI was not used, the ESXi host has to read 370GB of data, then turn it right around and write 370GB of data to the disk. So in reality over 740GB of data traversed the SAN during the 20 minute storage vMotion process. Since the VMDKs only contained 1% written data, back-end disk throughput was nearly zero because of the ASIC zero detection feature. If the VMDKs were fully populated then the back-end ports would be going crazy and the copy would be slower since all I/Os would be hitting physical disks.
“XCOPY” with VAAI:
Host CPU Utilization: ~3674 MHz
Read/write latency: 3-4ms
3PAR host facing port aggregrate throughput: ~0MB/sec
3PAR back-end disk port aggregrate throughput: ~0MB/sec
Time to complete: 20 minutes
Now I’m pretty surprised at these results, and not in a positive fashion. First, it’s good to see nearly zero disk I/O on the host facing ports and the back-end ports. This means in fact VAAI commands were being used, and that the VMDKs were nearly all zeros. However what has me very puzzled is that the copy process took exactly the same amount of time to complete, and used nearly the same amount of host CPU. I repeated the tests several times, and each time I got the exact same results…20 minutes.
Since there’s virtually no physical disk I/O going on here, I would expect a dramatic increase in storage vMotion performance. Because these results are very surprising and unexpected, I contacted 3PAR and I will see if engineering can shed some light on this situation. Other vendors claim a 25% increase in storage vMotion performance when using VAAI. Clearly 0% is less than 25%. When I get clarification on what’s going on here, I will be sure to follow up.
Update: 3PAR got back to me about my observations, and confirmed what I’m seeing is correct. With firmware 2.3.1 MU2 XCOPY doesn’t reduce the observed wall clock time to “copy” empty space in a thinly provisioned volume. But as I noted, XCOPY does leverage the zero detection feature of their ASIC so there’s very little back-end I/O occuring for non-allocated chunklets.
So yes the current VAAI implementation reduces the I/O strain on the SAN and disk array, but doesn’t reduce the observed time to move the empty chunklets. In my environment the I/O loads are pretty darn low, so I’d prefer the best of both worlds…efficient copies and reduced observed copy times. If 3PAR could make the same dramatic performance gains of the ‘write same’ command for the XCOPY command, that would really be a big win for customers.
 
 