The increasing complexity of modern storage arrays seems to know no bounds. While all of us are understandably excited when hot, new storage features arrive on the scene, it's important to remember that every one of them comes at a price. That complexity makes these systems harder to troubleshoot -- and harder for storage vendors to test fully before they hit the market.
Each customer has its own way of using a new storage product to meet specific needs, so expanding the feature set multiplies the number of scenarios the vendor needs to test against. It's almost inevitable that something will slip past the folks in QA. That's precisely what happened after the release of vSphere 5.0's new INCITS T10-based VAAI implementation, resulting in VMware hastily telling customers to manually disable the offending feature before it could cause problems.
There's a lot we can learn from this particular case. Storage is no longer an amorphous bucket you pour your data into -- it's a living, breathing entity that can have as much of an impact on your environment as your users or the applications you run. Heavier integration with hypervisors and other software enhances not only the storage's ability to solve problems, but also its ability to create them.
Thin provisioning and VMware
VMware's VAAI (vSphere APIs for Array Integration) is an umbrella term that applies to a number of extensions to the T10 SCSI standard, which are designed to allow hosts to make more intelligent use of the storage they are attached to. Among these features, the SCSI UNMAP command has a lot of promise to dramatically increase the effectiveness of thin provisioning in virtualized environments.
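You can see whether a given device advertises these primitives from the ESXi command line. As a sketch (the device identifier below is a placeholder -- substitute an ID from `esxcli storage core device list` on your own host):

```shell
# Show VAAI primitive support for a specific device on an ESXi 5.x host.
# "naa.xxxxxxxx" is a placeholder device ID, not a real identifier.
esxcli storage core device vaai status get -d naa.xxxxxxxx
```

A device whose array implements the T10 UNMAP primitive will report its Delete Status as supported in this output.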
The devil is in the details
Not long after the release of vSphere 5.0, it became clear that real-world use of the SCSI UNMAP command could have some unexpected consequences. When vSphere performs a Storage vMotion, it does a live copy of a virtual machine's disk from one datastore to another and then transitions the operation of that virtual machine to the new disk location. Once this transition is complete, the last step is to delete the old disk, thereby freeing the space on the old datastore's file system to be used for other purposes (not unlike moving a file from one drive to another on your PC).
When the disk array being used supports the T10 UNMAP extension, the vSphere host will also issue UNMAP commands for the range of disk blocks that the virtual machine's disks had been using -- freeing not only the space within the VMFS filesystem, but also the underlying disk blocks on the array. Though they seem similar, these two processes couldn't be more different.
In the first case, vSphere simply needs to mark those blocks as free in its file allocation table, which can be done almost instantaneously. In the second, depending upon the size of the virtual machine that has been moved, vSphere may have hundreds of millions of block UNMAP commands to send to the array -- a process that's often far from instantaneous and depends heavily upon how the storage vendor in question has implemented the UNMAP command within its product. If you've ever compared formatting a disk in Windows with the "quick format" option, which simply rewrites the allocation tables, against a full format, which touches every block on the disk, you've seen something similar play out.
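To get a feel for the scale involved, here's a back-of-the-envelope calculation. The numbers are illustrative assumptions -- a 1TB virtual disk reclaimed at 4KB granularity -- since actual reclamation block sizes vary by array vendor:

```shell
# Rough count of blocks an array would need to unmap for a 1 TB virtual disk,
# assuming space is tracked in 4 KB blocks (granularity varies by vendor).
DISK_BYTES=$((1024 * 1024 * 1024 * 1024))   # 1 TB
BLOCK_BYTES=4096                            # 4 KB
echo $((DISK_BYTES / BLOCK_BYTES))          # 268435456 -- over a quarter-billion blocks
```

Even if each block's reclamation is cheap, a quarter-billion of anything takes real time -- which is exactly where Storage vMotion ran into trouble.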
In the wild, this left Storage vMotion waiting for these commands to be executed -- long enough that the Storage vMotion operation would time out and fail, potentially leaving the migration in an inconsistent state. Because support for this and other VAAI commands is automatically detected and enabled, VMware's only option was to issue a blanket declaration removing support for the UNMAP command until the problem could be worked through with storage vendors and a permanent solution found.
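In practice, the interim fix was an advanced host setting. A sketch of the commands from VMware's guidance for affected vSphere 5.0 hosts (verify the option name against VMware's current documentation before touching a production host):

```shell
# Check whether automatic space reclamation (UNMAP) is enabled on this host.
esxcli system settings advanced list --option /VMFS3/EnableBlockDelete

# Disable it (0 = off), per VMware's interim guidance for vSphere 5.0.
esxcli system settings advanced set --int-value 0 --option /VMFS3/EnableBlockDelete
```

Later vSphere releases shipped with this behavior disabled by default, leaving reclamation to be run as a deliberate, administrator-initiated operation instead.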