Among the many improvements in VMware vSphere 5.0 is a new means to load balance virtual machine datastores: the Storage Distributed Resource Scheduler, or Storage DRS. Like its pre-existing compute load-balancing counterpart (plain old DRS), Storage DRS promises to make managing large virtualization environments easier by taking some of the guesswork out of virtualization storage provisioning. But proceed with caution: Storage DRS can have unintended consequences that may prevent the customers who could most benefit from it from being able to use it at all.
Storage DRS allows you to group multiple datastores into datastore clusters. Those clusters can consist of either block-level VMFS volumes or NFS mount points -- providing welcome support for NFS, which has been neglected in previous VMware storage tech releases. Once a cluster is configured, virtual machines can be load balanced across all of its datastores based on available capacity, datastore performance, or both. The load balancing can operate autonomously, or it can simply recommend a change when it thinks you should make one and wait for you to approve it -- providing an easy way to determine how it will work in your environment before you let it off its leash.
Just as standard DRS makes use of vMotion to move virtual machines from one host to another, Storage DRS utilizes Storage vMotion to move virtual machines from one datastore to another. As with many other components of vSphere, Storage vMotion has also seen significant changes in vSphere 5.0.
Most important, Storage vMotion now makes use of a single-pass copy combined with a kernel-level mirror driver to synchronize mid-vMotion storage writes as they're being made. Previous versions used an iterative copy mechanism that leveraged Changed Block Tracking (CBT). While this CBT-based method worked much more efficiently than previous snapshot-based mechanisms, it still could take a very long time to migrate VMs with significant I/O loads. Because the new implementation copies the disk only once and mirrors in-flight writes to both source and destination, it does not have to chase blocks that change during the copy. Storage vMotion also now supports migrating virtual machines that have snapshots or linked clones associated with them -- a feature not found in earlier implementations.
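To see why an iterative copy struggles with busy VMs, here is a deliberately simplified model, not VMware's actual algorithm. It assumes a constant copy rate and a constant guest write rate, and that every write during a pass dirties a block that must be recopied. The function name, the parameters, and the 1GB cutoff are all hypothetical.

```python
from typing import Optional


def iterative_copy_gb(disk_gb: float, copy_mbps: float, write_mbps: float,
                      cutoff_gb: float = 1.0, max_passes: int = 50) -> Optional[float]:
    """Total GB moved by a simplified iterative (changed-block) copy.

    Returns None when the guest dirties data at least as fast as it can
    be copied, in which case the passes never shrink in this model.
    """
    if write_mbps >= copy_mbps:
        return None
    remaining = disk_gb
    total = 0.0
    passes = 0
    while remaining > cutoff_gb and passes < max_passes:
        total += remaining
        # Data dirtied while this pass ran becomes the next pass's workload.
        remaining = remaining * write_mbps / copy_mbps
        passes += 1
    # Final small pass, assumed to happen while the VM is briefly paused.
    return total + remaining


def single_pass_copy_gb(disk_gb: float) -> float:
    """With a mirror driver, each block is bulk-copied once; writes that
    arrive mid-copy are mirrored to both sides instead of being recopied."""
    return disk_gb


# Illustrative numbers only: 500GB disk, 200MB/s copy rate, 50MB/s of guest writes.
iterative = iterative_copy_gb(500, copy_mbps=200, write_mbps=50)
single = single_pass_copy_gb(500)
print(f"iterative: {iterative:.1f} GB moved, single pass: {single:.1f} GB moved")
```

In this model the iterative approach moves roughly a third more data at a 4:1 copy-to-write ratio, and the overhead grows without bound as the write rate approaches the copy rate.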
Performance load balancing and SIOC
Load balancing based on performance is substantially more complex than the fairly simple capacity-based load balancing. Fortunately, VMware has done a relatively good job of allowing you to modify many of the parameters that influence how the performance load balancing works and see the results in the form of latency and throughput metrics. However, understanding what it's doing requires a solid grasp of the underlying mechanisms, most of which are built on VMware's Storage I/O Control, or SIOC.
Making its first appearance in vSphere 4.1, SIOC allows you to set storage priorities on individual virtual disks, which influence how much transactional storage throughput each is allowed when storage resources are constrained. Critical to that is detecting whether resources are being constrained in the first place -- a judgment based on whether the average latency for a given datastore has exceeded a defined latency threshold. Once that threshold is exceeded for even a few seconds, the number of I/O requests each VM is allowed to queue is constrained in proportion to its storage priority (storage "shares"). This effectively prevents a single VM from swamping a datastore with I/O and drowning out other potentially more important VMs.
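The share-based throttling described above can be sketched in a few lines. This is an illustration of proportional allocation, not SIOC's real implementation. The function name, the 70-slot queue, the share values, and the 30ms threshold are assumptions chosen for the example.

```python
def allocate_queue_slots(shares, total_slots, latency_ms, threshold_ms=30.0):
    """Return a per-VM limit on queued I/O requests.

    shares:      dict of VM name -> storage share value (hypothetical numbers)
    total_slots: queue slots available on the datastore (assumed figure)
    latency_ms:  observed average datastore latency
    """
    if latency_ms <= threshold_ms:
        # No contention detected, so no VM is throttled.
        return {vm: total_slots for vm in shares}
    total_shares = sum(shares.values())
    # Under contention, split the slots in proportion to each VM's shares,
    # always leaving every VM at least one slot.
    return {vm: max(1, total_slots * s // total_shares)
            for vm, s in shares.items()}


vm_shares = {"db": 2000, "web": 1000, "test": 500}
print(allocate_queue_slots(vm_shares, total_slots=70, latency_ms=12.0))
print(allocate_queue_slots(vm_shares, total_slots=70, latency_ms=45.0))
# second call -> {'db': 40, 'web': 20, 'test': 10}
```

Below the threshold, every VM can use the full queue. Above it, the "db" VM gets four times the queue depth of "test", which is what keeps a low-priority VM from drowning out a more important one.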
Much of this same existing SIOC tech is used by Storage DRS with a few important additions. When Storage DRS is first activated, it injects a range of storage workloads onto the datastores and monitors SIOC statistics to get a rough idea of what kind of performance the datastores will be capable of under load. This provides Storage DRS with a way to recommend where to initially place a virtual machine.
After virtual machines are running, Storage DRS constantly monitors SIOC statistics to determine whether any of the datastores in the cluster are routinely latency constrained over a long period of time (16 hours by default). If they are, it will iteratively migrate virtual machines to other datastores to balance the load across all datastores.
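As a rough sketch of that monitoring loop, the function below averages latency samples gathered over an evaluation window and pairs each constrained datastore with a less-loaded one. It is a toy model: the real Storage DRS algorithm weighs far more than a simple average, and the function name, the 15ms threshold, and the sample data are assumptions made for illustration.

```python
from statistics import mean


def recommend_moves(latency_history, threshold_ms=15.0):
    """Pair each latency-constrained datastore with a less-loaded one.

    latency_history: dict of datastore name -> list of latency samples (ms)
                     collected over the evaluation window.
    Returns a list of (source, destination) datastore pairs.
    """
    averages = {ds: mean(samples) for ds, samples in latency_history.items()}
    # Datastores over the threshold, hottest first.
    constrained = sorted(
        (ds for ds, avg in averages.items() if avg > threshold_ms),
        key=lambda ds: averages[ds], reverse=True)
    # Datastores with headroom, coolest first.
    candidates = sorted(
        (ds for ds, avg in averages.items() if avg <= threshold_ms),
        key=lambda ds: averages[ds])
    # Recommend moving load from the hottest datastores to the coolest ones.
    return list(zip(constrained, candidates))


history = {
    "ds1": [20, 25, 30],   # routinely above the threshold
    "ds2": [5, 6, 4],
    "ds3": [10, 12, 11],
}
print(recommend_moves(history))   # [('ds1', 'ds2')]
```

In recommendation mode, pairs like these would be surfaced for approval. In autonomous mode, each would trigger a Storage vMotion.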
Because judgments on where to place and migrate VMs are based entirely on observed datastore latency, creating multiple datastores on the same virtualized disk array (or aggregate, disk group, RAID group, and so on) might link the transactional performance potential (and thus latency) of those datastores to each other in ways that Storage DRS can't predict. Load balancing these kinds of datastores isn't likely to be particularly effective. Fortunately, creating multiple volumes on a single group of physical disks has often been a function of the partition size limitations present in VMFS3 -- which the introduction of VMFS5 has largely made unnecessary.
Storage DRS also doesn't have visibility into other more advanced storage array technologies such as deduplication and automated tiering. In some cases, constantly shuffling virtual machines among multiple datastores can end up reducing the effectiveness of those features by constantly resetting block-based performance data that an array has built by observing I/O patterns over time or placing additional load on the inline dedupe engine used by some arrays.
However, the largest pitfall for Storage DRS can be found in environments where array-based replication is being used to implement site failover capability. Just about any storage array out there will see the Storage vMotion of a 500GB virtual machine as 500GB of new writes that need to be replicated to the other site -- soaking up a massive amount of site-to-site WAN bandwidth in the process.
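A quick back-of-the-envelope calculation shows how much WAN time a single migration can consume. The link speeds and the `usable_fraction` parameter are assumptions for illustration. Real replication throughput also depends on compression, deduplication, and competing traffic, none of which is modeled here.

```python
def replication_hours(vm_size_gb, wan_mbps, usable_fraction=1.0):
    """Hours needed to push vm_size_gb of new writes across a WAN link.

    vm_size_gb:      data rewritten by the Storage vMotion, in decimal GB
    wan_mbps:        nominal link speed in megabits per second
    usable_fraction: share of the link available to replication (assumed)
    """
    bits = vm_size_gb * 8 * 1000**3
    seconds = bits / (wan_mbps * 1_000_000 * usable_fraction)
    return seconds / 3600


# A 500GB VM over a dedicated 100Mbps link, then over half of that link.
print(f"{replication_hours(500, 100):.1f} hours")
print(f"{replication_hours(500, 100, usable_fraction=0.5):.1f} hours")
```

On a dedicated 100Mbps link, that one 500GB move takes about 11 hours to replicate, and a load balancer that migrates several VMs a day can easily outrun the link.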
Even if you're fortunate enough to have sufficient bandwidth to support that kind of replication turnover, environments using VMware's Site Recovery Manager have even more to worry about. SRM doesn't officially support the use of Storage vMotion on protected virtual machines since there are short windows during a migration when SRM won't know where a virtual machine actually is -- preventing adequate protection. SRM also doesn't support vSphere Replication (a new vSphere/SRM 5.0 feature) in combination with Storage DRS. You can get some details on both of those SRM-related issues and potential work-arounds from this blog post by VMware's Cormac Hogan.
Putting it all together