Why traditional storage QoS methods aren’t good enough

Only a flash array can deliver guaranteed, predictable storage performance -- and only if it was designed to do so

Storage quality of service (QoS) is critical for companies and service providers who want to deliver consistent primary storage performance to business-critical applications. For many applications, raw performance is not the major challenge. It’s often much more important to ensure that performance is both predictable and guaranteed. This is virtually impossible in traditional hybrid or disk-based storage infrastructures, no matter how advanced the chosen QoS method might claim to be.

Overprovisioning storage capacity (that is, adding disk drives or spindles) has long been the default for data center managers attempting to ensure high performance. This approach not only wastes resources, but also requires an often unwieldy admin layer to work around the limitations of existing storage architectures. Worse, overprovisioning drives is a purely reactive measure. It makes it impossible to scale up capacity when performance is constrained, and it ultimately doesn’t deliver predictable performance in any case.

Despite these fundamental problems with the underlying storage architecture, many storage providers still try to address performance through traditional QoS methods. Unfortunately, when multiple workloads are sharing a limited resource, even an advanced QoS architecture will fail to prevent resource-hungry applications (aka “noisy neighbors”) from draining performance from the other applications on the same system.

Traditional QoS methods and misconceptions

Many popular QoS methods used in storage today are simply inadequate. They rely on common operational functions such as tiering, rate limiting, prioritization, hypervisor-based QoS, and caching. However, these functions do not add up to a comprehensive, ground-up QoS solution.

Tiered storage combines different types of storage media -- from superfast SSDs down to slow HDDs running at 15,000 rpm or even 7,200 rpm -- to deliver different levels of performance and capacity. As a result, application performance depends on which medium currently holds the data. To optimize that performance, predictive algorithms analyze historical information to determine which data is “hot” and which is “cold” -- that is, which data is accessed most frequently and which is not. Based on that analysis, the algorithms then move data to a high-performance SSD tier or a high-capacity HDD tier.

To put it bluntly, tiering simply cannot be used to create a viable storage QoS for predictable performance. In fact, tiering gives preferential treatment to noisy neighbors by classifying their data as hot and moving it to the SSDs. All other applications then have to make do with the lower-performance hard disk drives. On top of that, performance varies dramatically for individual workloads as the algorithms move data between media.
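A toy sketch makes the effect easy to see. It is not any vendor's tiering algorithm: the function name, the access counts, and the SSD slot budget are all assumptions chosen for illustration. Blocks are ranked purely by how often they were accessed, and the hottest ones are placed on SSD.

```python
from collections import Counter

def place_on_tiers(access_log, ssd_slots):
    """Put the most frequently accessed blocks on SSD, the rest on HDD.

    access_log: iterable of (app, block) tuples, one per IO.
    ssd_slots:  hypothetical number of blocks the SSD tier can hold.
    """
    heat = Counter(access_log)                      # IO count per block
    ranked = [key for key, _ in heat.most_common()]  # hottest first
    return set(ranked[:ssd_slots]), set(ranked[ssd_slots:])

# A noisy app hammers 4 blocks; a quiet app touches 4 blocks occasionally.
log = [("noisy", b) for b in range(4)] * 100 + [("quiet", b) for b in range(4)] * 5
ssd, hdd = place_on_tiers(log, ssd_slots=4)

ssd_apps = {app for app, _ in ssd}
hdd_apps = {app for app, _ in hdd}
print("SSD tier:", ssd_apps, "HDD tier:", hdd_apps)
```

With these made-up numbers, every SSD slot goes to the noisy application and every block of the quiet one lands on disk, which is the preferential treatment described above.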

Rate limiting, on the other hand, sets fixed limits on the IO or bandwidth that each application can consume. This reduces the problem of noisy neighbors, but only by limiting the maximum performance for each application. Rate limiting can also result in significant latency, because IO that exceeds the limit has to be held back. This one-sided approach protects the storage system, but it doesn’t deliver guaranteed QoS -- nor is it possible to set IOPS minimums.
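One common way to implement such a cap is a token bucket. The sketch below is a minimal illustration, not a description of any particular product; the class name, the 100 IOPS limit, and the burst size of 10 are assumptions. Time is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Caps the IO rate of one volume. All names and numbers are illustrative."""

    def __init__(self, max_iops, burst):
        self.rate = max_iops   # tokens added per second
        self.capacity = burst  # most tokens that can accumulate
        self.tokens = burst
        self.last = 0.0

    def allow(self, now):
        """Return True if an IO arriving at time `now` (seconds) may proceed."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the IO has to wait or be rejected, which adds latency

bucket = TokenBucket(max_iops=100, burst=10)
# 50 IOs arrive in the same instant: only the burst allowance gets through.
admitted = sum(bucket.allow(now=0.0) for _ in range(50))
# One second later the bucket has refilled, but only up to its capacity.
admitted_later = sum(bucket.allow(now=1.0) for _ in range(50))
print(admitted, admitted_later)
```

Both counts come out at 10. Note what the bucket cannot do: it enforces a ceiling, but nothing in it ensures that the underlying system can actually deliver those IOs, so it provides no floor.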

With prioritization, applications are classed in terms of their importance relative to other apps in the system, usually with clearly defined rankings such as “mission critical,” “moderate,” or “low.” While prioritization can deliver a higher relative performance for some applications, there is no guarantee that the required level of performance will be provided. Also, noisy neighbors can actually get louder if they are prioritized ahead of other applications.
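A strict-priority queue shows how this plays out. The sketch below is an assumption-laden toy (the `serve` function, the priority numbers, and the request counts are invented for illustration), but it captures the basic behavior: requests are served in priority order until the interval's capacity is used up.

```python
import heapq

def serve(requests, capacity):
    """Serve up to `capacity` IOs in one interval, most important first.

    requests: list of (priority, app) tuples; a lower number is more important.
    """
    queue = list(requests)
    heapq.heapify(queue)
    served = {}
    for _ in range(min(capacity, len(queue))):
        _, app = heapq.heappop(queue)
        served[app] = served.get(app, 0) + 1
    return served

# A "mission critical" app floods the queue; a "low" app needs just 20 IOs.
pending = [(0, "critical")] * 500 + [(2, "low")] * 20
result = serve(pending, capacity=300)
print(result)
```

Here the high-priority application consumes all 300 IOs and the low-priority application is served none of its 20. The ranking decides who goes first, not how much anyone is guaranteed.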

Hypervisor-based QoS takes the latency and response times of individual virtual machines and uses them as a basis for setting thresholds beyond which the system limits the IO rate for the respective machine. The problem with a hypervisor, though, is that it has limited control over the underlying storage resources. Placing a storage QoS mechanism on the hypervisor layer and not on the storage itself does very little to address the challenges of multiworkload environments. The key issues to consider with a hypervisor-centric approach in front of traditional storage are the lack of IOPS control, possible performance degradation, the risk of forced overprovisioning, and general problems with coordination and orchestration.

Caching stores the hottest data in large DRAM or flash-based caches, which can offload a significant amount of IO from the disks. This is why large DRAM caches are standard on every modern disk-based storage system. But while caching certainly increases the overall throughput of the spinning disk system, it also causes highly variable latency. The overall performance of an individual application is strongly influenced by how cache-friendly it is, how large the cache is, and how many other applications are sharing the cache. In a dynamic cloud environment, the last of these factors is in constant flux. As a result, it is impossible to predict, much less guarantee, the performance of any individual application in a system based on caching.
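The dependence on neighbors can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of any real array: it assumes a shared cache with least-recently-used (LRU) eviction, a hypothetical cache size of 10 blocks, and two synthetic workloads.

```python
from collections import OrderedDict

def hit_rate(trace, cache_size, app):
    """Replay an IO trace through a shared LRU cache; return `app`'s hit rate."""
    cache = OrderedDict()
    hits = total = 0
    for owner, block in trace:
        key = (owner, block)
        hit = key in cache
        if hit:
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used entry
        if owner == app:
            total += 1
            hits += hit
    return hits / total

# App "a" loops over 8 blocks, which fit comfortably in a 10-block cache.
alone = [("a", i % 8) for i in range(800)]
# The same workload, interleaved with a neighbor that streams through new blocks.
shared = []
for i in range(800):
    shared.append(("a", i % 8))
    shared.append(("b", i))

alone_rate = hit_rate(alone, 10, "a")
shared_rate = hit_rate(shared, 10, "a")
print(alone_rate, shared_rate)
```

Application "a" issues exactly the same IO in both runs. Alone, it hits the cache 99 percent of the time; next to the streaming neighbor, its blocks are evicted before they are reused and its hit rate falls to zero.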

Guaranteed QoS is an architecture

The fundamental issue with the above methods of storage QoS is that they have not been designed from the ground up to deliver guaranteed, predictable performance.

The only way to achieve effective storage QoS is to include it as an integral part of the system design. Only then is it possible to guarantee performance in all situations, including failure scenarios, system overload, variable workloads, and elastic demand. A genuinely future-proof architecture requires six core components to achieve real storage QoS.

First, real storage QoS requires an all-SSD architecture because only flash enables the delivery of consistent latency for every IO. Second, it has to be a true scale-out architecture, one that allows for linear (predictable) performance gains as systems scale. Third, the system must provide RAID-less data protection that ensures predictable performance in any failure condition, because with RAID, any drive failure necessarily impacts performance.

Fourth, the system must balance load distribution in a way that eliminates hot spots that create unpredictable IO latency. Fifth, the system must provide fine-grained QoS controls (with support for minimums, maximums, and temporary bursts in performance) that completely eliminate noisy neighbors and guarantee volume performance. Finally, the system must provide what amounts to “performance virtualization” -- namely, the ability to separate the provisioning of performance from the provisioning of capacity.
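To make the idea of minimums and maximums concrete, here is a minimal allocation sketch. It is not SolidFire's implementation; the function, the volume names, and every number are hypothetical, burst credits are left out for brevity, and spare capacity is handed out smallest request first, which is an arbitrary choice for the example. It also assumes each volume's minimum is no larger than its maximum.

```python
def allocate_iops(volumes, system_iops):
    """Split `system_iops` among volumes, honoring per-volume min and max.

    volumes: dict of name -> {"min": int, "max": int, "demand": int}
    """
    # Step 1: every volume gets its guaranteed minimum (or less, if it asks for less).
    grant = {name: min(v["min"], v["demand"]) for name, v in volumes.items()}
    spare = system_iops - sum(grant.values())
    if spare < 0:
        raise ValueError("minimums exceed what the system can deliver")

    # Step 2: hand out what is left, never exceeding a volume's max or its demand.
    want = {name: min(v["max"], v["demand"]) - grant[name]
            for name, v in volumes.items()}
    for name in sorted(want, key=want.get):  # smallest outstanding request first
        extra = min(want[name], spare)
        grant[name] += extra
        spare -= extra
    return grant

volumes = {
    "db":    {"min": 1000, "max": 2000, "demand": 1500},
    "web":   {"min": 500,  "max": 1000, "demand": 800},
    "noisy": {"min": 500,  "max": 3000, "demand": 50000},
}
grants = allocate_iops(volumes, system_iops=4000)
print(grants)
```

In this example the database and web volumes receive everything they ask for (1,500 and 800 IOPS), and the noisy volume receives the remaining 1,700 IOPS rather than the 50,000 it demands. Because performance is assigned per volume in IOPS, it is provisioned independently of how much capacity each volume holds.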

Simply adding QoS features to existing storage platforms may eliminate some performance bottlenecks, but it’s not enough to meet the much broader challenges found in dynamic cloud environments. A truly effective solution requires a purpose-built architecture with integrated QoS features and the ability to solve performance problems comprehensively rather than one at a time. This will ensure that each application always has the resources it needs to run without variance or interruption.

Dave Wright is founder and CEO of SolidFire, a maker of all-flash storage arrays for cloud infrastructures.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.