All-flash storage arrays (AFAs) are receiving lots of attention these days, and for good reason. Compared to spinning disks, AFAs provide dramatically better performance, take up less floor space, and can even offer advantages in overall cost of ownership. Nevertheless, the cost per gigabyte is still relatively high, which means that AFAs should not be deployed for every application workload. The obvious questions are which workloads are most cost-effectively deployed on flash storage and which vendor has the optimal flash storage product for your workloads.
We at Load DynamiX, a provider of storage performance validation solutions, have invested a great deal of time and energy in these questions. Our focus has been to develop advanced workload modeling and load generation solutions for both storage technology vendors and IT organizations.
Load DynamiX combines an intuitive storage workload modeling application -- Load DynamiX Enterprise -- with a purpose-built load generation appliance. The solution generates massive, highly realistic loads that stress networked storage infrastructure to its limits and beyond, helping storage architects and engineers fully understand storage system behavior and performance characteristics before purchase and deployment decisions are made.
Based on deep experience with Global 2000 companies and in collaboration with flash storage visionaries within the industry (both leading vendors and well-known analysts), Load DynamiX has created a performance validation methodology for AFAs. Below, I’ll share some of the foundational aspects of Load DynamiX’s flash performance testing and describe two specific methodologies for understanding the performance of AFAs. A more detailed AFA testing methodology can be found at the Load DynamiX website.
Performance testing of AFAs is important if you have these kinds of questions:
- Can I improve application performance with flash? By how much?
- Can I afford the performance improvement? Will dedupe/compression reduce the effective cost per gigabyte without substantially impacting performance?
- How do I select the best vendor or product?
- Which of my workloads will run best on all-flash arrays?
- How can I optimize all-flash storage configurations?
- How much does my performance degrade with dedupe, compression, snapshots, and so on?
- Where are the performance limits of potential configurations?
- How will flash storage behave when it reaches its performance limits?
- Does flash performance degrade over time?
- For which workloads should I use an all-flash array or a hybrid flash array?
While everyone agrees that the most accurate way to test performance is in a production environment, testing there is simply not practical. The next best thing is a realistic, scalable test in a lab environment. Storage engineers have had decades to refine HDD-based array testing, but only a short time to learn about flash storage.
How flash storage is different
All-flash arrays differ from traditional spinning disk arrays in behavior, performance, and often durability. For example, SSDs read and write in pages but erase in larger blocks, and each block can be erased and rewritten only a limited number of times. Sophisticated data reduction and array-wide wear-leveling techniques can dramatically increase SSD durability, in some cases beyond the expected life of a spinning disk drive.
While the performance advantages of all-flash arrays are well documented, much effort has gone into the design of modern all-flash arrays to attack the price premium of flash versus disk. Sophisticated efficiency techniques promise, for a few workloads (such as full-clone VDI), to bring the effective cost of all-flash arrays within striking distance of disk arrays, and in some cases below the effective cost of an all-disk implementation.
Here are some of the biggest ways flash arrays differ from disk arrays and how these differences affect performance testing and evaluation:
Data deduplication and data compression. These data reduction techniques reduce both data storage footprints and transmission loads (bandwidth requirements). But because deduped and compressed data must be reconstituted before it can be used, data reduction imposes additional computational costs and can therefore have a significant impact on application performance. Algorithms vary greatly, and their differences can significantly affect performance. Because the economic payoff of flash may rely on reduced storage capacity requirements, and because different vendors handle data reduction differently, the performance of a given AFA may differ widely depending on data type. Your test method and load generator must be extremely configurable for dedupe and compression.
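To make "configurable for dedupe and compression" concrete, here is a minimal Python sketch of a test-data generator with two knobs: how compressible each block is and how many blocks are unique. It is an illustration only, not how Load DynamiX or any array vendor generates data. The function names, the zero-fill approach to compressibility, and the use of zlib as a stand-in for an array's compression engine are all assumptions.

```python
import os
import zlib


def make_block(block_size: int, compressible_fraction: float) -> bytes:
    """Build one block: random bytes (incompressible) followed by zeros (compressible)."""
    if not 0.0 <= compressible_fraction <= 1.0:
        raise ValueError("compressible_fraction must be between 0 and 1")
    random_len = int(block_size * (1.0 - compressible_fraction))
    return os.urandom(random_len) + bytes(block_size - random_len)


def make_stream(num_blocks, block_size, compressible_fraction, unique_fraction):
    """Return num_blocks blocks drawn cyclically from a pool of unique blocks.

    unique_fraction=0.25 leaves roughly a 4:1 dedupe opportunity.
    Note: with compressible_fraction=1.0 every block is all zeros, so the
    pool collapses to a single distinct block.
    """
    unique_count = max(1, int(num_blocks * unique_fraction))
    pool = [make_block(block_size, compressible_fraction) for _ in range(unique_count)]
    return [pool[i % unique_count] for i in range(num_blocks)]


def zlib_ratio(data: bytes) -> float:
    """Compressed size divided by original size, with zlib as a rough proxy."""
    return len(zlib.compress(data)) / len(data)
```

For example, `make_stream(100, 4096, 0.5, 0.25)` yields 100 blocks of 4 KiB, of which 25 are distinct and each is about half compressible. A real array may detect duplicates and compress at a different granularity, so ratios measured this way are only a starting point.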
Metadata. A great deal of the internal management of flash-based arrays is aimed at optimizing the performance and reliability of the media. Array performance and scale are greatly affected by where metadata is stored and how it is used. This is a big reason you must properly precondition a flash array (that is, write to each flash cell) before testing. Without preconditioning, you are likely to get artificially fast read results.
Workload profiles and scale. Hard disk arrays are capable of IOPS in the range of many thousands. Flash-based arrays can support IOPS in the hundreds of thousands. Workload profiles for which flash-based arrays are generally deployed are very different from the classic workloads of the past. The mixed virtualized workloads for which flash-based arrays can be deployed exhibit much more variability than traditional workloads. They include both extremely random and sequential data streams; a diverse mix of block sizes and read/write ratios; a mix of compressible/dedupable blocks and noncompressible/nondedupable blocks; and hot spots.
To test flash-based arrays to performance saturation points, you must be able to generate workloads rarely if ever seen on disk-based systems. And you must be able to reproduce the right I/O and data profiles at that scale. Your load generator must be both powerful and flexible.
Overprovisioning. To improve the probability that a write operation arriving from the host has immediate access to a pre-erased block, most but not all flash products contain extra capacity. Overprovisioning is common because it can help flash designers mitigate various performance challenges that result from garbage collection and wear leveling, among other flash management activities. Overprovisioning also increases the longevity of flash arrays. You should test at or near the maximum usable capacity recommended by the AFA vendor to assess the performance benefit of overprovisioning. Typical recommendations are 90, 95, and 99 percent of capacity.
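As a small worked example of the fill levels mentioned above, the following Python sketch computes how many bytes must be written to reach each target. The function name and the assumption that you know the array's usable capacity in bytes are illustrative.

```python
def fill_targets(usable_bytes: int, percents=(90, 95, 99)) -> dict:
    """Map each fill percentage to the number of bytes to write (integer math)."""
    if usable_bytes <= 0:
        raise ValueError("usable_bytes must be positive")
    return {p: usable_bytes * p // 100 for p in percents}
```

Running the same workload at each of these fill levels and comparing latency and IOPS gives a rough view of how much headroom the array's overprovisioning actually buys.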
Hot spots. Most real-world workloads exhibit hot spots (the characteristics of temporal and spatial locality). Garbage collection, which proactively eliminates the need for whole block erasures prior to every write operation, may exacerbate hot spots (garbage collection methods differ among vendors). Therefore, testing hot spots is advised, but its importance may vary by array vendor.
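One simple way to emulate a hot spot is to send a fixed share of I/O to a small slice of the address space. The Python sketch below does this with an 80/20 split. The split, the function name, and the placement of the hot region at the start of the address space are assumptions made for illustration; real workloads usually show more gradual locality.

```python
import random


def hot_spot_offsets(num_ios, num_blocks, hot_fraction=0.2, hot_io_share=0.8, seed=0):
    """Return block offsets where hot_io_share of I/Os land in the first
    hot_fraction of the address space and the rest land elsewhere."""
    if num_blocks < 2:
        raise ValueError("need at least two blocks to separate hot and cold regions")
    rng = random.Random(seed)  # seeded so a test run can be reproduced
    # Keep at least one block in each region so both ranges are non-empty.
    hot_blocks = min(num_blocks - 1, max(1, int(num_blocks * hot_fraction)))
    offsets = []
    for _ in range(num_ios):
        if rng.random() < hot_io_share:
            offsets.append(rng.randrange(hot_blocks))              # hot region
        else:
            offsets.append(rng.randrange(hot_blocks, num_blocks))  # cold region
    return offsets
```

Multiply each offset by the block size to get a byte address for the load generator. Varying `hot_fraction` and `hot_io_share` across runs shows how sensitive a given array is to locality.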
Protocols. You may have to throw out preconceptions learned from decades of HDD system testing. Storage protocols often achieve quite different performance levels with flash. Factors such as block sizes and error correction overhead can make a big difference in throughput and IOPS. You should test all of your file and block protocols -- the rules have changed.
Software services. Replication, snapshots, clones, and thin provisioning can be very useful for improving utilization, recovery options, fail-over, provisioning, and disaster recovery. However, how these services are implemented can have a large impact on performance, and that impact must be accounted for in the testing methodology. The effects of these services may be different from what you find in HDD systems. It’s important to run workloads on newly created clones, and not merely create clones while workloads are present.
QoS at scale. Quality of service affects both infrastructure and application performance. Build and run your tests with QoS configured for how you plan to use it. As your load increases, measure the ability to deliver expected performance in mixed workload environments.
Effective cost of storage. Looking at raw cost per gigabyte is not a good way to compare storage costs. The key question to ask: How much is usable? Arrays vary widely in their conversion from raw storage to usable storage. Due to the inherent speed of flash, you can effectively use deduplication and compression to fit substantially more data on a given amount of raw storage. Also, it’s common to have to overprovision HDD storage aggressively to get the number of spindles necessary to deliver the performance required (a strategy called “short stroking”). Further, disk arrays often have to make extensive use of expensive cache memory in order to achieve performance SLAs. Finally, you must consider factors like power and space requirements. Flash typically takes a fraction of the power and space of a traditional disk-based array.
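The raw-versus-usable argument above reduces to simple arithmetic. The Python sketch below divides the purchase price by effective capacity, where effective capacity is raw capacity times the usable fraction times the data reduction ratio. The function and parameter names are assumptions, and the model deliberately leaves out power, space, and cache costs.

```python
def effective_cost_per_gb(price, raw_gb, usable_fraction, data_reduction_ratio):
    """Price divided by effective capacity.

    usable_fraction: share of raw capacity left after RAID, spares, and
        other overhead (0 < value <= 1).
    data_reduction_ratio: combined dedupe and compression ratio, e.g. 4.0
        for 4:1; use 1.0 when no data reduction applies.
    """
    if raw_gb <= 0 or data_reduction_ratio <= 0 or not 0 < usable_fraction <= 1:
        raise ValueError("capacity, usable fraction, and ratio must be positive")
    effective_gb = raw_gb * usable_fraction * data_reduction_ratio
    return price / effective_gb
```

With purely illustrative inputs (not vendor figures), a 100,000 purchase price for 10,000 GB raw at 80 percent usable works out to 12.50 per GB with no data reduction and 3.125 per GB at a 4:1 ratio, which is why the data reduction assumption matters so much.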
Of course, you need to ensure that your data reduction assumptions are realistic. Talk with your application vendors and storage vendors. Storage vendors have storage efficiency estimation tools that will give you an accurate idea of what to expect from their particular storage platforms. If you want to get a feel for how compressible your files will be, zip them and compare with unzipped sizes.
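The "zip them and compare" check can be scripted. This Python sketch uses the standard library's zlib as a stand-in for a zip tool and reports original size divided by compressed size. The function names are hypothetical, and the result is only a rough feel: arrays typically compress per block with their own algorithms, so their ratios will differ.

```python
import zlib
from pathlib import Path


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Original size divided by zlib-compressed size (higher = more compressible)."""
    if not data:
        return 1.0
    return len(data) / len(zlib.compress(data, level))


def directory_ratio(root: str) -> float:
    """Aggregate ratio over every regular file under root.

    Reads each file fully into memory, so point it at a sample of the
    data set rather than at very large files.
    """
    original = compressed = 0
    for path in Path(root).rglob("*"):
        if path.is_file():
            data = path.read_bytes()
            original += len(data)
            compressed += len(zlib.compress(data))
    return original / compressed if compressed else 1.0
```

Calling `directory_ratio` on a representative sample directory gives a first estimate to compare against what the storage vendor's estimation tool reports.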
Storage engineers and architects considering all-flash arrays for their workloads must explore the behavior of these products and, as far as possible, assess their performance in the context of their expected workloads. With a robust validation process in place, storage engineers and architects can select and configure flash storage solutions for their workloads with a clear idea of their impact on both performance and cost in production.
Performance profiling and workload modeling
There are two primary methodologies for storage performance validation: performance profiling and workload modeling. While they take different approaches and have different goals, both should observe the following three guidelines, which are essential to any meaningful performance validation of all-flash arrays:
- Precondition the array, prior to applying load, to create a state with characteristics similar to those of an aged flash storage array.
- Stress specific all-flash array behaviors, such as data reduction techniques, clones, snapshots, fail-over, replication, backups, and other enterprise features that affect performance and cost.
- Stress the array with realistic emulations of typical supported workloads.
Performance profiling is sometimes called “performance corners testing” or “multidimensional benchmarking.” It provides a very useful outline of the workload-to-performance relationship, and in some cases is sufficient to support the engineer’s decisions. The objective of intelligent performance profiling is to characterize the behavior of a storage system under a large set of workload conditions. Doing so provides the storage engineer with a map of the behavior of the storage system, making it easy to understand where sweet spots or bottlenecks may be or which workload attributes most directly affect the performance of the system. Engineers can then use this information to optimally match their workloads to storage systems.
This methodology is characterized by an automated workflow that allows the user to iterate on any of the many workload characterization attributes (load profile, block size, command mix, and so on) to stress the storage system under dozens, hundreds, or even thousands of workload configurations, with aggregation of the data and presentation of results. This can be accomplished with custom scripting or with off-the-shelf test products like Load DynamiX Enterprise. For example, Figure 1 below shows the input screen of the Iterator function in Load DynamiX Enterprise. It’s configured to run 18 sequential tests, without scripting, to test the effect of compression ratios, number of workers, and block size on three KPIs.
In Figure 2 below, we see the results. In this figure, we’ve sorted on the IOPS column to find the configuration that results in the greatest IOPS (approximately 22,014). Sorting by latency would quickly show figures exceeding 6ms for 500 or more concurrent workers.
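For readers taking the custom scripting route, the iterate-and-sort workflow might be sketched in Python as follows. This is not the Load DynamiX Iterator. The parameter values in the usage note, the KPI names, and the `run_test` callable (which you would supply to drive your own load generator) are all hypothetical.

```python
from itertools import product


def build_test_matrix(compression_ratios, worker_counts, block_sizes_kib):
    """Return one config dict per combination of the three attributes."""
    return [
        {"compression_ratio": c, "workers": w, "block_size_kib": b}
        for c, w, b in product(compression_ratios, worker_counts, block_sizes_kib)
    ]


def run_matrix(matrix, run_test):
    """Run each config in sequence and merge the returned KPIs into its row.

    run_test is a caller-supplied function that takes a config dict and
    returns a dict of KPIs, for example {"iops": ..., "latency_ms": ...}.
    """
    results = []
    for config in matrix:
        kpis = run_test(config)
        results.append({**config, **kpis})
    return results


def sort_by(results, kpi, descending=True):
    """Sort result rows on one KPI column, highest first by default."""
    return sorted(results, key=lambda row: row[kpi], reverse=descending)
```

Three compression ratios, three worker counts, and two block sizes, for instance, produce an 18-row matrix. Sorting the results on `"iops"` surfaces the best-performing configuration, and sorting ascending on a latency column surfaces the configurations to avoid.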
Workload modeling goes to a greater level of detail. Whereas the goal of performance profiling is to test under a wide range of workload conditions, the objective of workload modeling is to stress the storage system under a realistic simulation of the workloads it will actually support in production. Workload modeling requires prior knowledge of the workloads’ characteristics, usually drawn from the storage engineer’s knowledge of the application and from data provided by storage monitoring utilities.