NVM Express Has Transformed Fast Storage in the Cloud – and it’s Coming to Enterprise

Why NVM Express is essential to the modernization of storage


If we go back to 2009, the work to define NVM Express (NVMe) had just begun. Around that same time, Microsoft Azure made its first major purchase of Serial ATA (SATA) based solid-state drives (SSDs) to help accelerate its storage servers by offloading the storage commit log. Microsoft knew that non-volatile memory (e.g., NAND Flash) would become very important for cloud computing, although it was still very expensive and restricted to a limited set of applications. It was also clear to Microsoft that the SATA interface would be unable to keep up with the performance of future high-performance SSDs.

In 2011, just as the first revision of NVM Express was published, a competing standards proposal called SCSI Express came on the scene. SCSI Express was designed to use the traditional SCSI command set (including three decades of legacy infrastructure for hard drives) on top of the PCI Express interface. Looking at the technical merits, there was consensus in the Microsoft Windows and Azure teams that NVMe was the better interface definition for NAND Flash and would scale better to future NVM technologies (e.g., 3D XPoint™ Technology, MRAM, etc.). Microsoft worried that competing standards would lead to many headaches in the market. To avoid bifurcation, Microsoft Azure decided at the time to support the NVMe standard, to collaborate with other industry leaders promoting NVMe (including Cisco, Dell, Intel, Micron, Microsemi, NetApp, Oracle, Samsung, Seagate, and WD), and to invest in NVMe to ensure it met the needs of the Cloud. In this article, we describe the journey of the past five years, which has led to NVMe being the predominant SSD interface used by Microsoft Azure today.

The foundation of NVM Express SSDs is the PCI Express* physical interface. At the time that NVMe was under definition, there were several proprietary PCIe* based SSDs in the market, the most notable of which was from Fusion-io. The proprietary PCIe* SSDs provided storage IOPS and throughput that were incredible for the time, and came at an impressive price premium (e.g., more than $10/GB). One challenge was that the proprietary PCIe* SSDs placed a portion of their logic in drivers that were specific to the SSD vendor. This approach was difficult to scale, as each SSD vendor needed a unique storage driver for each operating system and often had a unique feature set.

For PCIe* based SSDs to gain broad market adoption, a standard interface from the operating system down was required. NVM Express is THE standard host controller interface (i.e., storage driver interface) that is built on top of the PCI Express bus protocol. Very capable NVMe drivers are in place across all major OSes, enabling use of any NVMe device. And, as NVMe is extended with new features, these capabilities can easily be supported via extensions to the existing software. Microsoft has led the way with support for NVM Express in Microsoft Windows* for both server and client.

Figure 1: NVMe as a foundational component of the storage software stack

Source: Intel

With faster non-volatile memory technologies on the horizon, NVMe was designed from the beginning for low latency. NAND Flash is very fast in comparison to hard drives (e.g., reads take ~ 100 microseconds versus 10 milliseconds). However, as we look forward to 3D XPoint™ Technology, the media itself can be read in under a microsecond. This means that the storage protocol layered on top of next generation NVM technologies needs to be as efficient and streamlined as possible – enabling Cloud, Enterprise, and IT customers to take advantage of the significant industry investment in NVM Express with these new NVM technologies.
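To make the latency argument concrete, the short sketch below computes how much of each I/O is spent outside the media for the three latencies quoted above. The fixed 10 microsecond software and protocol overhead is an assumption chosen only for illustration, not a measured value, and the 1 microsecond figure is used as an upper bound for "under a microsecond".

```python
# Media read latencies quoted in the text, in microseconds.
MEDIA_LATENCY_US = {
    "hard drive": 10_000.0,      # ~10 milliseconds
    "NAND Flash": 100.0,         # ~100 microseconds
    "next-generation NVM": 1.0,  # "under a microsecond"; 1 us used as an upper bound
}

# ASSUMPTION: a fixed per-I/O software and protocol cost, picked for illustration only.
STACK_OVERHEAD_US = 10.0


def overhead_share(media_us: float, overhead_us: float = STACK_OVERHEAD_US) -> float:
    """Return the fraction of total I/O latency spent outside the media itself."""
    return overhead_us / (media_us + overhead_us)


if __name__ == "__main__":
    for name, media_us in MEDIA_LATENCY_US.items():
        share = overhead_share(media_us)
        print(f"{name:>20}: {share:6.1%} of each I/O is stack overhead")
```

Under that assumption the same overhead is negligible for a hard drive, noticeable for NAND Flash, and dominant for sub-microsecond media, which is why the efficiency of the protocol matters more as the media gets faster.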

At its core, NVM Express enables the parallelism of the underlying SSDs to shine. The interface allows the device to expose tens to thousands of independent command queues, so each software thread can submit work to the SSD through its own queue with no synchronization or locks shared with other software threads. Each command queue may have tens to thousands of commands outstanding. We have already seen many cases where command queues are used directly by user-level applications, avoiding the overhead of kernel transitions. In addition to the parallelism, each individual command is only 64 bytes in size, and the mandatory command set has fewer than 20 commands. This enables the interface to be automated effectively in hardware, providing even higher IOPS and lower latency.
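The queue model described above can be sketched in a few lines of Python. This is a toy model, not a driver: the field layout follows the common 64-byte command format as we understand it from the NVMe specification and should be checked against the specification before use, and `QueuePair`, `build_read`, and the buffer address are hypothetical names and values introduced only for this illustration. A real implementation also needs completion queues, doorbell registers, and DMA-capable memory.

```python
import struct
import threading
from collections import deque

# 64-byte submission queue entry, little-endian: opcode, flags, command id,
# namespace id, reserved, metadata pointer, two data pointers (PRP1, PRP2),
# then six command-specific dwords (CDW10..CDW15).
SQE = struct.Struct("<BBHIQQQQ6I")
assert SQE.size == 64

OPCODE_READ = 0x02  # Read opcode in the NVM command set


def build_read(cid: int, nsid: int, lba: int, blocks: int, buf_addr: int) -> bytes:
    """Pack one Read command into 64 bytes (block count is zero-based)."""
    cdw10 = lba & 0xFFFFFFFF          # starting LBA, low 32 bits
    cdw11 = (lba >> 32) & 0xFFFFFFFF  # starting LBA, high 32 bits
    cdw12 = (blocks - 1) & 0xFFFF     # number of logical blocks minus one
    return SQE.pack(OPCODE_READ, 0, cid, nsid, 0, 0, buf_addr, 0,
                    cdw10, cdw11, cdw12, 0, 0, 0)


class QueuePair:
    """Toy submission queue that is owned by exactly one thread, so no lock."""

    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.entries = deque()
        self.next_cid = 0

    def submit_read(self, nsid: int, lba: int, blocks: int, buf_addr: int) -> int:
        if len(self.entries) >= self.depth:
            raise BufferError("submission queue full")
        cid = self.next_cid
        self.next_cid = (self.next_cid + 1) & 0xFFFF
        self.entries.append(build_read(cid, nsid, lba, blocks, buf_addr))
        return cid


def worker(qp: QueuePair, base_lba: int, count: int) -> None:
    # Each thread touches only its own queue; nothing is shared between threads.
    for i in range(count):
        qp.submit_read(nsid=1, lba=base_lba + i * 8, blocks=8, buf_addr=0x1000)


if __name__ == "__main__":
    queues = [QueuePair(depth=1024) for _ in range(4)]
    threads = [threading.Thread(target=worker, args=(qp, n * 1_000_000, 100))
               for n, qp in enumerate(queues)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print([len(qp.entries) for qp in queues])
```

Because every thread owns its queue outright, submission never contends with other threads, and the fixed 64-byte entry is simple enough to parse in hardware.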

Figure 2: NVMe exploits the parallelism of the NAND Flash die within the SSD

Source: Intel

The parallelism and scalability of NVMe lead to real-world performance benefits. As an example, let’s look at IOs per second (IOPS), a measure of random IO performance. Serial ATA was capped at ~150,000 IOPS due to architectural choices (e.g., interrupt processing on one core) that limit host efficiency. NVMe has already been demonstrated to exceed 3,000,000 IOPS by ensuring a highly parallel and scalable architecture. Almost all NAND Flash in Microsoft Azure data centers is a shared resource. We found that when we combined customer workloads on SATA based SSDs, it was difficult to deliver consistent performance to all customers at the same time. The increased throughput and IOPS capabilities of NVMe based SSDs reduce this problem by an order of magnitude. We can offer higher total throughput to our customers while achieving lower variation in latency.
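One way to see why deep, parallel queues matter for these numbers is Little's law, which relates throughput, latency, and the number of commands in flight. The sketch below is a back-of-the-envelope estimate only; it assumes a ~100 microsecond average command latency (the NAND Flash read figure quoted earlier in this article) and ignores everything else in the system.

```python
def commands_in_flight(iops: float, latency_us: float) -> float:
    """Little's law: average concurrency = throughput x latency."""
    return iops * latency_us / 1_000_000


if __name__ == "__main__":
    # ASSUMPTION: ~100 us average latency per command, for illustration only.
    latency_us = 100.0
    for label, iops in (("SATA cap", 150_000), ("NVMe demo", 3_000_000)):
        depth = commands_in_flight(iops, latency_us)
        print(f"{label}: {iops:,} IOPS needs ~{depth:.0f} commands outstanding")
```

Under that assumption, sustaining 3,000,000 IOPS requires on the order of 300 commands outstanding at once, compared with about 15 for 150,000 IOPS, which is the kind of concurrency that many independent queues are designed to supply.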

Microsoft Azure purchases almost exclusively NVMe SSDs today, and NAND Flash is one of the biggest investments made in data center hardware. The cost of NAND Flash purchased for the data center is the single largest commodity spend. However, the demands of the Cloud market are constantly changing, and thus Microsoft, Intel, and other industry leaders are working together in the NVM Express standards committee to add features and capabilities to this solid foundation to meet the evolving challenges.

One of the challenges in building Cloud computing infrastructure is managing the resources in the data center used to build the services that Microsoft Azure provides (e.g., hosting customer data). There is a proliferation of solutions leveraging NVMe SSDs to provide dedicated in-rack flash arrays, known as Just a Bunch of Flash (JBOF). An NVMe JBOF often consists of about twenty NVMe SSDs with a storage controller that connects to the data center fabric using Ethernet. This enables disaggregation of high-performance NVMe storage for cloud services, where storage is in its own set of “boxes” and compute is in a separate set of “boxes”. Storage performance and features can then scale at their own pace, with storage being swapped out independently of CPUs and other traditional server resources. One issue was that each of these JBOFs used its own proprietary mechanism to communicate with the VM hosting server. The NVMe committee developed the NVMe over Fabrics standard, released in June of 2016, to standardize using NVM Express as the protocol across a data center fabric (like Ethernet) to a set of PCI Express-based NVMe SSDs. This enables NVM Express SSDs to more easily scale from about twenty SSDs in the VM hosting server to thousands of SSDs across a data center.

Figure 3: Using NVMe across the datacenter, including over a fabric like Ethernet

Source: Intel

Another challenge for the Cloud is that SSDs are becoming very large (e.g., 16 TB in a single device). Most applications running in the public cloud leverage Virtual Machines (VMs) for deploying computation. Today, the way that storage is provided to a VM is through a mechanism called para-virtualization. In para-virtualization, the operating system hosting the VM creates the fiction of a storage device through software. The VM then implements a virtual storage device by leveraging local and/or remote storage resources. The cost of the fictitious software storage device becomes a bottleneck as the underlying storage device becomes faster. The NVMe committee has delivered standardized support for hardware virtualization capabilities in NVMe revision 1.3, released in the first half of 2017. This allows all VM types running in the cloud, whether on the many flavors of Linux or different versions of Windows, to benefit from the performance of NVMe storage via hardware virtualization support leveraging standard NVMe drivers.

Adopting a new storage interface standard in volume is roughly a five-year endeavor. The NVM Express standard was complete in 2011, with the first devices shipping in 2014 and broad deployments in 2016. NVM Express has already reached the “tipping point” in the Cloud and is quickly becoming the standard interface for SSDs in Enterprise and IT. In 2017, it is projected that more NAND Flash will ship in NVMe SSDs than in SATA or SAS based SSDs. Make sure your Enterprise and IT investments consider NVM Express: you can’t afford to miss out on NVMe in your infrastructure, whether with NAND Flash today or next generation NVM technologies in the future. Learn more at nvmexpress.org.

Source: Intel

Figure 4: NVM Express, SATA, and SAS Adoption in Gigabytes of NAND