Satya Nadella’s Microsoft is often described as a “new Microsoft,” and if any one part of the company embodies that description, it’s the team building the Microsoft Azure cloud platform. Azure’s collection of services isn’t designed to support only Windows developers or Windows applications; it’s built on open APIs and standards, making it available to anyone wanting to build a cloud-hosted or cloud-serviced application.
One of the more important parts of Azure is perhaps its least known: DocumentDB. Designed to be a global-scale NoSQL database, it’s the back end for many of Microsoft’s own services. There’s a lot to like in DocumentDB, from its MongoDB-compatible APIs and low latency to its innovative approach to consistency when working across multiple datacenters.
DocumentDB: A database for a distributed world
Concurrency and consistency are the classic problems for anyone building distributed systems. How can you guarantee that all your users see the same view of the data, or at least see the data they need to use right now, when data is being written by many thousands of users in regions that can be continents apart?
That’s where the two most commonly used consistency models come into play: strong and eventual.
- Strong consistency means that your applications wait until data is replicated across every instance of your database, ensuring that everyone accessing the data gets the same view—but preventing new writes until that consistent view is achieved. It also means your data needs to stay in the same Azure geographical region.
- Eventual consistency is a much more relaxed approach, in which users get access to the current state of their local database instance—so there’s no guaranteed level of consistency. It’s fast, but your queries may not give you the latest data.
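The trade-off between these two models can be sketched in a few lines. The code below is a toy simulation, not DocumentDB code: a strong write blocks until every replica has applied it, while an eventual write acknowledges immediately and lets replication catch up later.

```python
# Toy illustration (not DocumentDB code) of the strong-vs-eventual
# consistency trade-off across database replicas.

class Replica:
    def __init__(self):
        self.data = {}

class ReplicatedStore:
    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending = []          # writes not yet propagated everywhere

    def write_strong(self, key, value):
        # Strong consistency: the write completes only after every
        # replica has applied it, so all subsequent reads agree.
        for r in self.replicas:
            r.data[key] = value

    def write_eventual(self, key, value):
        # Eventual consistency: acknowledge after updating one replica;
        # the rest catch up later when replicate() runs.
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def replicate(self):
        # Background propagation step.
        for key, value in self.pending:
            for r in self.replicas[1:]:
                r.data[key] = value
        self.pending.clear()

    def read(self, replica_index, key):
        return self.replicas[replica_index].data.get(key)

store = ReplicatedStore()
store.write_strong("a", 1)
print(store.read(2, "a"))      # 1: every replica already agrees

store.write_eventual("b", 2)
print(store.read(2, "b"))      # None: replica 2 hasn't caught up yet
store.replicate()
print(store.read(2, "b"))      # 2: consistent once replication runs
```

The eventual write returns before replication completes, which is exactly why it’s fast—and exactly why a read against a lagging replica can miss the latest data.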
DocumentDB introduces two new consistency models that have proven to be very popular with developers building apps that use the service.
The first option, bounded staleness, lets you put a limit on how far reads can lag behind writes, in one of two ways: by a maximum number of versions of the data, or by a time interval after which all the data must be consistent. You can choose, for example, to ensure your data is always consistent after 20 seconds, or that you’re at most two versions of a document behind the last write. Because you’re putting explicit bounds on how far DocumentDB’s replication between instances can lag, there’s no limit on the number of Azure geographic regions you can use for your data.
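The version-count flavor of bounded staleness can be sketched as a toy simulation: a read replica is allowed to trail the primary by at most a fixed number of versions, and any write that would exceed that bound forces replication to catch up first. This is an illustration of the idea, not DocumentDB internals.

```python
# Toy sketch of the version-count flavor of bounded staleness: the read
# replica may lag the primary by at most MAX_LAG versions; a write that
# would exceed that bound forces the replica to catch up first.

MAX_LAG = 2   # "at most two versions behind the last write"

class BoundedStalenessStore:
    def __init__(self):
        self.primary_log = []        # ordered list of (key, value) writes
        self.replica_applied = 0     # how many log entries the replica has
        self.replica_data = {}

    def lag(self):
        return len(self.primary_log) - self.replica_applied

    def write(self, key, value):
        self.primary_log.append((key, value))
        # Enforce the staleness bound: catch the replica up if needed.
        while self.lag() > MAX_LAG:
            k, v = self.primary_log[self.replica_applied]
            self.replica_data[k] = v
            self.replica_applied += 1

    def read_replica(self, key):
        return self.replica_data.get(key)

store = BoundedStalenessStore()
store.write("doc", "v1")
store.write("doc", "v2")
print(store.lag())                 # 2: still within the bound
store.write("doc", "v3")           # forces one entry to replicate
print(store.lag())                 # back to 2
print(store.read_replica("doc"))   # "v1": at most two versions behind
```

The time-interval flavor works the same way, with the bound expressed as a maximum age of unreplicated writes rather than a count of versions.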
The second option is session consistency, where consistency guarantees are tied to a client session: within a session, you’re guaranteed to read your own writes. That’s a relevant option when you’re employing a cloud database as a back end to handle user data, with data replicated across instances and regions while each user is connected to only one instance. The result is a database that responds quickly to both reads and writes, but this approach requires you to think carefully about what data you’re storing in it.
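The read-your-own-writes guarantee can be sketched with a session token, loosely mirroring how DocumentDB clients pass one along. In this toy version (not SDK code), each write returns a token recording the primary’s log position, and a read that presents the token catches the replica up to at least that point before answering.

```python
# Toy sketch of session consistency: each write returns a session token
# (here, the primary's log position); a read presenting that token makes
# the replica catch up to at least that point, so the session always
# reads its own writes, even while other sessions may see older data.

class SessionStore:
    def __init__(self):
        self.primary_log = []
        self.replica_applied = 0
        self.replica_data = {}

    def write(self, key, value):
        self.primary_log.append((key, value))
        return len(self.primary_log)          # session token = log position

    def _catch_up(self, upto):
        while self.replica_applied < upto:
            k, v = self.primary_log[self.replica_applied]
            self.replica_data[k] = v
            self.replica_applied += 1

    def read(self, key, session_token=0):
        # A session read insists the replica reflects this session's writes.
        self._catch_up(min(session_token, len(self.primary_log)))
        return self.replica_data.get(key)

store = SessionStore()
token = store.write("profile", {"name": "Ada"})
print(store.read("profile"))            # None: another session may lag
print(store.read("profile", token))     # {'name': 'Ada'} within the session
```

Only the session that holds the token pays the catch-up cost, which is why this model keeps both reads and writes fast for everyone else.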
You can even mix and match consistency approaches using different DocumentDB instances for different parts of an application. User and session data, where write and read speed are important, can be handled by session consistency, while other aspects of an application where writes are less critical and you’re looking for fast reads can be handled via bounded staleness.
The realities of working with DocumentDB
Microsoft has designed DocumentDB to be elastic, able to scale up and down as you add new databases and collections to an account. Each database you create is made up of collections of JSON-formatted data, as well as a list of users. It’s certainly highly scalable: Microsoft itself has databases with thousands of collections containing terabytes of data.
One note: Although Microsoft talks about databases containing users, DocumentDB’s notion of a user is different from the familiar one. Rather than mapping to an individual account, a DocumentDB user is more abstract, perhaps best thought of as a name for an access-control policy, so you can manage which applications and services have access to which data and how they can use it.
Collections can contain, well, anything. DocumentDB is intended to be schema-free, and content is automatically indexed as you add it to a collection. That flexibility can make data harder to query, because you need to know what you’re looking for. That’s why, in practice, you’re likely to use the common NoSQL key/value pattern for your data, giving your queries known keys to work with.
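In practice the key/value pattern looks something like the hypothetical documents below: every JSON document carries an `id` plus a `type` discriminator, so even with no schema, queries can always target known keys. The document shapes and the tiny `query` helper are illustrative, not DocumentDB’s query API.

```python
import json

# Hypothetical documents following the NoSQL key/value pattern: each one
# carries an "id" and a "type" discriminator, so queries can target known
# keys even though the collection itself enforces no schema.

orders = [
    {"id": "order-1001", "type": "order", "customer": "C42", "total": 18.50},
    {"id": "order-1002", "type": "order", "customer": "C7",  "total": 99.00},
    {"id": "cust-C42",   "type": "customer", "name": "Ada Lovelace"},
]

def query(collection, **criteria):
    # Stand-in for a database query: filter documents on known keys.
    return [d for d in collection
            if all(d.get(k) == v for k, v in criteria.items())]

print(query(orders, type="order", customer="C42"))
# all "order" documents belonging to customer C42
print(json.dumps(query(orders, id="cust-C42")[0]))
```

Because documents of different shapes live side by side in one collection, the `type` key is what keeps an order query from accidentally matching a customer record.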
Like much of Azure, DocumentDB is a pay-as-you-go service, so you’ll need to keep a close watch on what you’re using in terms of storage and bandwidth. (Microsoft offers a mix of pricing options, for scalable databases and for fixed sizes with fixed performance levels.) There’s a cloud sandbox to help you design and build queries, and there’s a local emulator you can use to design your database before deploying anything on Azure. But don’t think about using the emulator in production: It’s not designed to scale and will handle only a few containers.
Applications communicate with the service over a set of REST APIs, which you can call directly. In practice, though, you’re much more likely to use one of Microsoft’s SDKs, which include .NET and Node.js, as well as Java and Python. It’s a lot easier building and delivering JSON documents via an SDK than assembling them by hand and handling asynchronous calls and responses directly. There’s sample code on GitHub to help you get started.
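If you do call the REST APIs directly, part of what the SDKs spare you is building the authorization header yourself: each request carries a token derived from your account’s master key via HMAC-SHA256 over the verb, resource type, resource link, and request date. The sketch below follows that documented scheme; the key and resource names are made-up placeholders, and a real request would also need your account endpoint and the matching `x-ms-date` header.

```python
import base64
import hashlib
import hmac
import urllib.parse
from email.utils import formatdate

# Sketch of the documented master-key authorization scheme for direct
# DocumentDB REST calls: an HMAC-SHA256 signature over the HTTP verb,
# resource type, resource link, and request date. The key and database
# name below are placeholders, not real credentials.

def auth_token(verb, resource_type, resource_link, date, master_key):
    key = base64.b64decode(master_key)
    payload = "{}\n{}\n{}\n{}\n\n".format(
        verb.lower(), resource_type.lower(), resource_link, date.lower())
    sig = base64.b64encode(
        hmac.new(key, payload.encode("utf-8"), hashlib.sha256).digest())
    return urllib.parse.quote("type=master&ver=1.0&sig=" + sig.decode())

date = formatdate(usegmt=True)      # RFC 1123 date for the x-ms-date header
fake_key = base64.b64encode(b"not-a-real-key").decode()
token = auth_token("GET", "dbs", "dbs/mydb", date, fake_key)
print(token)    # URL-encoded value for the Authorization header
```

Every SDK performs this dance for you on each call, which is one concrete reason they’re easier than hand-rolled REST.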
A DocumentDB document is really a JSON blob
The biggest problem Microsoft has with DocumentDB is its name. Most of us aren’t used to calling blobs of JSON-formatted data “documents,” when they can hold anything you can encode in JSON and deliver via REST to an API. Calling it DocumentDB makes it seem that all you can store in it are Office files and raw text, when DocumentDB is really the flexible back end you need to build a modern born-in-the-cloud application.
DocumentDB has often been described within Microsoft as a “planetary-scale database.” As we build more and more software in the cloud, using more and more distributed application concepts, a planet-scale database makes a lot of sense. All we need to do is design our applications to take advantage of it, thinking about how we prioritize both reads and writes across our distributed code.