Get started with Azure hyperscale databases

Microsoft builds on its existing Azure database tool to deliver hyperscale versions of PostgreSQL and Azure SQL

One of the advantages of the cloud is scale. We don’t call the big three cloud platforms hyperscale for nothing; they have massive data centers all around the world with millions of servers that can be treated as pools of compute and storage. A modern distributed application can run across many cores of compute, each with its own memory, all addressing terabytes of storage. We’re abstracted away from all the physical infrastructure that makes up the cloud, treating everything we use as another set of services.

Combining that approach with some of the newer cloud hardware, like the servers behind Azure's massive 128-core M-series VMs, which can address as much as 4TB of memory, has changed the type of applications we can build. Instead of limiting our code to fit the servers we have, we can build applications that take advantage of the available resources in the cloud. Even pooling standard VMs gives us a platform where we can build large-scale systems capable of working with truly big data.

The cloud and big, big data

That capability is perhaps the real value of the big cloud providers; their economies of scale mean they can purchase storage at a much lower cost than we can for our own data centers. With tools like Azure's various Data Box devices, we can link on-premises data sources to cloud services, either by shifting files wholesale or by connecting on-premises networks to cloud storage. The prospect of delivering large amounts of data to the cloud is interesting because it mixes the data generation capabilities of modern business systems with the processing capabilities of the cloud.

If we can get our data to the cloud, how can we work with it? Until recently, much of the work done on cloud-scale data processing focused on using tools like Bigtable and Hadoop to analyze nonrelational data at scale. By using alternative data structures, we were able to process large amounts of data quickly, distributing our analysis across many compute nodes. Building on the technologies used to deliver consumer search engines such as Bing or Google has worked well for many classes of problem and many data sets. But it's not what we need to work with the structured data in our line-of-business applications.

Relational databases like SQL Server and its cloud sibling Azure SQL are familiar, powerful tools. Handling large structured stores, they're the workhorses driving many familiar business applications. But they're limited by the available processors in our servers and by the amount of disk we can put in our data centers. As data grows past those limits, queries slow down, especially complex cross-table joins.

Hyperscale data on Azure

Microsoft has been expanding its existing Azure database services to deliver hyperscale versions of both Azure Database for PostgreSQL and Azure SQL. The PostgreSQL version builds on Microsoft's recent acquisition of Citus Data and its open source database scaling extension, Citus. Currently available only in preview, it horizontally scales your database across as many nodes as you need, parallelizing queries across hundreds of nodes with a single index. Similarly, Azure SQL Hyperscale separates storage and compute, with the option of rapid scaling via read replicas.

Setting up a hyperscale database isn't simply a matter of clicking a button on an Azure control panel, though that's clearly Microsoft's endgame. Some key roadblocks have been removed: If you're using Hyperscale PostgreSQL, for example, the Citus-based tool will automatically shard your data across the nodes you've configured. You still need to choose how many nodes will host your database and then run the configuration tools to create your tables and set up sharding; the default is 32 shards.
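To make the idea of automatic sharding concrete, here is a minimal Python sketch of hash-based sharding: a key is hashed, the hash space is cut into 32 equal ranges (matching the default shard count), and shards are placed on worker nodes. This is not Citus code. The crc32 hash, the round-robin placement, and the worker names are all assumptions made for the illustration.

```python
import zlib
from collections import Counter
from typing import Sequence

SHARD_COUNT = 32       # default shard count mentioned in the article
HASH_SPACE = 2 ** 32   # crc32 yields an unsigned 32-bit value


def shard_for(key: str, shard_count: int = SHARD_COUNT) -> int:
    """Map a distribution-column value to a shard index via equal hash ranges."""
    # Assumption: crc32 stands in for whatever hash the real system uses.
    h = zlib.crc32(key.encode("utf-8"))
    # Each shard owns one contiguous, equal-width slice of the hash space.
    return h * shard_count // HASH_SPACE


def node_for(shard: int, nodes: Sequence[str]) -> str:
    """Place shards on worker nodes round-robin (an illustrative policy)."""
    return nodes[shard % len(nodes)]


if __name__ == "__main__":
    workers = ["worker-0", "worker-1", "worker-2", "worker-3"]  # hypothetical
    placement = Counter(node_for(s, workers) for s in range(SHARD_COUNT))
    print(placement)  # how many shards each worker holds
    shard = shard_for("customer-42")
    print(shard, node_for(shard, workers))
```

With four workers, each ends up holding eight of the 32 shards, and the same key always lands on the same shard, which is what lets a query for one customer go to a single node.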

By sharding your database, each node in the cluster works with less data, so queries can be faster and less load-intensive. Working across a cluster allows the hyperscale PostgreSQL instance to parallelize queries, giving operations a further boost as results come in from each node to the coordinator node, which assembles the results and delivers them to your applications. More nodes can be added on the fly as your database expands. Microsoft's hyperscale databases aren't only for queries; they also support distributed transactions, so you can use them in production as well as for analytics.
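The fan-out and merge pattern described here can be sketched in a few lines of Python. In this toy version, each worker aggregates only its own rows and a central node (the coordinator, in Citus terms) merges the partial results. The in-memory data, worker names, and function names are hypothetical, and threads stand in for network calls to real nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "shards": each worker holds a slice of an orders table.
WORKER_SHARDS = {
    "worker-0": [{"region": "EU", "amount": 120}, {"region": "US", "amount": 80}],
    "worker-1": [{"region": "EU", "amount": 40}],
    "worker-2": [{"region": "US", "amount": 200}, {"region": "APAC", "amount": 60}],
}


def partial_sum_by_region(rows):
    """Runs on each worker: aggregate only the rows stored locally."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0) + row["amount"]
    return totals


def coordinator_query(shards):
    """Fan the query out to every worker, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(partial_sum_by_region, shards.values()))
    merged = {}
    for partial in partials:
        for region, subtotal in partial.items():
            merged[region] = merged.get(region, 0) + subtotal
    return merged


if __name__ == "__main__":
    print(coordinator_query(WORKER_SHARDS))
```

The point of the design is that each worker scans only its own slice, so the expensive part of the query runs in parallel and the central node handles only small partial results.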

Introducing Azure SQL Hyperscale

Azure SQL’s hyperscale tier is a new option for one of the oldest Azure platform services. Designed for databases with up to 100TB of storage, it adds Azure Blob-based snapshots for rapid backups and recovery. Microsoft reports the Azure SQL hyperscale recovery process takes a few minutes, rather than the hours or days you might expect from on-premises systems. It’s tuned for quick scaling, either scale-out by adding nodes or scale-up with additional compute resources.

Interestingly, Azure doesn't set a maximum size for Azure SQL Hyperscale instances, since they grow and shrink as needed. You can start with the database you need today, and the service will scale as your requirements change, with only minimal administrative overhead. Microsoft is focusing on OLTP (online transaction processing) workloads and on read-heavy ones like data marts, where data is written once and read many times as part of an analytic process.

Microsoft has made some fundamental changes to the Azure SQL architecture, separating query processing from storage management, with an external log server. The same storage management components are used in read-only replicas, making them easy to quickly copy across to new nodes. Compute nodes handle queries, with SSDs for page-level caching of data, keeping network traffic to a minimum. Each page server manages up to 1TB of data, delivering pages of data to the compute nodes on demand, with data itself stored in Azure storage.
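A small Python sketch can illustrate two points from this architecture: how many page servers a database of a given size needs at 1TB each, and how a local page cache on a compute node cuts requests to the page servers. The LRU eviction policy, the class, and the function names are assumptions made for the example, not a description of Microsoft's implementation.

```python
import math
from collections import OrderedDict

PAGE_SERVER_CAPACITY_TB = 1  # per the article: each page server manages up to 1TB


def page_servers_needed(database_size_tb: float) -> int:
    """Minimum number of page servers needed to cover a database of this size."""
    return max(1, math.ceil(database_size_tb / PAGE_SERVER_CAPACITY_TB))


class PageCache:
    """Small LRU cache standing in for a compute node's SSD page cache."""

    def __init__(self, capacity, fetch_page):
        self.capacity = capacity
        self.fetch_page = fetch_page  # callable simulating a page-server request
        self.pages = OrderedDict()
        self.remote_reads = 0

    def get(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)  # mark as most recently used
            return self.pages[page_id]
        self.remote_reads += 1               # cache miss: go to the page server
        data = self.fetch_page(page_id)
        self.pages[page_id] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)   # evict the least recently used page
        return data


if __name__ == "__main__":
    print(page_servers_needed(2.5))
    cache = PageCache(capacity=2, fetch_page=lambda pid: f"page-{pid}")
    for pid in [1, 2, 1, 3, 1]:
        cache.get(pid)
    print(cache.remote_reads)  # fewer remote reads than total requests
```

In the usage example, five page requests result in only three trips to the simulated page server, because the frequently used page stays in the local cache.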

Setting up a new hyperscale Azure SQL instance is a matter of a single line of T-SQL code in an Azure SQL command line. You can also convert an existing database to hyperscale, but the migration is one-way, so it's best tried first on a copy of the database.
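For reference, the statements involved look roughly like the ones produced by this Python sketch, which only builds the T-SQL text and does not connect to anything. The database names are hypothetical, and `HS_Gen5_4` is used as one example service objective; check the current Azure documentation for the options available to you before running anything like this.

```python
import re


def quote_identifier(name: str) -> str:
    """Bracket-quote a T-SQL identifier, escaping any closing brackets."""
    return "[" + name.replace("]", "]]") + "]"


def _check_objective(service_objective: str) -> str:
    """Allow only simple Hyperscale-style names, since the value goes in a literal."""
    if not re.fullmatch(r"HS_[A-Za-z0-9_]+", service_objective):
        raise ValueError(f"unexpected service objective: {service_objective!r}")
    return service_objective


def create_hyperscale_sql(db_name: str, service_objective: str = "HS_Gen5_4") -> str:
    """Build the statement that creates a new Hyperscale database."""
    return (
        f"CREATE DATABASE {quote_identifier(db_name)} "
        f"(EDITION = 'Hyperscale', SERVICE_OBJECTIVE = '{_check_objective(service_objective)}');"
    )


def migrate_to_hyperscale_sql(db_name: str, service_objective: str = "HS_Gen5_4") -> str:
    """Build the statement for the one-way move of an existing database."""
    return (
        f"ALTER DATABASE {quote_identifier(db_name)} "
        f"MODIFY (EDITION = 'Hyperscale', SERVICE_OBJECTIVE = '{_check_objective(service_objective)}');"
    )


if __name__ == "__main__":
    print(create_hyperscale_sql("SalesHyperscale"))   # hypothetical database name
    print(migrate_to_hyperscale_sql("SalesCopy"))     # run against a copy first
```

Because the migration cannot be reversed, the second statement is the one to rehearse on a copy before pointing it at a production database.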

Both PostgreSQL and Azure SQL's hyperscale services are priced on a consumption basis, with separate compute and storage charges. Neither is particularly cheap, so consider carefully what you'll use them for and how you'll use them. If you're using them for analytics, one option is to run compute nodes only when necessary, so your biggest cost will be the monthly data storage fees.
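The trade-off between always-on and on-demand compute is simple arithmetic, sketched below in Python. The rates in the example are made-up placeholders, not Azure prices, and the function and parameter names are hypothetical; substitute figures from the Azure pricing pages for a real estimate.

```python
def monthly_cost(compute_rate_per_hour, compute_hours,
                 storage_rate_per_gb_month, storage_gb):
    """Return (compute, storage, total) for one month under consumption pricing."""
    compute = compute_rate_per_hour * compute_hours
    storage = storage_rate_per_gb_month * storage_gb
    return compute, storage, compute + storage


if __name__ == "__main__":
    # Placeholder rates for illustration only; these are NOT real Azure prices.
    RATE_PER_HOUR = 2.0
    RATE_PER_GB_MONTH = 0.25
    STORAGE_GB = 4000

    always_on = monthly_cost(RATE_PER_HOUR, 730, RATE_PER_GB_MONTH, STORAGE_GB)
    on_demand = monthly_cost(RATE_PER_HOUR, 40, RATE_PER_GB_MONTH, STORAGE_GB)
    print("always on:", always_on)
    print("on demand:", on_demand)
```

With these placeholder numbers, cutting compute from a full month (about 730 hours) to 40 hours leaves storage as the larger share of the bill, which is the situation the paragraph above describes.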

Hyperscale databases are a logical development of the cloud, offering services that would be prohibitively complex on premises. By making it a matter of a command line call, Azure has put these capabilities into the hands of any developer, and support for Azure SQL and PostgreSQL gives you a choice of databases. Choice and simplicity are clearly the watchwords here, and it’s going to be interesting to see how these new Azure services are used—and if they justify Microsoft’s investment in Citus.

Copyright © 2019 IDG Communications, Inc.