With the latest preview of Microsoft SQL Server 2016, Microsoft tipped its hand about its plans to integrate the R language with its products.
Joseph Sirosh, Corporate Vice President of the Data Group at Microsoft, outlined in a blog post the major features unveiled in SQL Server 2016 in its latest Community Technology Preview (CTP). Most of the discussion referred back to earlier announcements about the capabilities planned for the product, but among the genuinely new announcements is SQL Server R Services.
R's reputation as a powerful framework for math and statistics has grown in recent years, concomitant with the explosion of popular interest in data science. Microsoft intends to allow R to be used in SQL Server "to build intelligent applications and discover new insights about your business," since the product "enables intelligent applications to be built by hosting analytical models in the database, while reducing complexity and overall costs by moving expensive analytic computations close to the data."
Though it sounds great, there's a potential wrinkle. It's been generally assumed Microsoft would run R through its acquisition of Revolution Analytics, makers of a widely used implementation of the language. The product is licensed under the GPLv2, so it'll be hard to integrate directly with Microsoft's product line. Maybe R could be rewritten from the ground up or at least integrated into Azure as a service. But how -- if at all -- would it become part of on-prem products like SQL Server?
The easy way to do so and still respect the terms of the GPL is via an "arm's length" provision. Instead of integrating the R binaries directly into SQL Server, Microsoft could invoke the R software from its own executable as needed, then return results to the host application through a file or a data stream.
It's easy to see how such a transaction would impact performance, although "fast predictive in-database analytics" is one of the claims made by Microsoft for the Advanced Analytics feature set in SQL Server.
When asked about these issues, a Microsoft spokesperson provided details over email: "Microsoft allows customers to download RRO, which is an enhanced distribution of Open Source R. The new extensibility mechanism in SQL Server 2016 allows for secure execution of the R runtime provided with RRO as well as additional R packages provided with RRE for improved performance and scale."
What would this mean for performance? "It depends on the load," Microsoft replied, "which is why it’s in CTP 3.0 for testing/optimization right now." R in its current incarnation poses "certain challenges when it comes to performance and scale," in part because the open source incarnation of R is currently single-threaded and can only accommodate data sets that fit in memory.
Some of this, Microsoft said, can be offset by using SQL Server's in-memory engine for more common work and offloading actual statistics work to R. But Microsoft also plans to offer more native R functionality in SQL Server over time.
Many of the other features in SQL Server 2016 mesh with Microsoft's overall plan to make its on-premises products into complementary extensions of its cloud, and vice versa. The previously announced Stretch Database feature is an obvious example. Databases hosted in a local SQL Server 2016 instance can be extended into Azure -- but transparently, without the need for an application to be aware of it, and with the database admin controlling the amount of data flowing into Azure.
Likewise with R, Microsoft wants to offer a mix of on-prem and Azure-based solutions, each complementing a different use case. Microsoft has already unveiled R in Azure. On-prem users of SQL Server Enterprise Edition, though, could use RRO or the higher-performing, better-scaling RevoScaleR package, which according to Microsoft "includes implementations of common R functions that were designed for scale."
What's also clear: R within SQL Server, and Microsoft generally, is still a work in progress. A big part of Microsoft's motivation for introducing R as part of the CTP was for "fine-tuning and optimizing performance," as it "[expects] to introduce improvements between SQL Server 2016 CTP 3.0 and GA."