R and Python drive SQL Server 2017 into machine learning

Microsoft expands R support and adds Python for developers who aren’t also data scientists

R and Python drive SQL Server 2017 into machine learning
Credit: Blackberry QNX

Microsoft last week announced a wave of new features for its data platform, along with the SQL Server 2017 name and what Microsoft calls a “production quality” beta release. Other important changes include a new containerized deployment model for databases, which simplifies installation on Windows and Linux.

But it was SQL Server’s new machine learning tools that grabbed my attention.

Machine learning remains one of Microsoft’s big themes for 2017, and it’s an important segment of SQL Server 2017. Mixing code and data has always been part of SQL Server, first with T-SQL, then with the Azure-focused U-SQL, which extended T-SQL with C# elements. SQL Server 2016 added support for embedded R code, and SQL Server 2017 continues that evolution by improving its support for R and adding Python. (By renaming SQL Server 2016’s R Services to Machine Learning Services in SQL Server 2017, Microsoft has made clear where it’s aiming its SQL tools.)

Including R and Python in SQL Server works well for both the existing SQL Server audience and for data scientists who are unlikely to have experience with T-SQL. The two languages have become important data science tools, with statistical analysis baked deep into their DNA. R remains clearly focused on statistical analysis, while Python adds statistical tools to a popular and flexible scripting language.

With R aimed at statistical analysis experts, Python is perhaps the easiest on-ramp to analytical programming for the rest of us, especially with a wide choice of relevant packages that add new data analysis features to a familiar language.

With Python inside SQL Server, you can bring existing data and code together. Data is accessible directly, so there’s no need to extract query data sets, moving from storage to application. It’s a useful approach, especially where there are issues of data sovereignty and compliance. Your code runs inside the SQL Server security boundaries, triggered by a single call from T-SQL stored procedures.

Installing the Python options adds not only a Pythion interpreter, but also several commonly used machine learning tools, including a subset of the Anaconda Python-based data science tools and Microsoft’s own RevoScalePy package.

Anaconda is an interesting choice because it comes from a big data background. Designed for working with Hadoop and Amazon S3 clusters, it includes packages for visualization and machine learning. There’s a lot in it, probably more than you’re going to initially use inside SQL Server. Because it’s one of the more popular big data tool sets for Python, its inclusion lets data scientists quickly bring their existing skills (in concert with Hadoop) and code to SQL Server.

If you want or need extras, use Python’s built-in package-management tools to download more. You can even opt for SQL Server’s Python tools to work large-scale open source machine learning packages like Microsoft’s Cognitive Toolkit and Google’s TensorFlow, adding GPU compute to the mix.

Running Python code inside SQL Server lets you take advantage of numerous Microsoft performance and scaling features, with direct access to its in-memory database features, speeding up OLAP queries. Because the code runs as stored procedures, database developers don’t have to be Python experts; they simply bring existing code into the database and let data scientists do what they’re best at while ensuring the data remains secure.

To get started, developers can work with a set of data extracts. Once written, the same code can run locally, in on-premises databases, and in the cloud.

Adding support for both languages in its main data platform is a logical move for Microsoft. Because it runs both on-premises and in the cloud (and now on Linux and MacOS), SQL Server can work with not only the traditional big data sources, but with all your data. Because it builds on existing R support that arrived with SQL Server 2016, SQL Server developers and admins aren’t starting from scratch.

Another, more commercial aspect of these new capabilities is giving databases feature parity with statistical analysis tooling like SAS. Microsoft wants you to ask: Why should you install a hefty and expensive third-party system, when all you need is already in your database? Especially when you add further intelligence via templated solutions? The beta version of SQL Server 2017 is a taste of what Microsoft hopes to bring to answer that question in its favor.