Machine learning: The deplorable state of deployment

A failed standard, complex alternatives -- and a way forward.

We’ve gotten pretty good at building machine learning models. From legacy platforms like SAS to modern MPP databases and Hadoop clusters, if you want to train up regression or classification models, you're in great shape.

In contrast, deploying those models is a face-meltingly painful experience, even though machine learning models are useful to a business mainly insofar as they're deployed into operational systems that influence the business's behavior.

Think of self-driving cars. Teams of engineers and scientists at companies like Tesla and Google have worked for years to train models for lane maintenance and collision avoidance using a broad array of machine learning techniques. Ultimately, though, engineers deploy those models into thousands of cars that can then react to real-time conditions on the road. Without that deployment step, the extensive efforts of those engineers and scientists would have little real-world value.

So how did we arrive at such a striking disparity between training and deployment in enterprise data science?

There are a few causes. The first is a simple dependency: You need to build a model before you can deploy it, so naturally tools for building will precede those for deploying. Second, the execution demands of deployment (typically real-time inference on streams of data points) are quite different from the demands of training (typically batch jobs on massive historical data sets). Third, deployment systems benefit from a single representation of their compute tasks, whereas data scientists prefer to use a diversity of tools, such as R, Python, SAS and Spark -- whatever is best for the job.

It's these latter two that are most pernicious. The culture of machine learning in the enterprise is centered around data scientists, but their needs are antithetical to those of operations engineers. In practice, this has led to two common approaches to machine learning deployment, both of which have substantial drawbacks.

Fortunately, there is a new approach that bridges this divide much more gracefully, and I'd like to advocate for it after discussing the failings of the prevailing methodologies.

The first common approach to deploying machine learning is to lower the friction of deploying data scientists' code directly. This approach is exemplified by technologies like Microsoft's DeployR or SAS's Model Manager, or, more recently, deploying Python or R runtimes in Docker containers.

Unfortunately, this deployment methodology is specific to the language or platform of origin, so each new language or platform a data scientist adopts requires a new, distinct path to production. In some cases it may be possible to standardize an entire data science team on one set of tools, but the cost of doing so is substantial in terms of recruitment and retention -- show me a talented data scientist who doesn't like to explore and leverage new technologies and techniques.

The second approach is to have a dedicated translational engineering team that takes models written in a variety of source languages and converts them into a standard production language and execution environment. This solves the homogeneity problem for ops and leaves data scientists free to use the tools of their choice, but it moves the burden to the step in between: The translation process itself is costly, time-consuming and error-prone.

Translation engineers need to be versed in both the source and target languages, understand data science fundamentals, and, most importantly, avoid introducing inaccuracies in the translation. It's notoriously challenging to quickly understand someone else's code, not to mention write error-free code in the first place. In most cases I've seen, this translation step takes months, from final data science model to deployment.

In the world of computer science there's a pretty standard solution to problems with this structure: Introduce an intermediate representation. Unfortunately, the intermediate representation of record in machine learning, PMML, is insufficient for the task at hand.

Initially developed in the late '90s, PMML is an XML-based, industry-standard format for trained machine learning models. It accomplishes its goal of representing common modeling and data preprocessing techniques in a language- and execution-agnostic fashion, but it fails as a uniform deployment methodology due to its lack of expressiveness. PMML can represent only the model types that are baked into the standard itself.

Modern production machine learning systems do often contain common inference and preprocessing techniques that PMML can represent, but they also contain data transformation and feature engineering steps that are particular to the problem at hand, and modeling approaches that have not yet been canonized in the PMML standard. Thus we commonly see PMML-based deployment strategies that are decorated with Python scripts or other kludges that cover those functional gaps but detract from the raison d’être of the intermediate representation: a single description of the computation that ops needs to manage in production.
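To make the limitation concrete, here is a minimal sketch of what a fixed-vocabulary model format looks like in practice. The XML is a simplified fragment in the style of a PMML regression model (namespace, header and data dictionary omitted), and the score function is a hypothetical hand-written scorer for illustration, not part of any PMML library.

```python
import xml.etree.ElementTree as ET

# Simplified PMML-style fragment: a linear regression described purely
# through elements the standard defines (assumed names, trimmed for brevity).
PMML_FRAGMENT = """
<RegressionModel functionName="regression">
  <RegressionTable intercept="1.5">
    <NumericPredictor name="age" coefficient="0.5"/>
    <NumericPredictor name="tenure" coefficient="0.25"/>
  </RegressionTable>
</RegressionModel>
"""


def score(fragment, record):
    """Hypothetical scorer: intercept plus coefficient * value per predictor."""
    table = ET.fromstring(fragment.strip()).find("RegressionTable")
    total = float(table.get("intercept"))
    for predictor in table.findall("NumericPredictor"):
        total += float(predictor.get("coefficient")) * record[predictor.get("name")]
    return total


result = score(PMML_FRAGMENT, {"age": 40, "tenure": 8})
print(result)
```

The scorer only knows what to do with elements it has been told about. A problem-specific feature, say a custom ratio of two fields computed before scoring, has no element to live in, which is how the side scripts described above creep in.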

Because of its restriction to a canned set of functionality, PMML is a failed standard. Happily, the Data Mining Group, the standards body behind PMML, has drafted a successor standard called the Portable Format for Analytics (PFA) that addresses several of PMML's shortcomings.

First and foremost, it can express any computation -- from ETL routines to business rules to model inference. Second, it provides state (essentially a modifiable context) for each engine, so one can alter the behavior of an inference function based on the current value of, say, a market volatility index. Finally, it uses the more human-friendly JSON and YAML formats for serialization and has some strong early support from the Open Data Group.
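As a rough illustration of the first two points, the sketch below embeds a small PFA-style JSON document and runs it with a toy evaluator. The document layout (input, output, cells, action) follows the PFA specification as I understand it; the ToyEngine class is entirely hypothetical, supports only two arithmetic operators, and is not the API of any real PFA implementation.

```python
import json

# A PFA-style document: the action is an expression tree, and "cells"
# hold named state that the action can read.
PFA_DOC = json.loads("""
{
  "input": "double",
  "output": "double",
  "cells": {"volatility": {"type": "double", "init": 1.0}},
  "action": [
    {"*": ["input", {"cell": "volatility"}]}
  ]
}
""")


class ToyEngine:
    """Hypothetical evaluator for a tiny subset of PFA-style expressions."""

    def __init__(self, doc):
        self.action = doc["action"]
        # Initialise each cell from its declared "init" value.
        self.cells = {name: spec["init"] for name, spec in doc.get("cells", {}).items()}

    def _eval(self, expr, datum):
        if expr == "input":
            return datum
        if isinstance(expr, (int, float)):
            return expr
        if isinstance(expr, dict) and "cell" in expr:
            return self.cells[expr["cell"]]
        if isinstance(expr, dict) and len(expr) == 1:
            (op, args), = expr.items()
            left, right = (self._eval(arg, datum) for arg in args)
            if op == "+":
                return left + right
            if op == "*":
                return left * right
        raise ValueError(f"unsupported expression: {expr!r}")

    def run(self, datum):
        # The value of the last expression in the action is the output.
        result = None
        for expr in self.action:
            result = self._eval(expr, datum)
        return result


engine = ToyEngine(PFA_DOC)
before = engine.run(10.0)
engine.cells["volatility"] = 2.5  # stand-in for a real engine's state update
after = engine.run(10.0)
print(before, after)
```

The same document produces different outputs once the cell changes, which is the kind of state-dependent behavior the volatility example describes. Assigning to engine.cells directly is a shortcut for this sketch; how state is actually updated depends on the PFA engine in use.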

I believe an approach to machine learning deployment that's based on an industry standard, language-agnostic, and able to represent a broad range of algorithms is the clear path forward. And it's urgent -- if the work of data scientists never gets deployed into operational processes, it will deliver little value and reinforce the nascent sense of disillusionment in the market concerning data science and machine learning.

Copyright © 2016 IDG Communications, Inc.