Data in, intelligence out: Machine learning pipelines demystified

Data plus algorithms equals machine learning, but how does that all unfold? Let’s lift the lid on the way those pieces fit together, beginning to end.

It’s tempting to think of machine learning as a magic black box. In goes the data; out come predictions. But there’s no magic in there—just data and algorithms, and models created by processing the data through the algorithms.

If you’re in the business of deriving actionable insights from data through machine learning, it helps for the process not to be a black box. The more you understand what’s inside the box, the better you’ll grasp how data is transformed into predictions at each step, and the more powerful your predictions can be.

Devops people speak of “build pipelines” to describe how software is taken from source code to deployment. Just as developers have a pipeline for code, data scientists have a pipeline for data as it flows through their machine learning solutions. Mastering how that pipeline comes together is a powerful way to know machine learning itself from the inside out.

Data sources and ingestion for machine learning

The machine learning pipeline consists of four phases, as described by Wikibon Research analyst George Gilbert:

  1. ingesting data
  2. preparing data (including data exploration and governance)
  3. training models
  4. serving predictions
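
As a rough illustration, the four phases above can be sketched as a chain of plain Python functions. Everything here is hypothetical: the function names, the hard-coded records standing in for a real data source, and the least-squares line standing in for a real model are assumptions made for the sake of a small runnable example.

```python
from statistics import mean

def ingest():
    """Phase 1: return raw (x, y) records. Hard-coded here in place of a real source."""
    return [(1.0, 2.1), (2.0, 3.9), (None, 5.0), (3.0, 6.2), (4.0, 7.8)]

def prepare(records):
    """Phase 2: drop records with missing values."""
    return [(x, y) for x, y in records if x is not None and y is not None]

def train(records):
    """Phase 3: fit y = slope * x + intercept by ordinary least squares."""
    xs = [x for x, _ in records]
    ys = [y for _, y in records]
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in records)
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return {"slope": slope, "intercept": y_bar - slope * x_bar}

def serve(model, x):
    """Phase 4: produce a prediction for a new input."""
    return model["slope"] * x + model["intercept"]

model = train(prepare(ingest()))
prediction = serve(model, 5.0)
```

With these made-up records, the prediction for an input of 5.0 comes out at roughly 9.85. In a real pipeline each function would be a separate system, but the hand-off from one phase to the next follows the same shape.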

A machine learning pipeline needs to start with two things: data to be trained on, and algorithms to perform the training. Odds are the data will come in one of two forms:

  1. Live data you’re already collecting and aggregating somewhere, which you plan on making regularly updated predictions with.
  2. A “frozen” data set, something you’re downloading and using as is, or deriving from an existing data source via an ETL operation.

With frozen data, you generally perform only one kind of processing: You train a model with it, deploy the model, and depending on your needs, you update the model periodically, if at all.

But with live or “streamed” data, you have two choices regarding how to produce models and results from the data. The first option is to save the data somewhere—a database, a “data lake”—and perform analytics on it later. The second option is to train models on streamed data as the data comes in.

Training on streamed data also takes two forms, as described by Charles Parker of machine learning solution provider BigML. One scenario is when you’re feeding a regular flow of fresh data to the model to make predictions, but you’re not adjusting the underlying model all that much. The other scenario is when you’re using the fresh data to train entirely new models every so often, because older data isn’t as relevant.

This is why choosing your algorithms early on is important. Some algorithms support incremental retraining, while others have to be retrained from scratch with the new data. If you will be streaming in fresh data all the time to retrain your models, you want to go with an algorithm that supports incremental retraining. Spark’s MLlib library, for example, includes streaming variants of algorithms such as k-means and linear regression that update the model as new data arrives through Spark Streaming.
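
To make the distinction concrete, here is a minimal sketch of incremental retraining, assuming a one-variable linear model updated by stochastic gradient descent. The class name, the `partial_fit` method name, and the synthetic stream are illustrative assumptions, not the API of Spark or any other framework mentioned here.

```python
class OnlineLinearModel:
    """One-variable linear regressor that learns from one mini-batch at a time."""

    def __init__(self, learning_rate=0.05):
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def partial_fit(self, batch):
        """Adjust the existing weights using only the new batch; old data is not revisited."""
        for x, y in batch:
            error = (self.w * x + self.b) - y
            self.w -= self.learning_rate * error * x
            self.b -= self.learning_rate * error
        return self

    def predict(self, x):
        return self.w * x + self.b

def stream_batches(count=500):
    """Stand-in for a live feed: every batch follows y = 2x + 1."""
    for _ in range(count):
        yield [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0)]

model = OnlineLinearModel()
for batch in stream_batches():
    model.partial_fit(batch)
```

An algorithm without an incremental update of this kind would instead have to keep every batch it has seen and refit on the whole collection each time new data arrives.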

Data preparation for machine learning

Once you have a data source to train on, the next step is to ensure it can be used for training. The catchall term for ensuring consistency in the data to be used is normalization.

Real-world data can be noisy. If the data is drawn from a database, you can assume a certain amount of normalization in the data. But many machine learning applications may also draw data straight from data lakes or other heterogeneous sources, where the data isn’t necessarily normalized for production use.

Sebastian Raschka, author of Python Machine Learning, has written in detail about normalization, and how to achieve it for some common types of data sets. The examples he uses are Python-centric, but the basic concepts can be applied universally.
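
Two widely used ways to bring a numeric column onto a common scale are min-max scaling and z-score standardization. The sketch below shows both using only the Python standard library; the function names and the income figures are made up for illustration and are not taken from Raschka’s examples.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # a constant column carries no scale information
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

incomes = [30_000, 45_000, 60_000, 120_000]  # hypothetical data
scaled = min_max_scale(incomes)
standardized = standardize(incomes)
```

Min-max scaling keeps the shape of the distribution but is sensitive to outliers such as the 120,000 figure here, while standardization expresses each value as a distance from the mean.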

Some environments for machine learning make normalization an explicit step. For instance, Microsoft’s Azure Machine Learning Studio has a discrete “Normalize Data” module that can be added to a given data experiment.

Is normalization always required? Not always, says Franck Dernoncourt, a PhD candidate in AI at MIT, in a detailed exploration of the subject on Stack Overflow. But as he puts it, “It rarely hurts.” The important thing, he points out, is to know what normalization buys you for the specific use case. For artificial neural networks, normalization isn’t needed but can be useful. For building models with a K-means clustering algorithm, however, normalization is vital.
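
One way to see why distance-based algorithms such as K-means are so sensitive to scale is to compare Euclidean distances before and after normalization. The sketch below uses invented age and income figures and a hypothetical `min_max_columns` helper. It shows the feature with the largest raw numbers dominating the distance until each column is rescaled.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_columns(rows):
    """Rescale every column of a table to the range 0..1."""
    scaled_columns = []
    for column in zip(*rows):
        lo, hi = min(column), max(column)
        span = (hi - lo) or 1.0  # avoid dividing by zero on a constant column
        scaled_columns.append([(v - lo) / span for v in column])
    return [tuple(row) for row in zip(*scaled_columns)]

# Hypothetical customers: (age in years, income in dollars)
people = [(25, 50_000), (65, 51_000), (27, 54_000), (50, 90_000)]

# Who is closest to the first customer: the 65-year-old or the 27-year-old?
raw_nearest = min((1, 2), key=lambda i: euclidean(people[0], people[i]))
scaled = min_max_columns(people)
scaled_nearest = min((1, 2), key=lambda i: euclidean(scaled[0], scaled[i]))
```

On the raw numbers the 65-year-old (index 1) looks nearest, because a 1,000-dollar income gap is small next to a 4,000-dollar one and the 40-year age gap barely registers. After scaling, the 27-year-old (index 2) is nearest.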

One area where normalization is undesirable is when “the scale of the data has significance,” according to Malik Magdon-Ismail, co-author of Learning from Data. An example: “If income is twice as important as debt in credit approval, it is appropriate for income to have twice the size as debt.”

Something else to be conscious of during the data intake and preparation phase is how biases can be introduced into a model by way of the data, its normalization, or both. Biases in machine learning have real-world consequences; it helps to know how to find and defeat such bias where it might exist. Never assume that clean (readable, consistent) data is unbiased data.

Training machine learning models

Once you have your data set established, next comes the training process, where the data is used to generate a model from which predictions can be made. You will generally try many different algorithms before finding the one that performs best with your data.

I mentioned earlier that your choice of algorithm will depend not only on the type of problem being solved, but also on whether you want models that are trained all at once on a batch of data or models that are retrained incrementally. Another key aspect of training models is how to tune the training to improve the accuracy of the resulting model, a process known as hyperparameter tuning.

A hyperparameter for a machine learning model is a setting that governs how the resulting model is produced from the algorithm. The K-means clustering algorithm, for example, organizes data into groups based on similarities in the data. So, one hyperparameter for a K-means algorithm would be the number of clusters to search for.
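
Here is a small sketch of that idea: a one-dimensional K-means (Lloyd’s algorithm) in which `k`, the number of clusters, is passed in by the caller rather than learned from the data. The deterministic initialization and the sample data are simplifying assumptions for the example. Real implementations typically use randomized or smarter seeding.

```python
def kmeans_1d(points, k, iterations=20):
    """Cluster 1-D points into k groups and return the sorted cluster centers."""
    ordered = sorted(points)
    # Assumed initialization: k evenly spaced points from the sorted data.
    step = max(k - 1, 1)
    centroids = [ordered[(i * (len(ordered) - 1)) // step] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # Move each center to the mean of its members; keep it if the cluster is empty.
        centroids = [
            sum(members) / len(members) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7, 9.0, 9.1, 8.9]
two_clusters = kmeans_1d(data, k=2)
three_clusters = kmeans_1d(data, k=3)
```

The same data yields centers near 1, 5, and 9 when `k=3`, but with `k=2` the middle and upper groups are merged into one cluster centered near 7. The algorithm has no way to know which answer you wanted, which is what makes `k` a hyperparameter.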

Generally, the best choices for a hyperparameter come from having experience with the algorithm. Sometimes you need to try out a few variations and see which ones yield workable results for your problem set. That said, for some algorithm implementations, it’s becoming possible to automatically tune hyperparameters.
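
The simplest form of automated tuning is a grid search: train with each candidate value, score each result on held-out data, and keep the best. The sketch below does this for the number of neighbors in a tiny nearest-neighbor classifier. The classifier, the data (which includes one deliberately mislabeled point), and the function names are all assumptions made for illustration, not the interface of any tuning tool.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Label x by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda item: abs(item[0] - x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def accuracy(train, validation, k):
    """Fraction of validation points labeled correctly for a given k."""
    hits = sum(knn_predict(train, x, k) == label for x, label in validation)
    return hits / len(validation)

def grid_search(train, validation, candidates):
    """Score every candidate value of k and return the best one with all scores."""
    scores = {k: accuracy(train, validation, k) for k in candidates}
    return max(scores, key=scores.get), scores

# Hypothetical data: class "a" sits low, class "b" sits high, and (3.5, "b") is noise.
train = [(1, "a"), (2, "a"), (3, "a"), (3.5, "b"), (4, "a"),
         (6, "b"), (7, "b"), (8, "b"), (9, "b")]
validation = [(3.4, "a"), (3.6, "a"), (2.5, "a"), (6.5, "b"), (8.5, "b")]

best_k, scores = grid_search(train, validation, candidates=[1, 3, 9])
```

Here `k=1` is thrown off by the noisy point and `k=9` lets the larger class outvote everything, so the search settles on `k=3`. Automated tuners refine this basic loop with smarter ways of choosing which values to try next.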

The Ray framework for machine learning, for example, has a hyperparameter optimization feature. Google Cloud ML Engine offers hyperparameter tuning options for training jobs. And a package named FAR-HO provides hyperparameter optimization tools for TensorFlow.

Many of the libraries for model training can take advantage of parallelism, which speeds the training process by distributing computation across multiple CPUs, GPUs, or nodes. If you’ve got access to the hardware to train in parallel, use it. Speedups can be close to linear with each additional computing device, although communication overhead between devices usually keeps them somewhat below that.

Parallel training may be supported by the machine learning framework you’re using to perform the training. The MXNet library, for example, lets you train models in parallel. MXNet also supports both of the key methodologies for parallelizing training, data parallelism and model parallelism.

Alex Krizhevsky, a member of the Google Brain Team, explained the differences between data parallelism and model parallelism in a paper about parallelizing network training. With data parallelism, “different workers train [models] on different data examples … [but] must synchronize model parameters (or parameter gradients) to ensure they are training a consistent model.” In other words, while you may split the data to train across multiple devices, you must keep the models produced by each node in sync with one another so they don’t yield markedly different prediction results. TensorFlow can be used in this fashion, with different strategies for synchronizing data between nodes.
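
A toy version of synchronous data parallelism might look like the sketch below. The “workers” are simulated in a single process, the model is a single weight, and the synchronization step is a plain average of gradients. All of those are simplifying assumptions, not a description of how TensorFlow or any other framework implements it.

```python
def gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard of data."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, learning_rate):
    """One training step: local gradients per worker, then a synchronized update."""
    # Each worker computes a gradient on its own shard (run in sequence here).
    local_gradients = [gradient(w, shard) for shard in shards]
    # Synchronization: average the gradients so every worker applies the same update.
    synced = sum(local_gradients) / len(local_gradients)
    return w - learning_rate * synced

data = [(float(x), 3.0 * x) for x in range(1, 9)]  # synthetic data with true weight 3.0
shards = [data[0:4], data[4:8]]                    # two equal shards, one per worker

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, learning_rate=0.01)
```

Because the shards are the same size, the averaged gradient equals the gradient over the full data set, so the workers collectively take the same step a single machine would have taken.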

With model parallelism, “different workers train different parts of the model,” but workers have to stay in sync whenever “the model part … trained by one worker requires output from a model part trained by another worker.” This approach is typically used when training a model involves multiple layers that feed into each other, such as a recurrent neural network.
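
The dependency between workers can be sketched with a two-layer model in which each layer lives on its own simulated worker. The `LayerWorker` class, the scalar weights, and the single training example are hypothetical simplifications. The point is only to show where one worker has to wait on another.

```python
class LayerWorker:
    """Owns one layer's weight; stands in for a device holding part of the model."""

    def __init__(self, weight):
        self.weight = weight
        self.last_input = None

    def forward(self, value):
        self.last_input = value
        return self.weight * value

    def backward(self, upstream_grad, learning_rate):
        # Compute the gradient for the previous worker before changing our own weight.
        grad_for_previous = upstream_grad * self.weight
        self.weight -= learning_rate * upstream_grad * self.last_input
        return grad_for_previous

def train_step(workers, x, target, learning_rate=0.01):
    activation = x
    for worker in workers:            # forward: each worker needs the previous output
        activation = worker.forward(activation)
    loss = (activation - target) ** 2
    grad = 2 * (activation - target)
    for worker in reversed(workers):  # backward: gradients flow back across workers
        grad = worker.backward(grad, learning_rate)
    return loss

workers = [LayerWorker(1.0), LayerWorker(1.0)]
losses = [train_step(workers, x=1.0, target=6.0) for _ in range(500)]
```

Neither worker ever sees the other’s weight. They exchange only activations on the way forward and gradients on the way back, which is exactly the traffic that has to be kept in sync.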

It’s worth learning how to assemble pipelines using both of these approaches, because many frameworks, such as the Torch framework, now support both.

Deploying machine learning models

The last phase in the pipeline is deploying the trained model, or the “predict and serve” phase, as Gilbert puts it in his paper “Machine Learning Pipeline: Chinese Menu of Building Blocks.” This is where the trained model is run against incoming data to generate a prediction. For a face-recognition system, for example, the incoming data could be a headshot or a selfie, with predictions made from a model derived from other face photos.

Cloud deployment

Where and how this prediction is served constitutes another part of the pipeline. The most common scenario is providing predictions from a cloud instance by way of a RESTful API. All the obvious advantages of serving from the cloud come into play here. You can spin up more instances to satisfy demand, for example.
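
As a rough sketch of what such an endpoint involves, the following uses Python’s built-in `http.server` module to expose a model over HTTP. The `/predict` path, the JSON request shape, and the hard-coded `MODEL` are assumptions for illustration. A production service would sit behind a proper web framework and load a real trained model.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

MODEL = {"slope": 2.0, "intercept": 1.0}  # hypothetical pre-trained model

def predict(payload):
    """Turn a request payload such as {"x": 3} into a JSON-serializable prediction."""
    return {"prediction": MODEL["slope"] * float(payload["x"]) + MODEL["intercept"]}

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            result = predict(json.loads(self.rfile.read(length)))
        except (ValueError, KeyError, TypeError):
            self.send_error(400, "expected a JSON body with a numeric x")
            return
        body = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def make_server(port=8000):
    """Build the server; call serve_forever() on the result to start handling requests."""
    return HTTPServer(("127.0.0.1", port), PredictHandler)
```

Calling `make_server().serve_forever()` starts the service, after which a client can POST `{"x": 3}` to `/predict` and receive a JSON prediction in return. Keeping `predict` separate from the handler makes the model logic easy to test without a network.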

With a cloud-hosted model, you can also keep more of the pipeline in the same place—training data, trained models, and the prediction infrastructure. Data doesn’t have to be moved around as much, so everything is faster. Incremental retraining of a model can be done more quickly, because the model can be retrained and deployed in the same environment.

Client device deployment

Sometimes it makes sense to deploy a model on a client and serve predictions from there. Good candidates for this approach are mobile apps, where bandwidth is at a premium, and any app where a network connection isn’t guaranteed or reliable.

One caveat is that the quality of predictions made on a local machine may be lower. The deployed model may have to be smaller because of local storage constraints, and that might in turn affect prediction quality. That said, it’s becoming more feasible to deploy highly accurate models on modest devices like smartphones, mostly by way of a slight trade-off of accuracy for speed. It’s worth taking a look at the application in question and seeing if a locally deployed model, refreshed periodically, delivers acceptable accuracy. If so, then the app can serve predictions even when there’s no data connection.
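
One common way to shrink a model for a device, used here purely as an illustration of the size-versus-accuracy trade-off, is to store its weights as small integers instead of floating-point numbers. The sketch below assumes a simple symmetric 8-bit scheme with a single scale factor. The weights are invented, and real toolkits use more elaborate variants.

```python
def quantize(weights):
    """Map float weights to integers in [-127, 127] plus one shared float scale."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127 if largest else 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]

weights = [0.82, -1.27, 0.031, 0.5, -0.004]  # hypothetical model weights
quantized, scale = quantize(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now fits in one byte instead of four or eight, at the cost of a rounding error of at most half the scale factor per weight. Whether that error matters is something to measure against the application’s accuracy requirements.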

Deploying models to a client points to another stumbling block. Because you can deploy models in so many places, the deployment process can be complex. There’s no consistent path from any one trained model to any one target hardware, OS, or application platform. This complexity is not likely to go away anytime soon, although the pressure to find a consistent deployment pipeline will undoubtedly increase, thanks to the growing practice of developing apps using machine learning models.

The machine learning pipeline today and tomorrow

The term pipeline implies a one-way, unbroken flow from one end to another. In reality, the machine learning flow is more cyclical: Data comes in, it is used to train a model, and then the accuracy of that model is assessed and the model is retrained as new data arrives and the meaning of that data evolves.
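
That cycle can be sketched as a loop that watches how far each new observation strays from the current model and retrains when the gap grows too large. The “model” here is just the mean of a sliding window, and the tolerance, window size, and data stream are invented for the example.

```python
from statistics import mean

def train(window):
    """Stand-in for real training: the model is the mean of recent observations."""
    return mean(window)

def run_cycle(stream, window_size=5, tolerance=1.0):
    """Consume a stream, retraining whenever the model's error exceeds the tolerance."""
    window, model, retrain_count = [], None, 0
    for value in stream:
        window = (window + [value])[-window_size:]  # keep only the newest data
        if model is None or abs(value - model) > tolerance:
            model = train(window)
            retrain_count += 1
    return model, retrain_count

# Hypothetical stream whose meaning shifts partway through (from about 10 to about 20).
stream = [10.0, 10.2, 9.9, 10.1, 20.0, 20.1, 19.9, 20.2, 20.0, 20.1]
model, retrain_count = run_cycle(stream)
```

The model is trained once at the start, sits untouched while the data stays near 10, and is then retrained repeatedly as the stream shifts, until the window holds only the newer data and the model settles near 20.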

Right now, we don’t have much choice but to think of the machine learning pipeline as discrete stages that need individuated attention. Not because each stage is functionally different, but because there is little in the way of end-to-end integration for all these pieces. In other words, there is no pipeline, really—just a series of activities we tend to think of as a pipeline.

Data platform projects for machine learning

However, projects are coming together that attempt to fill this need for a real pipeline. Some are outgrowths of existing work by data platform vendors.

Hadoop vendor MapR, for example, provides the Distributed Deep Learning Quick Start Solution—a combination of a one-year, six-node license for the MapR Hadoop distribution, integrated neural network libraries with CPU/GPU support, and professional consulting services.

Hortonworks recently announced a way to use containers to deploy TensorFlow across a Hortonworks Data Platform (HDP) cluster. An end-to-end machine learning pipeline built with HDP would still have to be assembled by hand, but the use of containers would make the overall assembly of the pipeline easier.

In the same vein, MapR has work under way to create a lifecycle for data science projects via the microservices model. Much of this revolves around using containers and Kubernetes to organize and orchestrate training and prediction workloads, and using a Kubernetes volume driver to further separate compute from storage.

Yet another lifecycle project currently in the works is MLflow, announced by Databricks, the chief corporate developer behind the Apache Spark project. MLflow covers three aspects of the machine learning lifecycle: tracking experiments (for instance, with varying hyperparameters), packaging code for reuse by others, and managing and deploying models. The MLflow project is not coupled to any particular machine learning framework or algorithm set, which makes it a promising foundation to build from. However, it is still considered an alpha-level offering.

Devops tools for data scientists

An ideal solution would be a complete open source design pattern that covers every phase of the machine learning pipeline and provides a seamless experience akin to the continuous-delivery systems that now exist for software. In other words, something that constitutes, as Wikibon’s Gilbert put it, “devops tools for data scientists.”

Baidu has announced it is looking into such a devops tool for data scientists, with Kubernetes as a chief element (something MapR also uses to coordinate work across nodes in its system), but nothing concrete has materialized yet. 

Until that day comes, we’ll have to settle for learning every bit of the pipeline from the inside out.

Copyright © 2018 IDG Communications, Inc.