Data in, intelligence out: Machine learning pipelines demystified

Data plus algorithms equals machine learning, but how does that all unfold? Let’s lift the lid on the way those pieces fit together, from beginning to end.


Yet another lifecycle project currently in the works is MLflow, announced by Databricks, the chief corporate developer behind the Apache Spark project. MLflow covers three aspects of the machine learning lifecycle: tracking experiments (for instance, runs with varying hyperparameters), packaging code for reuse by others, and managing and deploying models. The MLflow project is not coupled to any particular machine learning framework or algorithm set, which makes it a promising foundation to build on. However, it is still considered an alpha-level offering.
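
To make the idea of experiment tracking concrete, here is a minimal sketch in plain Python. It is not MLflow's API. The `ExperimentLog` and `Run` classes and the `toy_loss` function are hypothetical names invented for illustration, and the "training" step is a placeholder formula rather than a real model. The sketch only shows the core bookkeeping: each run stores the hyperparameters that were tried next to the metrics that resulted, so runs can be compared later.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict, List


@dataclass
class Run:
    """One experiment run: the hyperparameters tried and the metrics observed."""
    params: Dict[str, float]
    metrics: Dict[str, float] = field(default_factory=dict)


class ExperimentLog:
    """Hypothetical in-memory tracker, standing in for a real tracking service."""

    def __init__(self) -> None:
        self.runs: List[Run] = []

    def record(self, params: Dict[str, float], metrics: Dict[str, float]) -> Run:
        # Copy the dicts so later changes by the caller do not alter the log.
        run = Run(params=dict(params), metrics=dict(metrics))
        self.runs.append(run)
        return run

    def best(self, metric: str) -> Run:
        # Lowest value wins, so this suits loss-style metrics.
        return min(self.runs, key=lambda r: r.metrics[metric])

    def to_json(self) -> str:
        # Serialize every run so results can be shared or archived.
        return json.dumps([asdict(r) for r in self.runs], indent=2)


def toy_loss(learning_rate: float) -> float:
    # Placeholder for a real training job; smallest at learning_rate = 0.1.
    return (learning_rate - 0.1) ** 2


log = ExperimentLog()
for lr in (0.01, 0.1, 1.0):
    log.record({"learning_rate": lr}, {"loss": toy_loss(lr)})

best_run = log.best("loss")
```

After the loop, `log.best("loss")` returns the run with the lowest recorded loss, and `log.to_json()` produces a record that could be handed to a colleague. A real tracking tool would add persistent storage, a user interface, and links to the code and model artifacts behind each run.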

Devops tools for data scientists

An ideal solution would be a complete open source design pattern that covers every phase of the machine learning pipeline and provides a seamless experience akin to the continuous-delivery systems that now exist for software. In other words, something that constitutes, as Wikibon’s Gilbert put it, “devops tools for data scientists.”

Baidu has announced it is looking into such a devops tool for data scientists, with Kubernetes as a chief element (something MapR also uses to coordinate work across nodes in its system), but nothing concrete has materialized yet. 

Until that day comes, we’ll have to settle for learning every bit of the pipeline from the inside out.

Copyright © 2018 IDG Communications, Inc.
