Netflix open sources data science management tool

Metaflow manages Python data science projects end-to-end, works with any machine learning library, and integrates with AWS cloud services

Netflix has open sourced Metaflow, an internally developed tool for building and managing Python-based data science projects. Metaflow addresses the entire data science workflow, from prototype to model deployment, and provides built-in integrations with AWS cloud services.

Machine learning and data science projects need mechanisms to track the development of the code, data, and models. Doing all of that manually is error-prone, and tools for source code management, like Git, aren’t well-suited to all of these tasks.

Metaflow provides Python APIs to the entire stack of technologies in a data science workflow, from access to the data through compute resources, versioning, model training, scheduling, and model deployment.

According to Metaflow’s introductory documentation, Netflix built Metaflow to provide its own data scientists and developers with “a unified API to the infrastructure stack that is required to execute data science projects, from prototype to production,” and to “focus on the widest variety of ML use cases, many of which are small or medium-sized, which many companies face on a day to day basis.”

Metaflow does not favor any particular machine learning framework or data science library. Metaflow projects are just Python code, with each step of a project’s data flow represented by common Python programming idioms. Each time a Metaflow project runs, the data it generates is given a unique ID. This lets you access every run—and every step of that run—by referring to its ID or user-assigned metadata. 
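To make the idea of addressable runs concrete, here is a toy sketch in plain Python. It does not use Metaflow's API and is not how Metaflow is implemented; the names `RUNS`, `run_flow`, `find_runs`, `load`, and `total` are hypothetical and exist only for illustration. The sketch treats each step as an ordinary function, records every step's output under a freshly generated run ID, and lets you look up a run by that ID or by a user-assigned tag.

```python
import uuid

# Hypothetical in-memory store: run ID -> tags plus the data each step produced.
RUNS = {}


def load(state):
    """First step: produce some input data."""
    return {"numbers": [1, 2, 3]}


def total(state):
    """Second step: derive a value from the data produced so far."""
    return {"sum": sum(state["numbers"])}


def run_flow(steps, tags=None):
    """Run the steps in order and record each step's output under a new run ID."""
    run_id = uuid.uuid4().hex
    record = {"tags": set(tags or []), "steps": {}}
    state = {}
    for step in steps:
        output = step(state)
        record["steps"][step.__name__] = output
        state = {**state, **output}  # later steps see earlier steps' data
    RUNS[run_id] = record
    return run_id


def find_runs(tag):
    """Return the IDs of all recorded runs carrying the given tag."""
    return [run_id for run_id, record in RUNS.items() if tag in record["tags"]]


first = run_flow([load, total], tags=["baseline"])
second = run_flow([load, total])

# Look up one step of one run by the run's ID.
print(RUNS[first]["steps"]["total"])  # {'sum': 6}

# Look up runs by user-assigned metadata instead of by ID.
print(find_runs("baseline") == [first])  # True
```

The two runs execute the same code yet receive different IDs, which is what allows an earlier result to be retrieved and compared after the project has been run again.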

Netflix recommends running Metaflow on AWS. The company offers a sandboxed version of Metaflow there (with restrictions on storage and data lifetime) for developers to experiment with the framework. 

The first public release, Metaflow 2.0, lacks some of the features Netflix uses internally, such as support for the R language and in-memory processing of large datasets by way of DataFrames. But Netflix is willing to make those features available if their corresponding GitHub issues attract enough support.

Copyright © 2019 IDG Communications, Inc.