Handicapping the AI modeling horse race

AI has become a core focus for application developers everywhere. It’s as hot in the consumer space as it is in business, industry, research, and government.

One key milestone of tech market maturation is when leading alternatives narrow to a two-way horse race. That now describes the market for AI modeling frameworks, which are the environments within which data scientists build and train statistically driven computational graphs.

The AI modeling horse race narrows to TensorFlow vs. PyTorch

The clear leaders in AI modeling frameworks are now the Google-developed TensorFlow and the Facebook-developed PyTorch, and they’re pulling away from the rest of the market in usage, share, and momentum.

Though TensorFlow still has the predominant market share among working data scientists, PyTorch has come on fast in key user segments. According to one recent study, PyTorch has become the overwhelming favorite of data scientists in academic and other research positions, whereas TensorFlow continues to see strong adoption among enterprise AI, deep learning, and machine learning developers. PyTorch has built its following on such strengths as seamless integration with the Python ecosystem, a better-designed API, and better performance for some ad-hoc analyses.

Going forward, most working data scientists will probably use some blend of TensorFlow and PyTorch in most of their work, and both will be available in most commercial data science workbenches. The most recent feature refreshes to both frameworks are rather underwhelming, as befits a market in which core functions are well defined and users prize feature parity over strong functional differentiation.

TensorFlow 2.0: Google has streamlined TensorFlow’s API, removing redundant symbols, providing consistent naming conventions, and recommending Keras as the principal high-level API for ease of use. The vendor has also made eager execution the default, which lets TensorFlow developers immediately inspect how changes to variables and other model components affect model behavior. Developers can now create a single model and deploy it to browsers, mobile devices, and servers through the add-on frameworks TensorFlow.js and TensorFlow Lite. The framework also delivers as much as three times faster training performance using mixed precision on Volta and Turing GPUs with only a few lines of code.
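
To make the eager-execution point concrete, here is a minimal sketch, assuming TensorFlow 2.x with its bundled Keras API; the layer sizes and names are illustrative, not drawn from Google’s own examples:

```python
import tensorflow as tf  # TensorFlow 2.x

# Eager execution is on by default in TF 2.0: operations run immediately,
# so intermediate values can be inspected like ordinary Python objects.
x = tf.constant([[1.0, 2.0]])
w = tf.Variable([[0.5], [0.5]])
print(tf.matmul(x, w))  # concrete result, no Session or graph-build step

# Keras is the recommended high-level API for defining models.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```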

Google has promised framework updates in the near future that will integrate an intermediate-representation compiler for exporting models for easy execution on non-TensorFlow back ends and a wide range of hardware targets.

PyTorch 1.3: Facebook has added support for quantization, the ability to encode a PyTorch model for reduced-precision inference on server or mobile devices, including the ability to quantize a model after it has been trained. The framework now supports named tensors, which enable cleaner machine learning code. Facebook has also added support for Google Cloud tensor processing units (TPUs) to speed the training of machine learning models. Finally, the vendor has introduced PyTorch Mobile for deploying machine learning models on edge devices, starting with Android and iOS; CrypTen, a tool for encrypted machine learning in PyTorch; and Captum, a tool for machine learning model explainability.
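
A minimal sketch of the two headline PyTorch 1.3 features, assuming PyTorch 1.3 or later; the model and dimension names are illustrative:

```python
import torch
import torch.nn as nn

# Named tensors: dimensions carry names, making axis-manipulating code
# self-documenting and checkable at runtime.
imgs = torch.randn(8, 3, 32, 32, names=("N", "C", "H", "W"))
channel_means = imgs.mean("C")  # reduce over the named channel dimension

# Post-training dynamic quantization: rewrite a trained model's nn.Linear
# layers to use reduced-precision (8-bit integer) weights for inference.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```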

Google and Facebook are investing heavily in evolving their respective frameworks to address a growing range of sophisticated AI modeling requirements. Google’s commitment to evolving TensorFlow is clear from the wide range of announcements it made both at this year’s TensorFlow Developer Summit and last year’s. Facebook has rolled out a number of recent PyTorch enhancements to narrow the gap vis-à-vis TensorFlow in platform support, functional coverage, and performance. We can expect to see a regular pulse of such enhancements going forward, with a clear focus on Facebook’s annual F8 developer conference.

Legacy AI frameworks slowly slide into oblivion

This market consolidation didn’t seem inevitable two years ago. Although TensorFlow was already a runaway success at that time, it appeared that other open source frameworks, such as the AWS-backed Apache MXNet, the Microsoft Cognitive Toolkit (CNTK), and the Facebook-developed Caffe2, might join it in the upper tier of adoption among working data scientists.

PyTorch existed then, but it was just one of a pack of contenders that included the likes of Apache Singa, BigDL, Chainer, DeepDist, Deeplearning4j, DistBelief, Distributed Deep Learning, DLib, DyNet, OpenDeep, OpenDL, OpenNN, PaddlePaddle, PlaidML, Sonnet, and Theano. Most of these frameworks are still around and are used in various industries for disparate AI, deep learning, and machine learning projects. However, mentions of them in the data science literature are increasingly few and far between.

As Facebook has strengthened its focus on PyTorch, it has deliberately shifted investment away from the predecessor framework Torch and from Caffe2, another open source framework it had previously brought to market. Operationally, Facebook now builds and trains most of its AI models in PyTorch, just as Google has built its vast AI-enabled cloud computing infrastructure on TensorFlow.

It would not be surprising if most competitors disappear during the next few years in favor of whatever predominant frameworks emerge from the current ferment. PyTorch is certainly on a roll, and its growing ubiquity among researchers practically guarantees that they’ll bring it into their AI jobs, startups, and other commercial activities in the future.

Some observers even believe that the growing popularity of PyTorch presages the eventual decline of TensorFlow. I believe that prediction is premature for several reasons:

  • TensorFlow, unlike PyTorch, does not require a Python runtime in production, and Python’s server runtime carries more overhead than some enterprises are willing to accept on production servers.
  • TensorFlow Lite can be embedded in mobile binaries; Python interpreters cannot.
  • TensorFlow Serving offers features such as no-downtime model updates, seamless switching between models, and batching at prediction time, all of which PyTorch lacks (see the serving sketch after this list).
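
To illustrate the serving point: a TensorFlow model is exported to the language-neutral SavedModel format, which TensorFlow Serving loads and hot-swaps without any Python runtime. A minimal sketch, assuming TensorFlow 2.x; the model and the /tmp path are hypothetical:

```python
import tensorflow as tf

# Build and train a model in Python (training elided for brevity).
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="sgd", loss="mse")

# Export to the versioned SavedModel layout. TensorFlow Serving watches
# the parent directory and switches to new versions (e.g., .../my_model/2)
# with no downtime and no Python interpreter at serving time.
tf.saved_model.save(model, "/tmp/my_model/1")
```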

The diminishing importance of stand-alone AI modeling frameworks

Many AI tool vendors now provide framework-agnostic modeling platforms, which may offer a new lease on life for older frameworks in danger of dying out. These open modeling environments—implemented in today’s leading data science workbenches—support TensorFlow, PyTorch, and other popular open source frameworks.

Accelerating the spread of open AI modeling platforms is industry adoption of several abstraction layers that, taken together, enable a model built in one framework’s front end to be executed in any other supported framework’s back end. These open abstraction layers consist of:

  • High-level modeling interfaces: Keras, which runs on top of TensorFlow and other AI back ends, provides a high-level Python API for fast prototyping and programming of AI models. The Gluon framework, developed by AWS and Microsoft, defines a Python API for simplified AI programming on MXNet, CNTK, and potentially other AI back ends. Furthermore, PlaidML provides an open source abstraction layer that runs on development platforms with OpenCL-capable GPUs from Nvidia, AMD, or Intel. PlaidML builds on the Keras API and, according to its developer Vertex.AI, will be integrated with the TensorFlow, PyTorch, and Deeplearning4j libraries and back ends.
  • Shared model representations: Open Neural Network Exchange (ONNX) provides a shared model representation that is supported by many AI modeling frameworks (see the export sketch after this list). The Keras API enables cross-tool sharing of computational graphs that were defined in that framework. Multi-Level Intermediate Representation (MLIR) is a newer specification, providing a representation format and a library of compiler utilities that sit between the model representation and the low-level compilers/executors that generate hardware-specific code.
  • Cross-framework model compilers: Several industry initiatives enable an AI model created in one front-end tool to be automatically compiled for efficient execution on heterogeneous back-end platforms and chipsets. These include the NNVM Compiler, nGraph, XLA, and TensorRT 3. The Open Neural Network Compiler (ONNC) provides a retargetable compilation framework that transforms ONNX models into optimized binary forms for inferencing on target platforms.
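
As a concrete illustration of a shared model representation, here is a minimal sketch of exporting a trained PyTorch model to ONNX so that an ONNX-compatible runtime or compiler can execute it independently of PyTorch; the model and file name are illustrative:

```python
import torch
import torch.nn as nn

# Stand-in for a trained PyTorch model.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
model.eval()

# A tracing input with the expected shape; export records the graph in ONNX.
dummy_input = torch.randn(1, 4)
torch.onnx.export(model, dummy_input, "model.onnx")

# model.onnx can now be loaded by any ONNX-compatible back end
# (for example, ONNX Runtime) with no dependency on PyTorch.
```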

Before long, it will become next to irrelevant which front-end modeling tool you use to build and train your machine learning model. No matter where you build your AI, the end-to-end data science pipeline will automatically format, compile, containerize, and otherwise serve it out for optimal execution anywhere from cloud to edge.

Copyright © 2019 IDG Communications, Inc.