GPipe and PipeDream: Scaling AI training in every direction

New frameworks Google GPipe and Microsoft PipeDream join Uber Horovod in distributed training for deep learning

Data science is hard work, not a magical incantation. Whether an AI model performs as advertised depends on how well it’s been trained, and there’s no “one size fits all” approach for training AI models.

The necessary evil of distributed AI training

Scaling is one of the trickiest considerations when training AI models. Training can be especially challenging when a model grows too resource hungry to be processed in its entirety on any single computing platform. A model may have grown so large that it exceeds the memory limit of a single processing platform, or it may need special algorithms or infrastructure before it can run on an accelerator. Training data sets may grow so huge that training takes an inordinately long time and becomes prohibitively expensive.

Scaling can be a piece of cake if we don’t require the model to be particularly good at its assigned task. But as we ramp up the level of inferencing accuracy required, the training process can stretch on longer and chew up ever more resources. Addressing this issue isn’t simply a matter of throwing more powerful hardware at the problem. As with many application workloads, one can’t rely on faster processors alone to sustain linear scaling as AI model complexity grows.

Distributed training may be necessary. If the components of a model can be partitioned and distributed to optimized nodes for processing in parallel, the time needed to train a model can be reduced significantly. However, parallelization can itself be a fraught exercise, considering how fragile a construct a statistical model can be.

The model may fail spectacularly if some seemingly minor change in the graph—its layers, nodes, connections, weights, hyperparameters, etc.—disrupts the model’s ability to make accurate inferences. Even if we leave the underlying graph intact and attempt to partition the model’s layers into distributed components, we will then need to recombine their results into a cohesive whole.

If we’re not careful, that may result in a recombined model that is skewed in some way in the performance of its designated task.

New industry frameworks for distributed AI training

Throughout the data science profession, we continue to see innovation in AI model training, with much of it focusing on how to do it efficiently in multiclouds and other distributed environments.

In that regard, Google and Microsoft recently released new frameworks for training deep learning models: Google’s GPipe and Microsoft’s PipeDream. The frameworks follow similar scaling principles.

Though different in several respects, GPipe and PipeDream share a common vision for distributed AI model training. This vision involves the need to:

  • Free AI developers from having to determine how to split specific models given a hardware deployment.
  • Train models of any size, type, and structure on data sets of any scale and format, and in a manner that is agnostic to the intended inferencing task.
  • Partition models so that parallelized training doesn’t distort the domain knowledge (such as how to recognize a face) that it was designed to represent.
  • Parallelize training in a fashion that is agnostic to the topology of distributed target environments.
  • Parallelize both the models and the data in a complex distributed training pipeline.
  • Boost GPU (graphics processing unit) compute speeds for various training workloads.
  • Enable efficient use of hardware resources in the training process. 
  • Reduce communication costs at scale when training on cloud infrastructure.

Scaling training when models and networks become extremely complex

What distinguishes these two frameworks is the extent to which they support optimized performance of training workflows for models with sequential layers (which are generally more difficult to parallelize) and in more complex target environments, such as multicloud, mesh, and cloud-to-edge scenarios.

Google’s GPipe is well suited for fast parallel training of deep neural networks that incorporate multiple sequential layers. It automatically does the following:

  • Partitions models and moves the partitions to different accelerators, such as GPUs or TPUs (tensor processing units), which have special hardware that has been optimized for different training workloads.
  • Splits a mini-batch of training examples into smaller micro-batches that can be processed by the accelerators in parallel.
  • Enables internode distributed learning by using synchronous stochastic gradient descent and pipeline parallelism over a distributed machine learning library.
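To see why micro-batches matter, consider a minimal sketch of a pipelined forward pass. It is not GPipe's code: it assumes every partition takes one equal time step per micro-batch, it ignores the backward pass, and the names `forward_schedule` and `idle_fraction` are hypothetical. Under those assumptions, the share of accelerator time spent idle shrinks as one mini-batch is split into more micro-batches.

```python
def forward_schedule(num_stages, num_micro_batches):
    """List, per time step, the (stage, micro_batch) pairs that run concurrently.

    Assumption: each stage needs exactly one time step per micro-batch.
    """
    total_steps = num_stages + num_micro_batches - 1
    steps = []
    for t in range(total_steps):
        active = []
        for stage in range(num_stages):
            micro_batch = t - stage  # micro-batch reaching this stage at step t
            if 0 <= micro_batch < num_micro_batches:
                active.append((stage, micro_batch))
        steps.append(active)
    return steps


def idle_fraction(num_stages, num_micro_batches):
    """Fraction of (stage, time step) slots in which a stage has nothing to do."""
    steps = forward_schedule(num_stages, num_micro_batches)
    total_slots = num_stages * len(steps)
    busy_slots = sum(len(active) for active in steps)
    return (total_slots - busy_slots) / total_slots


if __name__ == "__main__":
    # Four partitions, with the mini-batch split into 1, 4, or 16 micro-batches.
    for m in (1, 4, 16):
        print(f"micro-batches={m:2d}  idle fraction={idle_fraction(4, m):.3f}")
```

With K partitions and M micro-batches, this toy schedule leaves (K - 1) / (M + K - 1) of the slots idle, so a single undivided mini-batch on four partitions keeps three of every four slots empty, while 16 micro-batches cut the idle share to roughly 16 percent.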

Microsoft’s PipeDream also exploits model and data parallelism, but it’s more geared to boosting performance of complex AI training workflows in distributed environments. One of the AI training projects in Microsoft Research’s Project Fiddle initiative, PipeDream accomplishes this automatically because it can:

  • Separate internode computation and communication in a way that leads to easier parallelism of data and models in distributed AI training.
  • Partition AI models into stages that consist of a consecutive set of layers.
  • Map each stage to a separate GPU that performs the forward and backward pass neural network functions for all layers in that stage.
  • Determine how to partition the models based on a profiling run that is performed on a single GPU.
  • Balance computational loads among different model partitions and nodes, even when a training environment’s distributed topology is highly complex.
  • Minimize communications among the distributed worker nodes that handle the various partitions, given that each worker has to communicate only to a single other worker and to communicate only subsets of the overall model’s gradients and output activations.
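The partitioning and load-balancing steps can be illustrated with a simplified sketch. It assumes a profiling run has already produced a per-layer compute time, and it splits the layers into consecutive stages so that the slowest stage is as fast as possible. The function name `partition_layers` and the sample timings are hypothetical, and the sketch deliberately ignores communication cost, which a real partitioner would also have to weigh.

```python
from functools import lru_cache


def partition_layers(layer_times, num_stages):
    """Split layers into consecutive stages, minimizing the slowest stage.

    Returns (bottleneck_time, [(start, end), ...]) with end exclusive.
    Assumption: only per-layer compute time matters; communication is ignored.
    """
    n = len(layer_times)
    if not 1 <= num_stages <= n:
        raise ValueError("need between 1 and len(layer_times) stages")

    # prefix[i] holds the total time of layers[0:i], so any range sum is O(1).
    prefix = [0.0]
    for t in layer_times:
        prefix.append(prefix[-1] + t)

    @lru_cache(maxsize=None)
    def best(start, stages_left):
        # Best split of layers[start:] into stages_left consecutive stages.
        if stages_left == 1:
            return prefix[n] - prefix[start], ((start, n),)
        result = None
        # Leave at least one layer for each of the remaining stages.
        for end in range(start + 1, n - stages_left + 2):
            first = prefix[end] - prefix[start]
            rest, bounds = best(end, stages_left - 1)
            candidate = max(first, rest)
            if result is None or candidate < result[0]:
                result = (candidate, ((start, end),) + bounds)
        return result

    bottleneck, bounds = best(0, num_stages)
    return bottleneck, list(bounds)


if __name__ == "__main__":
    profiled = [4, 2, 3, 7, 1, 5]  # hypothetical per-layer times from profiling
    bottleneck, stages = partition_layers(profiled, 3)
    print("slowest stage time:", bottleneck)
    print("stage boundaries:", stages)
```

Minimizing the slowest stage is the natural target here because, in a pipeline, throughput is set by whichever stage takes longest; each resulting stage would then be mapped to its own GPU.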

Further details on both frameworks are in their respective research papers: GPipe and PipeDream.

The need for consensus and scalability

Training is a critical feature of AI’s success, and more AI professionals are distributing these workflows across multiclouds, meshes, and distributed edges.

Going forward, Google and Microsoft should align their respective frameworks into an industry consensus approach for distributed AI training. They might want to consider engaging Uber in this regard. The ride-sharing company has a stronger claim to the first-to-market distinction in distributed training frameworks, having open sourced its Horovod project three years ago. The project, which is hosted by the Linux Foundation’s LF AI Foundation, has been integrated with leading AI modeling environments such as TensorFlow, PyTorch, Keras, and Apache MXNet.

Scalability should be a core consideration of any and all such frameworks. Right now, Horovod has some useful features in that regard but lacks the keen scaling focus that Google and Microsoft have built into their respective projects. In terms of scalability, Horovod can run on single or multiple GPUs, and even on multiple distributed hosts without code changes. It is capable of batching small operations, automating distributed tuning, and interleaving communication and computation pipelines.
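The idea behind batching small operations can be shown with a short, framework-independent sketch. It is not Horovod's implementation or API: the function `fuse_into_buffers`, the tensor names, and the threshold are all hypothetical. The sketch greedily packs consecutive small gradient tensors into shared buffers so that one collective communication call can serve several tensors.

```python
def fuse_into_buffers(tensor_sizes, threshold):
    """Greedily pack consecutive tensors into buffers of at most `threshold` elements.

    `tensor_sizes` is a list of (name, element_count) pairs in the order the
    gradients become ready. A tensor larger than the threshold gets a buffer
    of its own. Returns a list of buffers, each a list of tensor names.
    """
    buffers = []
    current, current_size = [], 0
    for name, size in tensor_sizes:
        # Close the current buffer if this tensor would overflow it.
        if current and current_size + size > threshold:
            buffers.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        buffers.append(current)
    return buffers


if __name__ == "__main__":
    # Hypothetical gradient tensors and their element counts.
    grads = [("conv1.w", 400), ("conv1.b", 16), ("fc.w", 900),
             ("fc.b", 10), ("out.w", 50)]
    fused = fuse_into_buffers(grads, threshold=1000)
    print(f"{len(grads)} tensors -> {len(fused)} communication calls")
    for buffer in fused:
        print(buffer)
```

In this example five separate gradient exchanges collapse into two, which matters because each call carries a fixed latency cost no matter how little data it moves.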

Scalability concerns will vary depending on what training scenario you consider. Regardless of which framework becomes dominant—GPipe, PipeDream, Horovod, or something else—it would be good to see industry development of reference workflows for distributed deployment of the following specialized training scenarios:

  • Semisupervised learning that uses small amounts of labeled data (perhaps crowdsourced from human users in mobile apps) to accelerate pattern identification in large, unlabeled data sets, such as those ingested through IoT devices’ cameras, microphones, and environmental sensors.
  • Reinforcement learning that involves building AI modules, such as those deployed in industrial robots, that can learn autonomously with little or no “ground truth” training data, though possibly with human guidance.
  • Collaborative learning that has distributed AI modules, perhaps deployed in swarming drones, that collectively explore, exchange and exploit optimal hyperparameters, thereby enabling all modules to converge dynamically on the optimal trade-off of learning speed versus accuracy.
  • Evolutionary learning that trains a group of AI-driven entities (perhaps mobile and IoT endpoints) through a procedure that learns from an aggregate of self-interested decisions they make, based both on entity-level knowledge and on varying degrees of cross-entity model-parameter sharing.
  • Transfer learning that reuses any relevant training data, feature representations, neural-node architectures, hyperparameters and other properties of existing models, such as those executed on peer nodes.
  • On-device training that enables apps to ingest freshly sensed local data and rapidly update the specific AI models persisted in those devices.
  • Robot navigation learning that works with raw sensory inputs, exploits regularities in environment layouts, and requires only a little bit of training data.

This list doesn’t even begin to hint at the diversity of distributed AI training workflows that will be prevalent in the future. To the extent that we have standard reference frameworks in place in 2020, data scientists will have a strong foundation for carrying the AI revolution forward in every direction.

Copyright © 2020 IDG Communications, Inc.