PyTorch review: A deep learning framework built for speed

PyTorch 1.0 shines for rapid prototyping with dynamic neural networks, auto-differentiation, deep Python integration, and strong support for GPUs

PyTorch review: A deep learning framework built for speed
At a Glance

Deep learning is an important part of the business of Google, Amazon, Microsoft, and Facebook, as well as countless smaller companies. It has been responsible for many of the recent advances in areas such as automatic language translation, image classification, and conversational interfaces.

We haven’t gotten to the point where there is a single dominant deep learning framework. TensorFlow (Google) is very good, but has been hard to learn and use. Also TensorFlow’s dataflow graphs have been difficult to debug, which is why the TensorFlow project has been working on eager execution and the TensorFlow debugger. TensorFlow used to lack a decent high-level API for creating models; now it has three of them, including a bespoke version of Keras.

CNTK (Microsoft) and Apache MXNet (Amazon) have been the principal competitors to TensorFlow, but there are other framework lineages to consider. Caffe (Berkeley Artificial Intelligence Research Lab), originally for image classification, was expanded and updated to Caffe2 (Facebook and others) and given strong production capabilities. Torch (Facebook, Twitter, Google, and others) uses Lua scripting and a CUDA (Compute Unified Device Architecture) C/C++ back end to efficiently solve problems in machine learning, computer vision, signal processing, and other fields. Despite its strengths as a scripting language, Lua became a liability to Torch when the bulk of the deep learning community adopted Python.

CUDA is Nvidia’s API for its general purpose GPUs. GPUs are much faster than CPUs for training and making predictions from deep neural networks; so are Google’s TPUs (tensor processing units) and FPGAs (field programmable gate arrays), which are available for use on AWS, Microsoft Azure, and elsewhere. In some cases, the use of advanced chips (GPUs, TPUs, or FPGAs) can speed up computations over CPUs by 50x per chip used, reducing training times from weeks to hours or from hours to minutes.

PyTorch (Facebook, Twitter, Salesforce, and others) builds on Torch and Caffe2, using Python as its scripting language and an evolved Torch CUDA back end. The production features of Caffe2 – highly scalable execution engine, accelerated hardware support, support for mobile devices, etc. – are being incorporated into the PyTorch project.

Tensors and neural networks in Python

PyTorch is billed by its developers as “Tensors and dynamic neural networks in Python with strong GPU acceleration.” What does that mean?

Tensors are a mathematical construct that is used heavily in physics and engineering. A tensor of rank two is a special kind of matrix; taking the inner product of a vector with the tensor yields another vector with a new magnitude and a new direction. TensorFlow takes its name from the way tensors (of synaptic weight, or the strength of connection between nodes) flow around its network model. NumPy also uses tensors, but calls them n-dimensional arrays (ndarray).

We’ve already discussed GPU acceleration. A dynamic neural network is one that can change from iteration to iteration. For example, a dynamic neural network model in PyTorch may add and remove hidden layers during training to improve its accuracy and generality. PyTorch recreates the graph on the fly at each iteration step. In contrast, TensorFlow by default creates a single dataflow graph, optimizes the graph code for performance, and then trains the model.

While eager execution mode is a fairly new option in TensorFlow, it’s the only way PyTorch runs: API calls execute when invoked, rather than being added to a graph to be run later. That might seem like it would be less computationally efficient, but PyTorch was designed to work that way, and it is no slouch when it comes to training or prediction speed.

PyTorch architecture

At a high level, the PyTorch library contains the following components:

PyTorch integrates acceleration libraries such as Intel MKL (Math Kernel Library) and the Nvidia cuDNN (CUDA Deep Neural Network) and NCCL (Nvidia Collective Communications) libraries to maximize speed. Its core CPU and GPU tensor and neural network back ends—TH (Torch), THC (Torch CUDA), THNN (Torch Neural Network), and THCUNN (Torch CUDA Neural Network)—are written as independent libraries with a C99 API. At the same time, PyTorch is not a Python binding into a monolithic C++ framework, but designed to be deeply integrated with Python and to allow the use of other Python libraries.

The memory usage in PyTorch is efficient compared to Torch and some of the alternatives. One of the optimizations is a set of custom memory allocators for the GPU, since available GPU memory can often limit the size of deep learning models that can be solved at GPU speeds.

PyTorch GPU support

CUDA GPU support in PyTorch goes down to the most fundamental level. In the example below, you see the code detecting a CUDA device, creating a tensor on the GPU, copying a tensor from CPU to GPU, adding the two tensors on the GPU, printing the result, and finally copying the result from GPU to CPU with a different data type and printing that result.

# Let us run this cell only if CUDA is available
# We will use ``torch.device`` objects to move tensors in and out of GPU
if torch.cuda.is_available():
    device = torch.device(“cuda”)          # a CUDA device object
    y = torch.ones_like(x, device=device)  # directly create a tensor on GPU
    x =                       # or just use strings ``.to(“cuda”)``
    z = x + y
    print(“cpu”, torch.double))       # ``.to`` can also change dtype together!

tensor([ 1.9422], device=’cuda:0’)
tensor([ 1.9422], dtype=torch.float64)

In a higher-level scenario, you’d run an entire neural network training on the GPU. To begin with, you’d detect the first CUDA device, as above, and convert all your modules from CPU tensors to CUDA tensors: #this is a deep conversion of the whole neural network

You’ll have to send the inputs and targets at every step to the GPU, as well:

inputs, labels =,

Basically, you can move the entire computation to the GPU with just a few lines of code.

What about using multiple GPUs? DataParallel, a method of the nn neural network class, splits your data automatically and sends job orders to multiple models on several GPUs. After each model finishes its job, DataParallel collects and merges the results before returning it to you.

model = Model(input_size, output_size) # this is a neural network class
if torch.cuda.device_count() > 1:                        # can we go parallel?
  print(“Let’s use”, torch.cuda.device_count(), “GPUs!”)      # yes, we can
  # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
model = nn.DataParallel(model)                         # use all available GPUs                                         # copy the model to the GPUs

PyTorch automatic gradient computation (autograd)

PyTorch has the ability to snapshot a tensor whenever it changes, allowing you to record the history of operations on a tensor and automatically compute the gradients later. You enable this by setting the requires_grad property when you create the tensor:

x = torch.ones(2, 2, requires_grad=True)

Subsequent operations with a tensor that requires gradients may create new tensors, and those will inherit the requires_grad flag. You can change the requires_grad flag in place on a tensor at any time with the requires_grad_(…) method. In PyTorch, a trailing underscore on a method name such as requires_grad_ means that it updates the tensor in place; methods without the trailing underscore generate a new tensor.

How do those snapshots help compute gradients? Basically, the framework approximates the gradient at every saved tensor by looking at the differences between that point and the previous tensor. This is less accurate, but roughly three times more efficient per variable parameter, than evaluating deltas around each state to get the derivatives. If the step size is small, the approximation won’t be too bad.

In PyTorch, you compute the gradient using backpropagation (backprop) by calling the tensor’s backward() method, as shown in this animation, after clearing out any existing gradients from the neural network’s buffers. Then you can use that to update the weight tensor. In short, PyTorch programs create a graph on the fly. Then back-propagation uses the dynamically created graph, automatically calculating the gradients from the saved tensor states.

PyTorch optimizers

Most of the weight update rules (optimizers) used to find the minimum error take the gradient of the loss function as the initial direction to change the values for the next step, multiplied by a small learning rate to reduce the magnitude of the step. The basic algorithm is called steepest descent. For machine learning, the usual variant is stochastic gradient descent, or SGD, which uses multiple batches of data points and often goes through the data multiple times (epochs).

More sophisticated versions of stochastic gradient descent, for example Adam and RMSprop, may compensate for biases, fold in momentum and velocity with the gradient, average gradients, or use adaptive learning rates. PyTorch currently supports 10 optimization methods.

PyTorch neural networks

The torch.nn class defines modules and other containers, module parameters, 11 kinds of layers, 17 loss functions, 20 activation functions, and two kinds of distance functions. Each kind of layer has many variants, for example six convolution layers and 18 pooling layers.

The torch.nn.functional class defines 11 categories of functions. Somewhat confusingly, both torch.nn and torch.nn.functional contain loss and activation member functions. In many cases, however, the torch.nn member is little more than a wrapper for the corresponding torch.nn.functional member.

You can define your own custom models as subclasses of nn.Module. For example:

import torch.nn as nn
import torch.nn.functional as F
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)
    def forward(self, x):
       x = F.relu(self.conv1(x))
       return F.relu(self.conv2(x))

This very simple model has two 2D convolution layers, and uses a rectified linear unit (ReLU) activation function for both layers. The three parameters to nn.Conv2d are the number of input channels, the number of output channels, and the size of the convolving kernel.

You can also use one of the container modules to group your layers into a model. The nn.Sequential container adds modules in the order they are listed in the constructor:

# Example of using Sequential
model = nn.Sequential(
# Example of using Sequential with OrderedDict
model = nn.Sequential(OrderedDict([
          (‘conv1’, nn.Conv2d(1,20,5)),
          (‘relu1’, nn.ReLU()),
          (‘conv2’, nn.Conv2d(20,64,5)),
          (‘relu2’, nn.ReLU())

The nn.ModuleList container is good for the case where you want to generate enumerable lists of layers from code. For example, look at this use of 10 linear layers: 

class MyModule(nn.Module):
    def __init__(self):
        super(MyModule, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(10, 10) for i in range(10)])
    def forward(self, x):
        # ModuleList can act as an iterable, or be indexed using ints
        for i, l in enumerate(self.linears):
            x = self.linears[i // 2](x) + l(x)
        return x

PyTorch examples

The pytorch/examples repo contains worked-out models for MNIST digit classification using convolutional neural networks; word-level language modeling using LSTM RNNs; ImageNet image classification using residual networks; LSUN scene understanding using deep convolutional generative adversarial networks (DCGAN); variational auto-encoder networks; image super-resolution using an efficient sub-pixel convolutional neural network; artistic style transfer using perceptual loss functions; training a CartPole to balance in OpenAI Gym with actor-critic models; SNLI natural language inference with global vectors for word representation (GloVe), LSTMs, and torchtext; and time sequence prediction (sine wave signal values) using LSTMs.

At a Glance
  • Pros

    • NumPy-like tensor computations with strong GPU support
    • Dynamic neural networks
    • Tape-based automatic differentiation
    • Compatible with standard Python libraries
    • Strong collection of neural network layers, optimization algorithms, and loss functions


    • Production MacOS binaries don’t support GPUs
    • Building from source can be tricky
1 2 Page 1
Page 1 of 2