PyTorch is a Python-based tensor computing library with high-level support for neural network architectures. It also supports offloading computation to GPUs. Developed by Facebook’s AI research team and open-sourced a little more than a year ago, PyTorch has quickly become the first choice of many deep learning practitioners.

In this tutorial, we’ll dive into the basics of running PyTorch on Linux, from installation to creating and training a simple neural network that can recognize digits. We’ll cap it off by tackling a more complicated example that uses convolutional neural networks (CNNs) to improve accuracy. This won’t be a full introduction to neural networks, but I will explain neural network concepts as they crop up in our code.

While a computer with a GPU is not necessary for this tutorial, it is recommended. If you want to follow along in a Jupyter notebook, you can make use of the version of this article on GitHub.

## Install PyTorch

The easiest way to install PyTorch is to use the Anaconda Python distribution. If you have Anaconda installed, you can get the latest PyTorch by entering this command:

`$ conda install pytorch torchvision -c pytorch`

If you would rather use Python’s pip, then for Python 2.7, enter these commands:

`$ pip install http://download.pytorch.org/whl/cu80/torch-0.3.0.post4-cp27-cp27mu-linux_x86_64.whl`

`$ pip install torchvision`

Or for Python 3.6, enter these commands:

`$ pip3 install http://download.pytorch.org/whl/cu80/torch-0.3.0.post4-cp36-cp36m-linux_x86_64.whl`

`$ pip3 install torchvision`

Note that if you want to use GPU-accelerated calculations, you will need an Nvidia graphics card and the CUDA libraries installed. The pip wheels above are built against CUDA 8.0 (the `cu80` in their URLs).

## A first PyTorch model

With PyTorch installed, we’re going to do the “Hello world” of deep learning, which is creating a neural network that will examine the images of handwritten digits from the MNIST dataset and identify the numbers. Here is a look at some of the digits:

First we’ll need to get our hands on the dataset. While we could download the images directly from the MNIST website and build scaffolding to load them into PyTorch, PyTorch’s torchvision package lets us download and load standard reference datasets like MNIST and CIFAR-10 (and load others, such as COCO) without much fuss.

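```python
import torch
from torchvision import datasets, transforms

# Convert each image to a 1x28x28 tensor, then normalize it with the
# MNIST training-set mean (0.1307) and standard deviation (0.3081).
# Named `transform` so it does not shadow the `transforms` module.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))])

# 60,000 training images, downloaded to ../data on first use
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=True, download=True,
                   transform=transform),
    batch_size=64, shuffle=True)

# 10,000 test images, read from the files fetched by the download above
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=False, transform=transform),
    batch_size=64, shuffle=True)
```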

This code will create two `DataLoader` objects that download the MNIST dataset and serve up random batches of 64 images: `train_loader` draws from MNIST’s 60,000 training images and `test_loader` from its 10,000 test images. (The data is downloaded only if it is not already present, so `test_loader` will simply use the files that were downloaded via `train_loader`.) Note the `transform` argument applied to both loaders. PyTorch’s torchvision package allows you to create a complex pipeline of transformations for data augmentation that are applied to images as they get pulled out of the `DataLoader`, including random cropping, rotation, reflection, and scaling.

In our example, we’re not doing any of that, but we will take advantage of the pipeline to transform the image data into a tensor. (In MNIST’s case, this tensor has the shape 1x28x28, as the images are all grayscale 28x28 pixels.) `ToTensor()` also rescales the pixel values from the range 0 to 255 to the range 0.0 to 1.0. We then normalize that tensor using the mean (0.1307) and standard deviation (0.3081) of the MNIST training set, which leaves values with roughly zero mean and unit variance, running from about -0.42 to 2.82. We do this because neural networks generally train faster and more reliably when their inputs are centered and on a consistent scale.
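To make the arithmetic concrete, here is a minimal plain-Python sketch (no PyTorch required) of what `ToTensor()` followed by `Normalize((0.1307,), (0.3081,))` does to a single pixel. The helper names `to_tensor_value` and `normalize` are hypothetical, chosen for illustration; torchvision applies the same computation element-wise to whole tensors.

```python
# Framework-free sketch of ToTensor() + Normalize((0.1307,), (0.3081,))
# applied to one 8-bit MNIST pixel value. Helper names are hypothetical.
MNIST_MEAN = 0.1307
MNIST_STD = 0.3081

def to_tensor_value(pixel):
    """ToTensor(): map an 8-bit pixel (0..255) to a float in [0.0, 1.0]."""
    return pixel / 255.0

def normalize(value, mean=MNIST_MEAN, std=MNIST_STD):
    """Normalize(): subtract the dataset mean, then divide by its standard deviation."""
    return (value - mean) / std

for pixel in (0, 128, 255):
    print(pixel, round(normalize(to_tensor_value(pixel)), 4))
```

Black pixels (0) end up at about -0.42 and white pixels (255) at about 2.82.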

## A PyTorch neural network

Next let’s create our first neural network by creating a new Python class that inherits from PyTorch’s `nn.Module`:

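```python
import torch.nn as nn
import torch.nn.functional as F

class FirstNet(nn.Module):
    def __init__(self, image_size):
        super(FirstNet, self).__init__()
        self.image_size = image_size
        # Three fully connected layers: image_size -> 1000 -> 50 -> 10
        self.fc0 = nn.Linear(image_size, 1000)
        self.fc1 = nn.Linear(1000, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        # Flatten each 1x28x28 image into a row of image_size values
        x = x.view(-1, self.image_size)
        x = F.relu(self.fc0(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        # Log-probabilities over the 10 digit classes, one row per image
        return F.log_softmax(x, dim=1)
```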

The general convention for these network classes is that you create all your layers in the constructor, then lay out their relationship in the `forward()` method. Here we’re creating a very simple network where all our layers are linear: the classic “fully connected” neural network, in which each output of a layer is a weighted sum of all of that layer’s inputs plus a bias (the weights and biases are initialized randomly). The network starts with `image_size` inputs, one per pixel of our 28x28 MNIST images, and ends with 10 outputs, corresponding to the 10 digits (zero to nine) that we’re attempting to recognize.
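As a framework-free sketch of what each `nn.Linear` layer computes, the hypothetical helpers below implement y = Wx + b for a single input vector and count a layer’s learnable parameters. The random weights are purely illustrative and do not reproduce PyTorch’s actual initialization scheme.

```python
# Plain-Python sketch of what nn.Linear(in_features, out_features) computes:
# y = W x + b, where W has shape (out_features, in_features).
import random

def linear(x, weights, bias):
    """Apply one fully connected layer to a single input vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(weights, bias)]

def linear_param_count(in_features, out_features):
    """One weight per (output, input) pair plus one bias per output."""
    return in_features * out_features + out_features

# A tiny layer mapping 3 inputs to 2 outputs, with illustrative random weights
# (not PyTorch's initialization scheme) and zero biases.
random.seed(0)
W = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
b = [0.0, 0.0]
print(linear([1.0, 2.0, 3.0], W, b))

# Parameters in FirstNet's layers: 784 -> 1000 -> 50 -> 10
first_net_layers = [(28 * 28, 1000), (1000, 50), (50, 10)]
print(sum(linear_param_count(i, o) for i, o in first_net_layers))
```

Summed over FirstNet’s three layers, that comes to 835,560 learnable parameters, 785,000 of them in the first layer alone.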

The `forward()` method shows us how an image flows through the network. First we convert the image tensor (1x28x28 once it comes through the transformation pipeline) into a shape that the first layer can understand. We do this via the `view()` method, which in this case *flattens* each image into a row of 784 values, so a full batch of 64 images becomes a 64x784 tensor, the shape the first linear layer expects.
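The following plain-Python sketch mimics `x.view(-1, 784)` on a batch of nested lists; the `-1` tells `view()` to infer the batch dimension from the other sizes. The `flatten_batch` helper and the synthetic pixel values are hypothetical, for illustration only.

```python
def flatten_batch(batch):
    """Mimic x.view(-1, 784) for a batch shaped [B][1][28][28].

    Like view() on a contiguous tensor, values are read in row-major order.
    """
    return [[pixel for channel in image for row in channel for pixel in row]
            for image in batch]

# Two synthetic 1x28x28 "images"; each pixel's value is its row-major index.
batch = [[[[r * 28 + c for c in range(28)] for r in range(28)]] for _ in range(2)]
flat = flatten_batch(batch)
print(len(flat), len(flat[0]))  # 2 784
```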

The next three lines of code apply each layer to the incoming data in turn, but there’s also an `F.relu()` call happening at each level. What is this? Well, it’s an example of an *activation function*. These functions are applied to the outputs of each layer and insert non-linearity into the system. Without them, a stack of linear layers would collapse into a single linear transformation, leaving us with little more than a linear regression model. With them, the neural network gains the power of a universal function approximator.

There are many different types of activation function, but most modern deep learning architectures use the ReLU, or Rectified Linear Unit. While this sounds intimidating, it is literally just the function f(x) = max(x, 0): it returns zero when its input is negative and returns the input unchanged when it is positive.
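Here is a minimal plain-Python sketch of ReLU, followed by a quick check of the point about linearity: two stacked linear functions with no activation between them collapse into a single linear function, while a ReLU in between breaks that equivalence. The functions `layer1` and `layer2` are made-up scalar examples, not parts of the network above.

```python
def relu(values):
    """Rectified Linear Unit applied element-wise: max(v, 0)."""
    return [max(v, 0.0) for v in values]

print(relu([-2.0, -0.5, 0.0, 1.5, 3.0]))  # [0.0, 0.0, 0.0, 1.5, 3.0]

# Two made-up scalar "layers"
def layer1(x):
    return 2 * x + 1

def layer2(x):
    return 3 * x - 4

# Without an activation: layer2(layer1(x)) == 3 * (2x + 1) - 4 == 6x - 1, still linear
print(all(layer2(layer1(x)) == 6 * x - 1 for x in range(-5, 6)))  # True

# With ReLU in between, negative intermediate values are clipped to zero,
# so the composition no longer matches 6x - 1 (for example at x = -1)
print(layer2(relu([layer1(-1)])[0]), 6 * -1 - 1)  # -4.0 -7
```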

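Finally, we apply a different function, softmax (here in its logarithmic form, `log_softmax()`), to the output of the final layer. Softmax squashes the 10 outputs of the final layer into values between 0 and 1 that sum to 1, so they can be read as probability estimates for each of the 10 classes; `log_softmax()` returns the logarithms of those probabilities, which is the form the negative log likelihood loss used during training expects. To determine the predicted class of an image, we pick the class with the highest probability.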
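To see the numbers, here is a plain-Python sketch of the log-softmax computation for one image’s 10 outputs, using the usual trick of subtracting the maximum score for numerical stability. The `scores` are made-up values standing in for the final layer’s raw outputs, and the helper names are hypothetical.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax: s_i - log(sum_j exp(s_j))."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

def predict(scores):
    """Index of the most probable class (argmax of the log-probabilities)."""
    log_probs = log_softmax(scores)
    return max(range(len(log_probs)), key=lambda i: log_probs[i])

# Made-up raw outputs for digits 0-9
scores = [0.1, 2.0, -1.0, 0.5, 0.0, 0.3, 1.2, -0.4, 0.8, 0.2]
probs = [math.exp(lp) for lp in log_softmax(scores)]
print(round(sum(probs), 6))  # 1.0: the probabilities sum to 1
print(predict(scores))       # 1: the digit with the highest score
```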

Creating an instance of the network is done in the traditional Python way of calling the constructor:

```python
model = FirstNet(image_size=28*28)
```

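If you have a GPU-enabled machine, you can copy this model to the GPU by calling the `cuda()` method:

```python
# Only move the model if a CUDA device is present (matches the training loop below)
if torch.cuda.is_available():
    model.cuda()
```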

## PyTorch model training and testing

Having created our model, we now need to train it. In some frameworks, like Keras, most of the training is handled for you behind the scenes. In PyTorch, we need to write an explicit training procedure. Here is an example, taken from the PyTorch examples:

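```python
import torch
import torch.nn.functional as F
import torch.optim as optim

lr = 0.001  # learning rate
optimizer = optim.SGD(model.parameters(), lr=lr)

def train(epoch, model):
    model.train()  # enable training-mode behavior (such as dropout)
    for batch_idx, (data, labels) in enumerate(train_loader):
        # Move the batch to the GPU if one is available (the model must be there too)
        if torch.cuda.is_available():
            data, labels = data.cuda(), labels.cuda()
        optimizer.zero_grad()              # clear gradients from the previous batch
        output = model(data)               # forward pass: log-probabilities
        loss = F.nll_loss(output, labels)  # compare predictions with the labels
        loss.backward()                    # backpropagation: compute gradients
        optimizer.step()                   # update the weights
        if batch_idx % 100 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
```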

There is a lot going on here, but it is fairly straightforward if we take it a line at a time. First, before we define the `train()` function, we instantiate our optimizer, which will update the weights of the neural network after each batch from the `DataLoader`. The weights should get better at classifying digits as training continues.

There are various optimizers you can choose from, including RMSProp, AdaGrad, and the one most commonly used today, Adam. But here we’ll use classic, vanilla stochastic gradient descent (SGD) with a learning rate of 0.001, which is a decent starting point. The learning rate tells the optimizer how much to change the weights at each step. Set the learning rate too high, and training may oscillate or even diverge, with accuracy bouncing between high and low. Set it too low, and training may take a very long time.
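The learning-rate trade-off is easy to see on a toy problem. This plain-Python sketch (not MNIST, and not PyTorch’s `SGD` class) minimizes f(w) = w², whose gradient is 2w, with the basic update w ← w − lr × gradient that vanilla SGD applies to every weight. The “stochastic” part of SGD, computing the gradient on random mini-batches, is omitted here; `descend` is a hypothetical helper.

```python
# Toy illustration of the learning-rate trade-off: minimize f(w) = w**2
# (gradient 2*w) with the plain gradient-descent update w <- w - lr * grad.
def descend(lr, w=1.0, steps=20):
    for _ in range(steps):
        grad = 2 * w
        w = w - lr * grad
    return w

for lr in (0.001, 0.1, 0.4, 1.1):
    print(f"lr={lr}: w after 20 steps = {descend(lr):.4g}")
```

Because each step multiplies w by (1 − 2 × lr), a tiny rate barely moves w away from 1.0, moderate rates converge toward the minimum at 0, and a rate above 1.0 makes w overshoot and grow in magnitude on every step.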

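In the `train()` function, we first put the model in training mode and then loop through all the batches in the dataset. For each batch, we copy the image data and the labels (i.e. the digit each image represents) to the GPU, if one is available, and clear the gradients left over from the previous batch with `optimizer.zero_grad()`. PyTorch accumulates gradients across `backward()` calls by default, so they must be reset for each batch.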

The images in the batch are then passed through the model to generate the output tensor, our predictions. This output is then compared to the labels (the right answers) via a *loss function*. We’re using the negative log likelihood loss function here, which is commonly used in classification architectures. Paired with the `log_softmax()` output of our network, it is equivalent to the cross-entropy loss.
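Here is a plain-Python sketch of the negative log likelihood loss, assuming (as with `F.nll_loss`’s default averaging) that the result is the mean over the batch. The log-probabilities are made up: the loss is small when the network assigns high probability to the correct class and large when it is confidently wrong.

```python
import math

def nll_loss(log_probs_batch, labels):
    """Mean negative log likelihood: for each example, take minus the
    log-probability assigned to the correct class, then average."""
    return -sum(lp[y] for lp, y in zip(log_probs_batch, labels)) / len(labels)

# Two examples, three classes each; the probabilities are made up.
log_probs = [[math.log(0.7), math.log(0.2), math.log(0.1)],  # label 0: confident and right
             [math.log(0.1), math.log(0.1), math.log(0.8)]]  # label 1: confident and wrong
labels = [0, 1]
print(nll_loss(log_probs, labels))
```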

We then invoke PyTorch magic. The call to `loss.backward()` performs backpropagation, computing the gradient of the loss with respect to each of the network’s parameters (its “weights”). Then, by calling `optimizer.step()`, we adjust the weights using those gradients according to the optimizer’s update rule. You can think of this process as a ball rolling across a hilly landscape, where we’re trying to get to the bottom. With each step, we nudge the network in the direction we think is downhill.

Finally, every 100 batches we print the epoch, our progress through the training set, and the current loss.
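To demystify what `loss.backward()` and `optimizer.step()` do, here is a plain-Python sketch of one training step for a single linear neuron with squared-error loss. The gradients are worked out by hand with the chain rule (PyTorch computes them automatically for every parameter), and the update is the plain SGD rule without momentum. The neuron, loss, and `training_step` helper are illustrative assumptions, not the MNIST network.

```python
# One training step for a single linear neuron y_hat = w * x + b
# with squared-error loss L = (y_hat - y)**2.
def training_step(w, b, x, y, lr):
    y_hat = w * x + b              # forward pass
    loss = (y_hat - y) ** 2
    dloss_dyhat = 2 * (y_hat - y)  # "backward": chain rule by hand
    grad_w = dloss_dyhat * x       # dL/dw
    grad_b = dloss_dyhat           # dL/db
    w -= lr * grad_w               # "step": move against the gradient
    b -= lr * grad_b
    return w, b, loss

w, b = 0.0, 0.0
for _ in range(50):
    w, b, loss = training_step(w, b, x=2.0, y=5.0, lr=0.05)
print(round(w * 2.0 + b, 4))  # the prediction approaches the target 5.0
```

With these settings each step halves the prediction error, so after 50 steps the neuron’s output is essentially 5.0.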

The `test()` function, which I’m not showing here (feel free to take a look at the Jupyter notebook on GitHub), switches the model into evaluation mode, makes predictions on the test set, and reports the accuracy of the model. If we run the train/test cycle for 10 epochs (an epoch is one full pass through the training data), we’ll get an accuracy in the area of 80 percent. Not bad, but we can do better without much effort.
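The notebook’s `test()` isn’t reproduced here, but its core calculation is easy to sketch. Below is a hypothetical, framework-free `accuracy` helper that picks the highest-scoring class for each example and compares it with the label. In PyTorch itself, this step is usually written with `model.eval()`, a `torch.no_grad()` block, and `output.argmax(dim=1)`, accumulating the count of correct predictions over every batch of `test_loader`.

```python
def accuracy(log_probs_batch, labels):
    """Percentage of examples whose highest-scoring class equals the label."""
    correct = 0
    for log_probs, label in zip(log_probs_batch, labels):
        predicted = max(range(len(log_probs)), key=lambda i: log_probs[i])
        correct += int(predicted == label)
    return 100.0 * correct / len(labels)

# Made-up log-probabilities for three images and two classes
batch = [[-0.1, -2.3],   # predicts class 0
         [-1.6, -0.2],   # predicts class 1
         [-0.5, -0.9]]   # predicts class 0
labels = [0, 1, 1]
print(accuracy(batch, labels))  # two of the three predictions are correct
```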

## A PyTorch convolutional neural network

Most computer vision deep learning architectures these days are built from stacks of convolutional layers, forming *convolutional neural networks (CNNs)*, instead of the fully connected layers shown above. A convolutional layer can be thought of as a group of small filters that pass over the image. Each filter is trained to look for certain features: in a network trained on faces, one filter might end up recognizing eyes, another might seek out noses, and so on, while the early filters in most networks respond to simpler patterns such as edges and strokes. Here is a very basic convolutional neural network in PyTorch:

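```python
import torch.nn as nn
import torch.nn.functional as F

class CNNNet(nn.Module):
    def __init__(self):
        super(CNNNet, self).__init__()
        # 1 input channel (grayscale) -> 10 feature maps, 5x5 filters
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        # 10 -> 20 feature maps, 5x5 filters
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        # After two conv/pool stages: 20 feature maps of 4x4 = 320 values per image
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        # 1x28x28 -> conv -> 10x24x24 -> 2x2 max pool -> 10x12x12
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        # 10x12x12 -> conv -> 20x8x8 -> dropout -> 2x2 max pool -> 20x4x4
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)  # flatten to (batch, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)  # active only in training mode
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)
```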
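Where does the 320 in `x.view(-1, 320)` come from? The plain-Python sketch below implements a single-channel “valid” 2D convolution (stride 1, no padding, matching `nn.Conv2d`’s defaults) and 2x2 max pooling, then traces a 28x28 input through both stages. A real `Conv2d` layer also sums across input channels and adds a bias, which this sketch omits; that does not change the output size. The helper names, input image, and filter values are hypothetical.

```python
# Single-channel "valid" convolution and 2x2 max pooling in plain Python.
# Like PyTorch's Conv2d, this computes cross-correlation (the kernel is not flipped).
def conv2d_valid(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool2d(fmap, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

image = [[(r * c) % 7 for c in range(28)] for r in range(28)]        # arbitrary 28x28 input
kernel = [[1 if a == b else 0 for b in range(5)] for a in range(5)]  # arbitrary 5x5 filter

x = max_pool2d(conv2d_valid(image, kernel))  # conv1 stage: 28 -> 24 -> 12
print(len(x), len(x[0]))                     # 12 12
x = max_pool2d(conv2d_valid(x, kernel))      # conv2 stage: 12 -> 8 -> 4
print(len(x), len(x[0]))                     # 4 4
print(20 * len(x) * len(x[0]))               # 20 feature maps of 4x4 = 320
```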

If we create a new model with this network, create a new optimizer for that model’s parameters, and run it for 10 epochs, we will suddenly improve accuracy to above 90 percent. Aside from the convolutional layers (`Conv2d`), the other new concepts introduced here are max pooling, which is a form of downsampling that keeps only the largest value in each small window (2x2 here), and dropout, which randomly zeroes a fraction of the activations (half of them, with the default probability of 0.5) when the model is in training mode. (`Dropout2d` drops entire feature maps rather than individual values.) Dropout helps the model learn in a more generalizable fashion, that is, to learn to discern the structure of a 1 instead of merely memorizing the pixel values of the training images.
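Here is a plain-Python sketch of dropout as PyTorch implements it, the “inverted” variant: in training mode each value is zeroed with probability p and the survivors are scaled by 1 / (1 − p), so the expected activation is unchanged and nothing needs rescaling at test time. The `dropout` helper here is hypothetical and works on a flat list; it drops individual values, like `F.dropout`, rather than whole feature maps like `Dropout2d`.

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout: in training mode, zero each value with probability p
    and scale the survivors by 1 / (1 - p); in evaluation mode, pass through."""
    if not training or p == 0:
        return list(activations)
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else a * scale for a in activations]

random.seed(42)
acts = [1.0] * 10
print(dropout(acts, training=True))   # roughly half zeroed, the rest doubled
print(dropout(acts, training=False))  # unchanged at evaluation time
```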

## Next steps with PyTorch

That will do it for this tutorial. If you’re eager to learn more about the PyTorch framework, check out the PyTorch tutorials site for all sorts of examples, from image classification to translating text between different languages. If you’re looking to explore deep learning in general using PyTorch, I recommend having a look at the fast.ai course. It will take you through the theory of deep learning and its applications in a very accessible manner.