Build and Train a Neural Network

Neural networks can be constructed using the torch.nn package.

An nn.Module contains layers, and a method forward(input) that returns the output.
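
For example, a minimal module might hold a single layer and apply it in forward (a toy sketch; the class name and sizes here are made up and unrelated to the LeNet example below):

import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super(TinyNet, self).__init__()
        self.fc = nn.Linear(4, 2)   # one learnable layer registered on the module

    def forward(self, x):
        return self.fc(x)           # forward(input) returns the output

tiny = TinyNet()
print(tiny(torch.randn(1, 4)).size())   # torch.Size([1, 2])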

A typical training procedure for a neural network is:

  1. Define the neural network that has some learnable parameters (or weights)

  2. Iterate over a dataset of inputs

  3. Process input through the network

  4. Compute the loss (how far is the output from being correct)

  5. Propagate gradients back into the network’s parameters

  6. Update the weights of the network, typically using a simple update rule:

    weight = weight - learning_rate * gradient

In the following, we will take LeNet as an example of network construction and see how this procedure works.

(Figure: the LeNet convolutional network architecture)

Define the Network

import torch
import torch.nn as nn
import torch.nn.functional as F
class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()

        # 1 input image channel, 6 output channels, 3x3 convolution kernel
        self.conv1 = nn.Conv2d(1, 6, 3)
        # 6 input channels, 16 output channels, 3x3 convolution kernel
        self.conv2 = nn.Conv2d(6, 16, 3)

        # Fully connected layers
        # affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 6 * 6, 120) # 6*6 from image dimension
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):

        # Convolution operation
        x = self.conv1(x)
        # ReLU
        x = F.relu(x)
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(x, (2, 2))

        # If the size is a square, we can only specify a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2) 

        # FC layer
        x = x.view(-1, self.num_flat_features(x)) # flatten
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)

        return x

    def num_flat_features(self, x):
        size = x.size()[1:] # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features

net = Net()
print(net)
Net(
  (conv1): Conv2d(1, 6, kernel_size=(3, 3), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(3, 3), stride=(1, 1))
  (fc1): Linear(in_features=576, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)

We just have to define the forward function, and the backward function (where gradients are computed) is automatically defined for us using autograd.

We can use any of the Tensor operations in the forward function.

The learnable parameters of a model are returned by net.parameters().

params = list(net.parameters())

Let’s check out the size of the parameters in the first conv layer:

(The first conv layer has 6 filters; each filter has one input channel and a 3x3 kernel, so the weight shape is 6x1x3x3.)

print(f"#conv1's weight: {params[0].size()}")
#conv1's weight: torch.Size([6, 1, 3, 3])
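
We can also walk over all parameters by name and count them (a quick sketch; the variable names below are our own):

total = 0
for name, p in net.named_parameters():
    print(name, tuple(p.size()))    # e.g. conv1.weight (6, 1, 3, 3)
    total += p.numel()              # number of scalars in this tensor
print(f"total learnable parameters: {total}")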

Forward

Let’s try a random 32x32 input (the expected input size of LeNet):

input = torch.randn(1, 1, 32, 32)
out = net(input)
print(out)
tensor([[ 0.0114,  0.1167, -0.0449, -0.0072, -0.0791, -0.0805, -0.0467,  0.0667, -0.0750,  0.0985]], grad_fn=<AddmmBackward>)

Backward

Zero the gradient buffers of all parameters and backpropagate with random gradients:

net.zero_grad()
out.backward(torch.randn(1, 10))

Note that torch.nn ONLY supports mini-batches. The entire torch.nn package only supports inputs that are a mini-batch of samples, and NOT a single sample.

For example, nn.Conv2d will take in a 4D Tensor of nSamples x nChannels x Height x Width.

For a single sample, just use input.unsqueeze(0) to add a fake batch dimension.
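
For example (a small sketch using the net defined above):

single = torch.randn(1, 32, 32)   # one sample: channels x height x width
batch = single.unsqueeze(0)       # shape becomes (1, 1, 32, 32)
print(batch.size())               # torch.Size([1, 1, 32, 32])
out = net(batch)                  # now accepted by nn.Conv2d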

Loss Function

A loss function takes the (output, target) pair of inputs, and computes a value that estimates how far away the output is from the target.

output = net(input)
target = torch.randn(10) # dummy target 
target = target.view(1, -1) # make it the same shape as output
criterion = nn.MSELoss() # here we use Mean Squared Error as our loss function

loss = criterion(output, target) # Compute loss

If we follow loss in the backward direction using its .grad_fn attribute, we will see a computational graph that looks like this:

input -> conv2d -> relu -> maxpool2d 
      -> conv2d -> relu -> maxpool2d
      -> view -> linear -> relu -> linear -> relu -> linear
      -> MSELoss
      -> loss

So, when we call loss.backward(), the whole graph is differentiated w.r.t. the loss, and all Tensors in the graph that have requires_grad=True will have their .grad Tensor accumulated with the gradient.
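
For illustration, we can hop a few nodes backward from loss through grad_fn (the exact class names printed may differ across PyTorch versions):

print(loss.grad_fn)                                            # MSELoss node
print(loss.grad_fn.next_functions[0][0])                       # Linear (fc3) node
print(loss.grad_fn.next_functions[0][0].next_functions[0][0])  # ReLU node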

Backpropagation

To backpropagate the error, all we have to do is call loss.backward(). Remember that we need to clear the existing gradients first, though; otherwise new gradients will be accumulated on top of the existing ones.

net.zero_grad() # zeroes the gradient buffers of all parameters

print('conv1.bias.grad before backward:')
print(net.conv1.bias.grad)

loss.backward(retain_graph=True)

print('conv1.bias.grad after backward:')
print(net.conv1.bias.grad)
conv1.bias.grad before backward:
tensor([0., 0., 0., 0., 0., 0.])
conv1.bias.grad after backward:
tensor([-0.0037,  0.0082,  0.0034, -0.0042,  0.0045, -0.0054])

Update Weights

The simplest update rule used in practice is Stochastic Gradient Descent (SGD):

weight = weight - learning_rate * gradient
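
We could implement this rule by hand with plain tensor operations (a sketch only; in practice we use torch.optim, as shown next):

learning_rate = 0.01
for f in net.parameters():
    f.data.sub_(f.grad.data * learning_rate)   # in-place: weight -= lr * gradient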

There are various different update rules, such as SGD, Nesterov-SGD, Adam, RMSProp, etc. The torch.optim package implements all of these methods.

import torch.optim as optim

# Specify optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)

# In training loop:
# 1. Zero the gradient buffer
optimizer.zero_grad()

# 2. Forward propagation through the neural network to get the output
output = net(input)

# 3. Compute loss
loss = criterion(output, target)

# 4. Backpropagate to get gradients of the loss w.r.t. each parameter
loss.backward()

# 5. Update parameters 
optimizer.step()
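
Putting it all together, a full training run repeats these five steps for every mini-batch in the dataset. A minimal sketch, assuming a DataLoader named trainloader that yields (inputs, targets) batches (the loader itself is not defined in this tutorial):

for epoch in range(2):                          # loop over the dataset twice
    for inputs, targets in trainloader:         # iterate over mini-batches
        optimizer.zero_grad()                   # 1. zero the gradient buffers
        outputs = net(inputs)                   # 2. forward pass
        loss = criterion(outputs, targets)      # 3. compute the loss
        loss.backward()                         # 4. backpropagate
        optimizer.step()                        # 5. update the parameters
    print(f"epoch {epoch}: last mini-batch loss = {loss.item():.4f}")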