DL-With-PyTorch | Haobin Tan

Deep Learning with PyTorch

Sun, 18 Oct 2020 00:00:00 +0000

Book

Code

Github repo

Almost all of our example notebooks contain the following boilerplate in the first cell (some lines may be missing in early chapters):

%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.set_printoptions(edgeitems=2)
torch.manual_seed(123)

Convention
- Variables named with a _t suffix are tensors stored in CPU memory.
- Variables named with a _g suffix are tensors in GPU memory.
- Variables named with a _a suffix are NumPy arrays.

Pretrained Networks

Sun, 18 Oct 2020 00:00:00 +0000

Pretrained Network for Object Recognition

Use pretrained network in `TorchVision`

The TorchVision project contains a few of the best-performing neural network architectures for computer vision, such as
- AlexNet (http://mng.bz/lo6z)
- ResNet (https://arxiv.org/pdf/1512.03385.pdf)
- Inception v3 (https://arxiv.org/pdf/1512.00567.pdf)

It also provides easy access to datasets like ImageNet and other utilities for getting up to speed with computer vision applications in PyTorch.

The predefined models can be found in torchvision.models:

from torchvision import models

dir(models)

['AlexNet',
 'DenseNet',
 'GoogLeNet',
 'GoogLeNetOutputs',
 'Inception3',
 'InceptionOutputs',
 'MNASNet',
 'MobileNetV2',
 'ResNet',
 'ShuffleNetV2',
 'SqueezeNet',
 'VGG',
 '_GoogLeNetOutputs',
 '_InceptionOutputs',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '_utils',
 'alexnet',
 'densenet',
 'densenet121',
 'densenet161',
 'densenet169',
 'densenet201',
 'detection',
 'googlenet',
 'inception',
 'inception_v3',
 'mnasnet',
 'mnasnet0_5',
 'mnasnet0_75',
 'mnasnet1_0',
 'mnasnet1_3',
 'mobilenet',
 'mobilenet_v2',
 'quantization',
 'resnet',
 'resnet101',
 'resnet152',
 'resnet18',
 'resnet34',
 'resnet50',
 'resnext101_32x8d',
 'resnext50_32x4d',
 'segmentation',
 'shufflenet_v2_x0_5',
 'shufflenet_v2_x1_0',
 'shufflenet_v2_x1_5',
 'shufflenet_v2_x2_0',
 'shufflenetv2',
 'squeezenet',
 'squeezenet1_0',
 'squeezenet1_1',
 'utils',
 'vgg',
 'vgg11',
 'vgg11_bn',
 'vgg13',
 'vgg13_bn',
 'vgg16',
 'vgg16_bn',
 'vgg19',
 'vgg19_bn',
 'video',
 'wide_resnet101_2',
 'wide_resnet50_2']

The capitalized names (e.g. ResNet) refer to Python classes that implement a number of popular models. They differ in their architecture—that is, in the arrangement of the operations occurring between the input and the output.
- E.g.: create an instance of the AlexNet class.
```
# create an instance of AlexNet class
alexnet = models.AlexNet()
```
  But wait! If we did that, we would be feeding data through the whole network to produce … garbage!!! 😢
  
  That’s because the network is uninitialized: its weights, the numbers by which inputs are added and multiplied, have not been trained on anything—the network itself is a blank (or rather, random) slate. We’d need to either train it from scratch or load weights from prior training.
  
  To use models with predefined numbers of layers and units, and optionally download and load pretrained weights into them, we need to use the lowercase names in the models module.
The lowercase names are convenience functions that return models instantiated from those classes, sometimes with different parameter sets.
- For instance, resnet101 returns an instance of ResNet with 101 layers, resnet18 has 18 layers, and so on.
- Create an instance of the network and pass an argument that will instruct the function to download the weights of resnet101 trained on the ImageNet dataset, with 1.2 million images and 1,000 categories:
```
resnet = models.resnet101(pretrained=True)
```

Load and show an image from the local filesystem

Use Pillow (https://pillow.readthedocs.io/en/stable), an image-manipulation module for Python:

from PIL import Image

# assume that the variable IMG_PATH holds the path of the image
img = Image.open(IMG_PATH)
img # show the image inline

Set `eval` mode before inference

In order to do inference, we need to put the network in eval mode:

resnet.eval()

(If we forget to do that, some layers in pretrained models, like batch normalization and dropout, will stay in training mode, and the network will not produce meaningful answers, just because of the way those layers work internally.)

Retrieve image label

- Load a text file listing the labels in the same order they were presented to the network during training.
- Pick out the label at the index that produced the highest score from the network.

(Almost all models meant for image recognition have output in a form similar to that)
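The label lookup can be sketched as follows. This is a minimal illustration under stated assumptions: `labels` and `out` are small hypothetical stand-ins. In a real run, `labels` would hold the 1,000 ImageNet class names read from a text file, and `out` would be the output of `resnet` on a preprocessed image batch.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins (not from the original notes)
labels = ["cat", "dog", "bird"]            # one label per output unit
out = torch.tensor([[0.5, 2.0, -1.0]])     # scores, shape: (batch, num_classes)

# Index of the highest score for each item in the batch
_, index = torch.max(out, dim=1)
best_label = labels[index[0].item()]

# Softmax turns the scores into values that sum to 1; scale to a percentage
confidence = F.softmax(out, dim=1)[0, index[0]].item() * 100

print(best_label, confidence)
```

The softmax step is optional: it does not change which index wins, it only makes the score easier to read as a confidence.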

Torch Hub

Torch Hub is a mechanism through which authors can publish a model on GitHub, with or without pretrained weights, and expose it through an interface that PyTorch understands. This makes loading a pretrained model from a third party as easy as loading a TorchVision model.

All it takes is to place a file named hubconf.py in the root directory of the GitHub repository. TorchVision is one example: its repository contains a hubconf.py.

Torch Hub is quite new, and there are only a few models published this way. We can get at them by Googling “github.com hubconf.py.”

PyTorch Tensor

Mon, 19 Oct 2020 00:00:00 +0000

import torch

The world as floating-point numbers

Neural networks transform floating-point representations into other floating-point representations. The starting and ending representations are typically human interpretable, but the intermediate representations are less so.

To handle and store data, PyTorch introduces a fundamental data structure: the tensor. In the context of deep learning, tensors refer to the generalization of vectors and matrices to an arbitrary number of dimensions.

Tensors: Multidimensional arrays

Another name for tensor is multidimensional array. Compared to NumPy arrays, PyTorch tensors have a few superpowers, such as the ability to

- perform very fast operations on graphics processing units (GPUs)
- distribute operations on multiple devices or machines
- keep track of the graph of computations that created them.

Tensor construction

From python list:

a = torch.tensor(list(range(9)))
a

tensor([0, 1, 2, 3, 4, 5, 6, 7, 8])

Use constructors from PyTorch:

a = torch.ones(3, 4)
a

tensor([[1., 1., 1., 1.],
 [1., 1., 1., 1.],
 [1., 1., 1., 1.]])

The essence of tensors

Python lists or tuples of numbers are collections of Python objects that are individually allocated in memory.
PyTorch tensors or NumPy arrays are views over (typically) contiguous memory blocks containing unboxed C numeric types rather than Python objects.

Indexing tensors

Use range indexing notation, just as with standard Python lists.
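A short sketch of what range indexing looks like on a 2D tensor. The `points` values are illustrative and the variable names are chosen only for this example.

```python
import torch

points = torch.tensor([[4.0, 1.0], [5.0, 3.0], [2.0, 1.0]])

first_row = points[0]            # first row: a 1D tensor with two elements
rows_after_first = points[1:]    # all rows after the first, all columns
first_column = points[:, 0]      # every row, first column only
with_extra_dim = points[None]    # adds a dimension of size 1 at the front

print(first_row, rows_after_first.shape, first_column, with_extra_dim.shape)
```

The same slice syntax applies to each dimension separately, with the dimensions separated by commas.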

Tensor element types

Specifying the numeric type with `dtype`

The dtype argument to tensor constructors (that is, functions like tensor, zeros, and ones) specifies the numerical data (d) type that will be contained in the tensor. The default data type for tensors is 32-bit floating-point.

E.g.

double_points = torch.ones(10, 2, dtype=torch.double)

Typical `dtype`

Computations happening in neural networks are typically executed with 32-bit floating-point precision.
Tensors can be used as indexes in other tensors. In this case, PyTorch expects indexing tensors to have a 64-bit integer (int64) data type.
Predicates on tensors, such as points > 1.0, produce bool tensors indicating whether each individual element satisfies the condition.

Casting `dtype`

Cast the tensor to the right type using the corresponding casting method.

For example, cast torch.int to torch.double

points = torch.zeros(10, 2, dtype=torch.int)
points = points.double()

Or use the more convenient to method:

points = points.to(torch.double)

When mixing input types in operations, the inputs are converted to the larger type automatically.
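A small sketch of this automatic conversion, assuming a PyTorch version with type promotion (1.5 or later) and the default floating-point type left at 32 bits. The variable names are illustrative.

```python
import torch

points_64 = torch.rand(5, dtype=torch.double)   # 64-bit floats
points_short = points_64.to(torch.short)        # 16-bit integers

# Mixing the two promotes the result to the larger type
product = points_64 * points_short
print(product.dtype)

# The same happens between an integer tensor and a default float tensor
ints = torch.ones(3, dtype=torch.int32)
floats = torch.ones(3)
mixed = ints + floats
print(mixed.dtype)
```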

The Tensor API

The vast majority of operations on and between tensors are available in the torch module and can also be called as methods of a tensor object. There is no difference between the two forms; they can be used interchangeably.

Example:

a = torch.ones(3, 2)
a_transpose = torch.transpose(a, 0, 1) # call from the torch module
a.shape, a_transpose.shape

(torch.Size([3, 2]), torch.Size([2, 3]))

a = torch.ones(3, 2)
a_transpose = a.transpose(0, 1) # method of the tensor object
a.shape, a_transpose.shape

(torch.Size([3, 2]), torch.Size([2, 3]))

The online docs (http://pytorch.org/docs) are exhaustive and well organized, with the tensor operations divided into groups:

Creation ops: Functions for constructing a tensor, like ones and from_numpy
Indexing, slicing, joining, mutating ops: Functions for changing the shape, stride, or content of a tensor, like transpose
Math ops: Functions for manipulating the content of the tensor through computations
- Pointwise ops: Functions for obtaining a new tensor by applying a function to each element independently, like abs and cos
- Reduction ops: Functions for computing aggregate values by iterating through tensors, like mean, std, and norm
- Comparison ops: Functions for evaluating numerical predicates over tensors, like equal and max
- Spectral ops: Functions for transforming in and operating in the frequency domain, like stft and hamming_window
- Other operations: Special functions operating on vectors, like cross, or matrices, like trace
- BLAS and LAPACK operations: Functions following the Basic Linear Algebra Subprograms (BLAS) specification for scalar, vector-vector, matrix-vector, and matrix-matrix operations
Random sampling: Functions for generating values by drawing randomly from probability distributions, like randn and normal
Serialization: Functions for saving and loading tensors, like load and save
Parallelism: Functions for controlling the number of threads for parallel CPU execution, like set_num_threads

Tensors: Scenic views of storage

Values in tensors are allocated in contiguous chunks of memory managed by torch.Storage instances.

A storage is a one-dimensional array of numerical data: that is, a contiguous block of memory containing numbers of a given type.
A PyTorch Tensor instance is a view of such a Storage instance that is capable of indexing into that storage using an offset and per-dimension strides.

Multiple tensors can index the same storage even if they index into the data differently.

The underlying memory is allocated only once. So creating alternate tensor-views of the data can be done quickly regardless of the size of the data managed by the Storage instance.👏
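As a minimal sketch of two tensors indexing the same memory differently, the following builds a flat 1D view over a 2D tensor with `view`. The variable names are illustrative; `data_ptr` returns the memory address of a tensor's first element.

```python
import torch

points = torch.tensor([[4.0, 1.0], [5.0, 3.0], [2.0, 1.0]])
flat = points.view(6)        # 1D view over the same six numbers, no copy

flat[0] = 7.0                # write through the 1D view ...
print(points[0, 0])          # ... and the 2D tensor sees the change

# Both tensors start at the same memory address
print(points.data_ptr() == flat.data_ptr())
```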

Indexing into `storage`

The storage for a given tensor is accessible using the .storage() method:

points = torch.tensor([[4.0, 1.0], [5.0, 3.0], [2.0, 1.0]])
points.storage()

 4.0
 1.0
 5.0
 3.0
 2.0
 1.0
[torch.FloatStorage of size 6]

Even though the tensor reports itself as having three rows and two columns, the storage under the hood is a contiguous array of size 6. In this sense, the tensor just knows how to translate a pair of indices into a location in the storage.

Changing the value of a storage leads to changing the content of its referring tensor:

points

tensor([[4., 1.],
 [5., 3.],
 [2., 1.]])

points_storage = points.storage()
points_storage[0] = 2.0 # change the value of an element of a storage
points

tensor([[2., 1.],
 [5., 3.],
 [2., 1.]])

Modifying stored values: In-place operations

A trailing underscore in a method name, as in zero_, indicates that the method operates in place by modifying the input instead of creating a new output tensor and returning it.

Any method without the trailing underscore leaves the source tensor unchanged and instead returns a new tensor.

Example:

a = torch.ones(3, 2)
a

tensor([[1., 1.],
 [1., 1.],
 [1., 1.]])

a.zero_() # in-place zeroing a
a

tensor([[0., 0.],
 [0., 0.],
 [0., 0.]])

🧐 Tensor metadata: Size, offset, and stride

In order to index into a storage, tensors rely on a few pieces of information that, together with their storage, unequivocally define them:

size/shape: a tuple indicating how many elements across each dimension the tensor represents.
(storage) offset: index in the storage corresponding to the first element in the tensor.
stride: number of elements in the storage that need to be skipped over to obtain the next element along each dimension.

Example:

points = torch.tensor([[4.0, 1.0], [5.0, 3.0], [2.0, 1.0]])
second_point = points[1]

Size/Shape
```
second_point.size()
second_point.shape
```
Offset
```
second_point.storage_offset()
```
Stride
```
second_point.stride()
```

This indirection between Tensor and Storage makes some operations inexpensive, like transposing a tensor or extracting a subtensor, because they do not lead to memory reallocations. 👍 Instead, they consist of allocating a new Tensor object with a different value for size, storage offset, or stride.
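The following sketch evaluates the three pieces of metadata for `second_point` and then uses offset and stride to locate an element by hand. The flat copy `flat` and the indices `i`, `j` are introduced only for this illustration.

```python
import torch

points = torch.tensor([[4.0, 1.0], [5.0, 3.0], [2.0, 1.0]])
second_point = points[1]

size = tuple(second_point.size())         # (2,): two elements along one dimension
offset = second_point.storage_offset()    # 2: skip the first row (two elements)
stride = second_point.stride()            # (1,): next element is one step away

# Locate points[i, j] by hand: offset + i * stride[0] + j * stride[1]
flat = points.flatten()                   # the six values in row-major order
i, j = 2, 1
position = points.storage_offset() + i * points.stride(0) + j * points.stride(1)
manual_value = flat[position]

print(size, offset, stride, position, manual_value)
```

This arithmetic is all a tensor needs to translate a pair of indices into a location in its storage.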

Cloning a tensor

Use .clone()
Changing the cloned tensor won’t change the original tensor
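A minimal sketch contrasting a clone with a plain subtensor view. The values are illustrative.

```python
import torch

points = torch.tensor([[4.0, 1.0], [5.0, 3.0], [2.0, 1.0]])

second_point = points[1].clone()   # independent copy with its own memory
second_point[0] = 10.0             # modifies only the copy

print(points)                      # the original tensor is unchanged
```

Without the call to `clone`, `points[1]` would be a view, and the assignment would also change `points`.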

Transposing without copying

For two-dimensional tensors, we can use the t function, a shorthand alternative to transpose:

points = torch.tensor([[3, 1, 2], [4, 1, 7]])
points

tensor([[3, 1, 2],
 [4, 1, 7]])

points_t = points.t()
points_t

tensor([[3, 4],
 [1, 1],
 [2, 7]])

These two tensors share the same storage:

points.storage().data_ptr() == points_t.storage().data_ptr() # both storages point at the same memory

True

They differ only in shape and stride:
- Increasing the first index by one in points (for example, going from points[0,0] to points[1,0]) will skip along the storage by three elements, while increasing the second index (from points[0,0] to points[0,1]) will skip along the storage by one. (In other words, the storage holds the elements in the tensor sequentially row by row.)
```
points.shape, points.stride()
```
```
(torch.Size([2, 3]), (3, 1))
```
The transpose from points into points_t changes the order of the elements in the stride. After that, increasing the row (the first index of the tensor) will skip along the storage by one, just like when we were moving along columns in points.
```
points_t.shape, points_t.stride()
```
```
(torch.Size([3, 2]), (1, 3))
```
This is the very definition of transposing. No new memory is allocated: transposing is obtained only by creating a new Tensor instance with different stride ordering than the original.

Transposing in higher dimensions

We can transpose a multidimensional array by specifying the two dimensions along which transposing should occur:

some_t = torch.ones(3, 4, 5)
some_t.shape

torch.Size([3, 4, 5])

transpose_t = some_t.transpose(0, 2)
transpose_t.shape

torch.Size([5, 4, 3])

Moving tensors between CPU and GPU

Managing a tensor’s `device` attribute

Create a tensor on the GPU by specifying the corresponding argument to the constructor:

# create a tensor on the GPU 
points_gpu = torch.tensor([[4.0, 1.0], [5.0, 3.0], [2.0, 1.0]], device='cuda')

Move tensor between CPU and GPU using the to method:

points = torch.tensor([[3, 1, 2], [4, 1, 7]]) # tensor on CPU
points_gpu = points.to(device='cuda') # copy the tensor from CPU to GPU

points_cpu = points_gpu.to(device='cpu') # copy the tensor from GPU to CPU

If our machine has more than one GPU, we can also decide on which GPU we allocate the tensor by passing a zero-based integer identifying the GPU on the machine:
```
points_gpu = points.to(device='cuda:0')
```

We can also use the shorthand methods cpu and cuda instead of the to method to achieve the same goal:

a = torch.ones(3, 2)

a_gpu = a.cuda() # cpu -> gpu(cuda:0)
a_gpu = a.cuda(0) # explicitly specify which GPU
a_cpu = a_gpu.cpu() # gpu -> cpu
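The snippets above assume that a CUDA-capable GPU is present. A common pattern, sketched here as an assumption rather than something the notes prescribe, is to pick the device once and fall back to the CPU when no GPU is available:

```python
import torch

# Use the GPU when one is available, otherwise stay on the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

points = torch.tensor([[4.0, 1.0], [5.0, 3.0], [2.0, 1.0]])
points_dev = points.to(device)       # no-op if the tensor is already there

doubled = 2 * points_dev             # computed on whichever device was chosen
back_on_cpu = doubled.to('cpu')      # bring the result back to CPU memory

print(points_dev.device, back_on_cpu.device)
```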

NumPy interoperability

PyTorch tensors can be converted to NumPy arrays and vice versa very efficiently:

Pytorch tensor –> Numpy array: numpy()
```
points = torch.ones(3, 4) # pytorch tensor
points
```
```
tensor([[1., 1., 1., 1.],
 [1., 1., 1., 1.],
 [1., 1., 1., 1.]])
```
```
points_np = points.numpy() # numpy array
points_np
```
```
array([[1., 1., 1., 1.],
 [1., 1., 1., 1.],
 [1., 1., 1., 1.]], dtype=float32)
```
‼️ Note:
- The returned array shares the same underlying buffer with the tensor storage. This means the numpy method can be effectively executed at basically no cost, as long as the data sits in CPU RAM.
- It also means modifying the NumPy array will lead to a change in the originating tensor. If the tensor is allocated on the GPU, it must first be copied to the CPU (for example with the cpu method) before calling numpy; the resulting array then lives in CPU memory and no longer shares a buffer with the GPU tensor.
```
points_np[0][1] = 2 # changing an element of np array will also change tensor
points
```
```
tensor([[1., 2., 1., 1.],
 [1., 1., 1., 1.],
 [1., 1., 1., 1.]])
```

Numpy array –> Pytorch tensor: from_numpy()

points = torch.from_numpy(points_np)
points

tensor([[1., 2., 1., 1.],
 [1., 1., 1., 1.],
 [1., 1., 1., 1.]])

It also uses the same buffer-sharing strategy, i.e. modifying the PyTorch tensor will lead to a change in the originating NumPy array:

points[1][1] = 3 # change element of tensor will also change np array
points_np

array([[1., 2., 1., 1.],
 [1., 3., 1., 1.],
 [1., 1., 1., 1.]], dtype=float32)

Serializing tensors

If the data inside is valuable, we will want to save it to a file and load it back at some point. After all, we don’t want to have to retrain a model from scratch every time we start running our program.

PyTorch uses pickle under the hood to serialize the tensor object, plus dedicated serialization code for the storage.

Save the points tensor to an ourpoints.t file:

# assuming the PATH variable holds the path of ourpoints.t file

torch.save(points, PATH)

Load points back:
```
points = torch.load(PATH)
```
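A self-contained round trip might look like the sketch below. It writes into a temporary directory (an assumption made so the example cleans up after itself) and checks that the loaded tensor matches the original.

```python
import os
import tempfile

import torch

points = torch.tensor([[4.0, 1.0], [5.0, 3.0], [2.0, 1.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'ourpoints.t')
    torch.save(points, path)        # serialize the tensor to disk
    restored = torch.load(path)     # read it back into a new tensor

print(torch.equal(points, restored))
```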

Real-world Data Representation Using Tensors

Wed, 21 Oct 2020 00:00:00 +0000

import torch

Images

An image is represented as a collection of scalars arranged in a regular grid with a height and a width (in pixels).

grayscale image: single scalar per grid point (the pixel)
multi-color image: multiple scalars per grid point, which would typically represent different colors.
- The most common way to encode color into numbers is RGB, where a color is defined by three numbers representing the intensity of red, green, and blue.

Loading an image file

Loading a PNG image using the imageio module:

import imageio

# Assume that the PATH variable holds the path of the image
img_arr = imageio.imread(PATH)

At this point, img_arr (of shape H x W x C) is a NumPy array-like object with three dimensions:

two spatial dimensions, height (H) and width (W)
a third dimension corresponding to the red, green, and blue channels (C)

Change the layout to PyTorch supported layout

PyTorch modules dealing with image data require tensors to be laid out as C × H × W : channels, height, and width, respectively.

We can use the tensor’s permute method with the old dimensions for each new dimension to get to an appropriate layout. Given an input tensor H × W × C as obtained previously, we get a proper layout by having dimension 2 (the channels) first and then dimensions 0 and 1:

img = torch.from_numpy(img_arr) # np arr -> torch tensor
out = img.permute(2, 0, 1) # adjust to pytorch required layout

Note: the permute() operation does NOT make a copy of the tensor data. Instead, out uses the same underlying storage as img and only plays with the size and stride information at the tensor level.
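The claim that permute only changes size and stride can be checked on a small fake image. The array below is a hypothetical 4 × 5 RGB image filled with zeros, used in place of a file loaded from disk.

```python
import numpy as np
import torch

img_arr = np.zeros((4, 5, 3), dtype=np.uint8)   # H x W x C stand-in image
img = torch.from_numpy(img_arr)
out = img.permute(2, 0, 1)                      # C x H x W

print(img.shape, img.stride())   # H x W x C with strides (15, 3, 1)
print(out.shape, out.stride())   # C x H x W with strides (1, 15, 3)

# Same memory: writing through out is visible in img
out[0, 0, 0] = 255
print(img[0, 0, 0], out.data_ptr() == img.data_ptr())
```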

Create a dataset of multiple images

To create a dataset of multiple images to use as an input for our neural networks, we store the images in a batch along the first dimension to obtain an N × C × H × W tensor.

How to do this?

Pre-allocate a tensor of appropriate size.
```
batch = torch.zeros(batch_size, 3, 256, 256, dtype=torch.uint8)
```
- dtype=torch.uint8: we’re expecting each color to be represented as an 8-bit integer, as in most photographic formats from standard consumer cameras.
Fill it with images loaded from a directory

Now we can load all PNG images from an input directory and store them in the tensor:

import os

# assume data_dir is our input directory 
filenames = [name for name in os.listdir(data_dir)
 if os.path.splitext(name)[-1] == '.png']

for i, filename in enumerate(filenames):
 img_arr = imageio.imread(os.path.join(data_dir, filename))
 img_t = torch.from_numpy(img_arr)
 img_t = img_t.permute(2, 0, 1)
 img_t = img_t[:3] # just keep the first three channels (RGB)
 batch[i] = img_t

Normalizing the data

Neural networks exhibit the best training performance when the input data ranges roughly from 0 to 1, or from -1 to 1.

So a typical thing we’ll want to do is

Cast a tensor to floating-point
Normalize the values of the pixels
- It depends on what range of the input we decide should lie between 0 and 1 (or -1 and 1)
- One possibility is to just divide the values of the pixels by 255 (the maximum representable number in 8-bit unsigned)
```
batch = batch.float() # cast to floating point tensor
batch /= 255.0 # normalize
```
- Another possibility for normalization is to compute the mean and standard deviation of the input data and scale it so that the output has zero mean and unit standard deviation across each channel:
  $$ \forall x \in \text{dataset}: \quad x:= \frac{x - \text{mean}}{\text{standard deviation}} $$
```
n_channels = batch.shape[1] # shape is: N x C x H x W
for c in range(n_channels):
 mean = torch.mean(batch[:, c])
 std = torch.std(batch[:, c])
 batch[:, c] = (batch[:, c] - mean) / std
```

In working with images, it is good practice to compute the mean and standard deviation on all the training data in advance and then subtract and divide by these fixed, precomputed quantities.
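One way to follow this practice is sketched below, under the assumption that the training images fit in a single N × C × H × W tensor. The random tensors and the `normalize` helper are hypothetical and exist only for this example.

```python
import torch

torch.manual_seed(123)
train_batch = torch.rand(8, 3, 4, 4)   # stand-in for the training images
new_batch = torch.rand(2, 3, 4, 4)     # stand-in for images seen later

# Per-channel statistics, computed once on the training data.
# keepdim=True keeps the shape 1 x C x 1 x 1 so they broadcast over a batch.
channel_mean = train_batch.mean(dim=(0, 2, 3), keepdim=True)
channel_std = train_batch.std(dim=(0, 2, 3), keepdim=True)

def normalize(batch):
    """Normalize a batch with the fixed, precomputed training statistics."""
    return (batch - channel_mean) / channel_std

train_normalized = normalize(train_batch)
new_normalized = normalize(new_batch)
print(train_normalized.mean(dim=(0, 2, 3)), train_normalized.std(dim=(0, 2, 3)))
```

Reusing the training statistics for later data means every image goes through exactly the same transformation.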

Tabular data

Spreadsheet, CSV file, or database: a table containing one row per sample (or record), where columns contain one piece of information about our sample.

There’s no meaning to the order in which samples appear in the table (such a table is a collection of independent samples).
Tabular data is typically not homogeneous: different columns don’t have the same type.

PyTorch tensors, on the other hand, are homogeneous. Information in PyTorch is typically encoded as a number, typically floating-point (though integer types and Boolean are supported as well).

Continuous, ordinal, and categorical values

| Type of values | Have order? | Have numerical meaning? |
|---|---|---|
| categorical | ❌ | ❌ |
| ordinal | ✅ | ❌ |
| continuous | ✅ | ✅ |

continuous values
- strictly ordered
- a difference between various values has a strict meaning
- Example
  
  Stating that package A is 2 kilograms heavier than package B, or that package B came from 100 miles farther away than A has a fixed meaning, regardless of whether package A is 3 kilograms or 10, or if B came from 200 miles away or 2,000.
- The literature actually divides continuous values further
  - ratio scale: it makes sense to say something is twice as heavy or three times farther away
  - interval scale: the time of day has a notion of difference, but it is not reasonable to claim that 6:00 is twice as late as 3:00
ordinal values
- The strict ordering we have with continuous values remains, but the fixed relationship between values no longer applies.
- Example:
  
  Ordering a small, medium, or large drink, with small mapped to the value 1, medium 2, and large 3. The large drink is bigger than the medium, in the same way that 3 is bigger than 2, but it doesn’t tell us anything about how much bigger.
  
  If we were to convert our 1, 2, and 3 to the actual volumes (say, 8, 12, and 24 fluid ounces), then they would switch to being interval values.
- We can’t “do math” on the values outside of ordering them (trying to average large = 3 and small = 1 does not result in a medium drink!)
categorical values
- Have neither ordering nor numerical meaning to their values. These are often just enumerations of possibilities assigned arbitrary numbers.
- Example:
  
  Assigning water to 1, coffee to 2, soda to 3, and milk to 4. There’s no real logic to placing water first and milk last; they simply need distinct values to differentiate them. We could assign coffee to 10 and milk to –3, and there would be no significant change.

Loading tabular data

Python offers several options for quickly loading a CSV file. Three popular options are

The csv module that ships with Python
NumPy
Pandas (most time- and memory-efficient)

Since PyTorch has excellent NumPy interoperability, we’ll go with that.

import csv
import numpy as np

# assume PATH variable holds the path of the csv file
tabular_data_numpy = np.loadtxt(PATH,
 dtype=np.float32, # element type of the resulting NumPy array
 delimiter=";", # delimiter used to separate values in each row
 skiprows=1 # skip the first line, since it contains the column names
 )

Convert the numpy array to pytorch tensor:

tabular_data_tensor = torch.from_numpy(tabular_data_numpy)

Get the names of each column

col_list = next(csv.reader(open(PATH), delimiter=';'))

One-hot encoding

Assume that we use 1 to 10 to represent the score/class. We could build a one-hot encoding of the scores: encode each of the 10 scores in a vector of 10 elements, with all elements set to 0 but one, at a different index for each score. For example, a score of 1 could be mapped onto the vector (1,0,0,0,0,0,0,0,0,0), a score of 5 onto (0,0,0,0,1,0,0,0,0,0), and so on. Note that there’s no implied ordering or distance (i.e. they are categorical values) when we use one-hot encoding.

We can achieve one-hot encoding using the scatter_ method, which fills the tensor with values from a source tensor along the indices provided as arguments:

# assume that we already have the score tensor
score

tensor([6, 6, ..., 7, 6])

score.shape

torch.Size([4898])

score_onehot = torch.zeros(score.shape[0], 10) # in our case: score.shape[0] = 4898
score_onehot.scatter_(1, score.unsqueeze(1), 1.0)

scatter_(dim, index, src)

- dim: The dimension along which the following two arguments are specified
- index: A column tensor indicating the indices of the elements to scatter
  - It is required to have the same number of dimensions as the tensor we scatter into.
  - Since score_onehot has two dimensions (4,898 × 10), we need to add an extra dummy dimension to score using unsqueeze.
- src: A tensor containing the elements to scatter or a single scalar to scatter (1, in this case)

In other words, the previous invocation reads, “For each row, take the index of the score label (which coincides with the score in our case) and use it as the column index to set the value 1.0.” The end result is a tensor encoding categorical information.
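A small sketch of the same call on a toy example. The scores here are hypothetical and run from 1 to 3. Because column indices start at 0, the sketch subtracts 1 from each score; this matches the mapping of score 1 onto (1,0,0,...) described earlier, and it is needed whenever the largest score equals the number of columns.

```python
import torch

score = torch.tensor([1, 3, 2, 3])        # hypothetical scores in 1..3
n_classes = 3

# Shift to zero-based column indices and add the dummy dimension
index = (score - 1).unsqueeze(1)          # shape: 4 x 1, dtype int64

score_onehot = torch.zeros(score.shape[0], n_classes)
score_onehot.scatter_(1, index, 1.0)      # one 1.0 per row, at column index

print(score_onehot)
```

Taking `argmax` along dimension 1 recovers the zero-based class index for each row.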

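When to categorise?

Treating the scores as categorical or as continuous both come at a cost:

- Categorical: we lose the ordering of the scores and hope that our model will pick it up during training, which may work if we only have a few categories.
- Continuous: we introduce an arbitrary notion of distance between the scores.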

Text

🎯 Goal: turn text into tensors of numbers that a neural network can process.

Converting text to numbers

There are two particularly intuitive levels at which networks operate on text:

character level: processing one character at a time
word level: individual words are the finest-grained entities to be seen by the network.

The technique with which we encode text information into tensor form is the same whether we operate at the character level or the word level.

One-hot-encoding characters

First we will load the text:

# assume PATH variable holds the txt file
with open(PATH, encoding='utf8') as f:
 text = f.read()

Encoding of the character: Every written character is represented by a code (a sequence of bits of appropriate length so that each character can be uniquely identified).

We are going to one-hot encode our characters. Depending on the task at hand, we could

make all of the characters lowercase, to reduce the number of different characters in our encoding
screen out punctuation, numbers, or other characters that aren’t relevant to our expected kinds of text.

At this point, we need to parse through the characters in the text and provide a one-hot encoding for each of them: Each character will be represented by a vector of length equal to the number of different characters in the encoding. This vector will contain all zeros except a one at the index corresponding to the location of the character in the encoding.

For the sake of simplicity, we first split our text into a list of lines and pick an arbitrary line to focus on:

lines = text.split('\n') # split text into a list of lines
line = lines[200] # pick arbitrary line

letter_t = torch.zeros(len(line), 128) # 128 hardcoded due to the limits of ASCII

for i, letter in enumerate(line.lower().strip()):
 # The text uses directional double quotes, which are not valid ASCII, 
 # so we screen them out here.
 letter_index = ord(letter) if ord(letter) < 128 else 0
 letter_t[i][letter_index] = 1

One-hot encoding whole words

We’ll define a helper function, clean_words, which takes text and returns it in lowercase and stripped of punctuation.

def clean_words(input_str):
 punctuation = '.,;:"!?”“_-'
 word_list = input_str.lower().replace('\n', '').split()
 word_list = [word.strip(punctuation) for word in word_list]
 return word_list

When we call it on our “Impossible, Mr. Bennet” line, we get the following:

words_in_line = clean_words(line)
line, words_in_line

('“Impossible, Mr. Bennet, impossible, when I am not acquainted with him',
 ['impossible',
 'mr',
 'bennet',
 'impossible',
 'when',
 'i',
 'am',
 'not',
 'acquainted',
 'with',
 'him'])

Now, let’s build a mapping of all words in text to indexes in our encoding:

word_list = sorted(set(clean_words(text)))
word2index_dict = {word: i for (i, word) in enumerate(word_list)}

word2index_dict is now a dictionary with words as keys and an integer as a value. We will use it to efficiently find the index of a word as we one-hot encode it. For example, let’s look up the index of word “possible”:

word2index_dict['possible']

Let’s see how we can one-hot encode the words of the sentence “Impossible, Mr. Bennet, impossible, when I am not acquainted with him”:

create an empty tensor
assign the one-hot-encoded values of the word in the sentence

# create an empty tensor
word_t = torch.zeros(len(words_in_line), len(word2index_dict))

# assign the one-hot-encoded values of the word in the sentence
for i, word in enumerate(words_in_line):
 word_index = word2index_dict[word]
 word_t[i][word_index] = 1
 print(f"{i:2} {word_index:4} {word}")

 0 6925 impossible
 1 8832 mr
 2 1906 bennet
 3 6925 impossible
 4 14844 when
 5 6769 i
 6 714 am
 7 9198 not
 8 312 acquainted
 9 15085 with
10 6387 him

word_t.shape

torch.Size([11, 15514])

The choice between character-level and word-level encoding leaves us to make a trade-off:

In many languages, there are significantly fewer characters than words: representing characters has us representing just a few classes, while representing words requires us to represent a very large number of classes
On the other hand, words convey much more meaning than individual characters, so a representation of words is considerably more informative by itself.

Text embeddings

An embedding is an effective way to map individual words into a space with a fixed number of dimensions (let’s say, 100) in a way that facilitates downstream learning. An ideal solution would be to generate the embedding in such a way that words used in similar contexts are mapped to nearby regions of the embedding.

Embeddings are often generated using neural networks, trying to predict a word from nearby words (the context) in a sentence. In this case, we could start from one-hot-encoded words and use a (usually rather shallow) neural network to generate the embedding. Once the embedding was available, we could use it for downstream tasks.

One interesting aspect of the resulting embeddings is that similar words end up not only clustered together, but also having consistent spatial relationships with other words. For example, if we were to take the embedding vector for apple and begin to add and subtract the vectors for other words, we could begin to perform analogies like apple - red - sweet + yellow + sour and end up with a vector very similar to the one for lemon.
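To make the mapping concrete, here is a minimal sketch using `nn.Embedding`, PyTorch's lookup-table layer. This layer is not mentioned in the notes above, so treat its use here as an assumption. Its weights start out random, so the vectors carry no meaning until they are trained. The vocabulary size and word indices are taken from the word-level example above.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

vocab_size, embedding_dim = 15514, 100
embedding = nn.Embedding(vocab_size, embedding_dim)   # random, untrained weights

# Indices of "impossible", "mr", "bennet" from the example above
word_indices = torch.tensor([6925, 8832, 1906])
vectors = embedding(word_indices)                     # shape: 3 x 100

# Looking up a row gives the same result as multiplying
# a one-hot vector by the weight matrix
onehot = torch.zeros(len(word_indices), vocab_size)
onehot[torch.arange(len(word_indices)), word_indices] = 1
same = torch.allclose(onehot @ embedding.weight, vectors)

print(vectors.shape, same)
```

The lookup avoids materializing the large one-hot vectors, which is why each word is usually passed in as an index.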

The Mechanics of Learning

Sat, 24 Oct 2020 00:00:00 +0000

import torch

Learning is just parameter estimation

Given
- input data
- corresponding desired outputs (ground truth)
- initial values for the weights
The model is fed input data (forward pass)
A measure of the error is evaluated by comparing the resulting outputs to the ground truth
In order to optimize the parameters of the model (its weights)
- The change in the error following a unit change in weights (that is, the gradient of the error with respect to the parameters) is computed using the chain rule for the derivative of a composite function (backward pass)
- The value of the weights is then updated in the direction that leads to a decrease in the error
- The procedure is repeated until the error, evaluated on unseen data, falls below an acceptable level.

A simple linear model

t_c = w * t_u + b

w: weight, tells us how much a given input influences the output.
b: bias, tells us what the output would be if inputs were zero.

Now we need to estimate w and b, the parameters in our model, based on the data we have. We must do it so that temperatures we obtain from running the unknown temperatures t_u through the model are close to temperatures we actually measured in Celsius (t_c). That sounds like fitting a line through a set of measurements!

Let’s flesh it out again:

we have a model with some unknown parameters, and we need to estimate those parameters so that the error between predicted outputs and measured values is as low as possible.
We need to exactly define a measure of the error. Such a measure, which we refer to as the loss function, should be high if the error is high and should ideally be as low as possible for a perfect match.
Our optimization process should therefore aim at finding w and b so that the loss function is at a minimum.

Modeling with PyTorch

We can define the model as a Python function:

def model(t_u, w, b):
    """
    t_u: input tensor
    w: weight parameter
    b: bias parameter
    """
    return w * t_u + b

For the loss function we choose the mean squared error (building a tensor of differences, taking their square element-wise, and finally producing a scalar loss by averaging all of the elements in the resulting tensor):

def loss_fn(t_p, t_c):
    squared_diffs = (t_p - t_c) ** 2
    return squared_diffs.mean()
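To see the two functions working together, here is a minimal sketch on made-up readings (the values of t_u and t_c below are assumptions for illustration, not the dataset used in these notes). With w = 1 and b = 0 the model simply returns its input, so the loss measures how far the raw readings are from the Celsius values.

```python
import torch

def model(t_u, w, b):
    return w * t_u + b

def loss_fn(t_p, t_c):
    return ((t_p - t_c) ** 2).mean()

# Made-up readings for illustration only
t_u = torch.tensor([32.0, 50.0, 68.0])   # unknown units
t_c = torch.tensor([0.0, 10.0, 20.0])    # Celsius

w = torch.ones(())
b = torch.zeros(())

t_p = model(t_u, w, b)       # with w=1, b=0 the prediction equals the input
loss = loss_fn(t_p, t_c)     # mean of the squared differences (a scalar tensor)
print(t_p, loss)
```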

Down along the gradient

We’ll optimize the loss function with respect to the parameters using the gradient descent algorithm, which is actually a very simple idea and scales up surprisingly well to large neural network models with millions of parameters.

params -= learning_rate * params.grad
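The update rule needs the gradient of the loss with respect to the parameters. As a rough sketch, assuming the linear model and mean squared error defined earlier, the gradient can be derived by hand with the chain rule and plugged into the update. The data and the learning rate below are assumptions chosen so that the loop converges quickly.

```python
import torch

# Made-up data that follows t_c = 2 * t_u + 1 exactly (an assumption for the demo)
t_u = torch.tensor([0.0, 1.0, 2.0, 3.0])
t_c = 2.0 * t_u + 1.0

def model(t_u, w, b):
    return w * t_u + b

def grad_fn(t_u, t_c, t_p):
    # d(loss)/d(t_p) for the mean squared error
    dloss_dtp = 2.0 * (t_p - t_c) / t_p.numel()
    # chain rule: d(t_p)/dw = t_u and d(t_p)/db = 1
    return torch.stack([(dloss_dtp * t_u).sum(), dloss_dtp.sum()])

params = torch.tensor([1.0, 0.0])   # initial guess for [w, b]
learning_rate = 0.1

for epoch in range(1000):
    t_p = model(t_u, *params)
    params = params - learning_rate * grad_fn(t_u, t_c, t_p)

print(params)   # should approach w = 2, b = 1
```

A learning rate that is too large makes the updates overshoot and diverge, so in practice it is tuned per problem.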

PyTorch’s `autograd`

PyTorch provides a mechanism called autograd: PyTorch tensors can remember where they come from, in terms of the operations and parent tensors that originated them, and they can automatically provide the chain of derivatives of such operations with respect to their inputs. This means

we won’t need to differentiate our model by hand 👏
given a forward expression, no matter how nested, PyTorch will automatically provide the gradient of that expression with respect to its input parameters 👏

Applying `autograd`

In order to activate autograd, we need to initialize the parameters tensor with requires_grad=True

params = torch.tensor([1.0, 0.0], requires_grad=True)

Using the `grad` attribute

requires_grad=True is telling PyTorch to track the entire family tree of tensors resulting from operations on params. In other words, any tensor that will have params as an ancestor will have access to the chain of functions that were called to get from params to that tensor. In case these functions are differentiable (and most PyTorch tensor operations will be), the value of the derivative will be automatically populated as a grad attribute of the params tensor.

In general, all PyTorch tensors have an attribute named grad. Normally, it’s None at the beginning:

params.grad is None

True

All we have to do to populate it is to start with a tensor with requires_grad set to True, then call the model and compute the loss, and then call backward() on the loss tensor:

loss = loss_fn(model(t_u, *params), t_c)
loss.backward()

At this point, the grad attribute of params contains the derivatives of the loss with respect to each element of params.
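One way to build trust in autograd is to compare params.grad with a gradient derived by hand. The sketch below does that for the linear model and mean squared error; the data is made up for illustration.

```python
import torch

t_u = torch.tensor([0.0, 1.0, 2.0, 3.0])   # made-up inputs
t_c = torch.tensor([1.0, 3.0, 5.0, 7.0])   # made-up targets

def model(t_u, w, b):
    return w * t_u + b

def loss_fn(t_p, t_c):
    return ((t_p - t_c) ** 2).mean()

params = torch.tensor([1.0, 0.0], requires_grad=True)
loss = loss_fn(model(t_u, *params), t_c)
loss.backward()   # populates params.grad

# Hand-derived gradient of the mean squared error, for comparison
with torch.no_grad():
    residual = model(t_u, *params) - t_c
    manual_grad = torch.stack([(2 * residual * t_u).mean(),
                               (2 * residual).mean()])

print(params.grad, manual_grad)
```

Both tensors should hold the same two values: the derivative with respect to w and the derivative with respect to b.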

What happened under the hood?

When we compute our loss while the parameters w and b require gradients, in addition to performing the actual computation, PyTorch creates the autograd graph with the operations (in black circles) as nodes:

When we call loss.backward(), PyTorch traverses this graph in the reverse direction to compute the gradients:

Note! Calling backward accumulates derivatives at leaf nodes: each call adds the new gradient to whatever is already stored in grad. We need to *zero the gradient explicitly* after using it for parameter updates. We can do this easily using the in-place zero_ method:

if params.grad is not None:
 params.grad.zero_()
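The accumulation is easy to observe directly. In this small sketch (toy_loss is an arbitrary function made up for the demonstration), calling backward twice without zeroing doubles the stored gradient.

```python
import torch

params = torch.tensor([1.0, 0.0], requires_grad=True)

def toy_loss(p):
    # Arbitrary differentiable function of the parameters, for illustration
    return (p ** 2).sum() + p[1]

toy_loss(params).backward()
first = params.grad.clone()        # gradient after one backward call

toy_loss(params).backward()        # no zeroing in between
accumulated = params.grad.clone()  # the second gradient was added to the first

params.grad.zero_()                # reset in place before the next iteration
print(first, accumulated, params.grad)
```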

Now our autograd-enabled training code looks like this:

def training_loop(n_epochs, learning_rate, params, t_u, t_c):
    for epoch in range(1, n_epochs + 1):
        if params.grad is not None:
            params.grad.zero_()

        # forward pass
        t_p = model(t_u, *params)

        # backward pass
        loss = loss_fn(t_p, t_c)
        loss.backward()

        # update params
        with torch.no_grad():
            params -= learning_rate * params.grad

        # logging
        if epoch % 500 == 0:
            print('Epoch %d, Loss %f' % (epoch, float(loss)))

    return params

PyTorch’s optimizers

There are several optimization strategies and tricks that can assist convergence, especially when models get complicated. The torch module has an optim submodule where we can find classes implementing different optimization algorithms.

import torch.optim as optim

dir(optim)

['ASGD',
 'Adadelta',
 'Adagrad',
 'Adam',
 'AdamW',
 'Adamax',
 'LBFGS',
 'Optimizer',
 'RMSprop',
 'Rprop',
 'SGD',
 'SparseAdam',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 'lr_scheduler']

Every optimizer constructor takes a list of parameters (PyTorch tensors, typically with requires_grad set to True) as its first input. All parameters passed to the optimizer are retained inside the optimizer object, so the optimizer can update their values and access their grad attribute.

Each optimizer exposes two methods:

zero_grad: zeroes the grad attribute of all the parameters passed to the optimizer upon construction.
step: updates the value of those parameters according to the optimization strategy implemented by the specific optimizer.
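For plain SGD, step should perform the same update as the hand-written rule params -= learning_rate * params.grad. The sketch below checks this on an arbitrary toy loss (an assumption for illustration). Note that in recent PyTorch versions zero_grad may reset grad to None instead of filling it with zeros, so the check at the end accepts both.

```python
import torch
import torch.optim as optim

params = torch.tensor([1.0, 0.0], requires_grad=True)
optimizer = optim.SGD([params], lr=0.1)

loss = (params ** 2).sum() + params[1]   # arbitrary toy loss
optimizer.zero_grad()
loss.backward()

# What the manual update rule would produce
expected = (params - 0.1 * params.grad).detach().clone()

optimizer.step()                         # the optimizer updates params in place

optimizer.zero_grad()
# Depending on the PyTorch version, grad is now None or a tensor of zeros
grad_cleared = params.grad is None or bool((params.grad == 0).all())
print(params, expected, grad_cleared)
```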

Let’s apply an optimizer to our training loop:

# initialize parameters
params = torch.tensor([1.0, 0.0], requires_grad=True)

# choose learning rate and optimizer
learning_rate = 1e-2
optimizer = optim.SGD([params], lr=learning_rate)

def training_loop(n_epochs, optimizer, params, t_u, t_c):
    for epoch in range(1, n_epochs+1):
        t_p = model(t_u, *params)
        loss = loss_fn(t_p, t_c)

        # zero_grad before backward!
        optimizer.zero_grad()

        loss.backward()

        # update params
        optimizer.step()

        if epoch % 500 == 0:
            print(f'Epoch: {epoch}, loss: {float(loss)}')

    return params

Training, validation, and overfitting

A highly adaptable model will tend to use its many parameters to make sure the loss is minimal at the data points, but we’ll have no guarantee that the model behaves well away from or in between the data points. 🤪

Overfitting: evaluating the loss at independent data points yields a higher-than-expected loss.

To overcome overfitting,

we must take a few data points out of our dataset (the validation set) and only fit our model on the remaining data points (the training set).
while we’re fitting the model, we can evaluate the loss once on the training set and once on the validation set.
When we’re trying to decide if we’ve done a good job of fitting our model to the data, we must look at both!

Evaluating the training loss

If the training loss is not decreasing, there may be two possibilities:

the model is too simple for the data
our data just doesn’t contain meaningful information that lets it explain the output

Generalizing to the validation set

If the training loss and the validation loss diverge, we’re overfitting. Overfitting really looks like a problem of making sure the behavior of the model in between data points is sensible for the process we’re trying to approximate.

How to avoid overfitting?

Make sure we get enough data for the process
Make our model simple

A simpler model may not fit the training data as perfectly as a more complicated model would, but it will likely behave more regularly in between data points.
Make sure the model that is capable of fitting the training data is as regular as possible in between the data points.
- Add penalization terms to the loss function, to make it cheaper for the model to behave more smoothly and change more slowly (up to a point)
- Add noise to the input samples, to artificially create new data points in between training data samples and force the model to try to fit those, too.
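Both ideas can be sketched in a few lines. The L2 penalty below is one common choice of penalization term, picked here as an assumption since the notes do not name a specific one; l2_lambda and noise_std are hypothetical hyperparameters.

```python
import torch

def loss_fn(t_p, t_c):
    return ((t_p - t_c) ** 2).mean()

def penalized_loss(t_p, t_c, params, l2_lambda):
    # l2_lambda is a hypothetical hyperparameter weighting the penalty
    l2_penalty = (params ** 2).sum()
    return loss_fn(t_p, t_c) + l2_lambda * l2_penalty

def add_input_noise(t_u, noise_std):
    # noise_std is a hypothetical hyperparameter: the std of the Gaussian noise
    return t_u + noise_std * torch.randn_like(t_u)

params = torch.tensor([3.0, 4.0])
t_p = torch.tensor([1.0, 2.0])
t_c = torch.tensor([1.0, 2.0])

# The fit is perfect here, so only the penalty term contributes to the loss
loss = penalized_loss(t_p, t_c, params, l2_lambda=0.01)
noisy = add_input_noise(t_p, noise_std=0.1)
print(loss, noisy)
```

The penalty makes large parameter values expensive, which nudges the optimizer toward smoother solutions.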

We’ve got some nice trade-offs:

we need the model to have enough capacity for it to fit the training set.
we need the model to avoid overfitting

Therefore, in order to choose the right size for a neural network model in terms of parameters, the process is based on two steps:

increase the size until it fits,
then scale it down until it stops overfitting.

Splitting a dataset

Use PyTorch’s randperm function

Shuffling the elements of a tensor amounts to finding a permutation of its indices, which is exactly what the randperm function returns.

n_samples = t_u.shape[0]
n_val = int(0.2 * n_samples)

shuffled_indices = torch.randperm(n_samples)

train_indices = shuffled_indices[:-n_val]
val_indices = shuffled_indices[-n_val:]

# training set
train_t_u = t_u[train_indices]
train_t_c = t_c[train_indices]

# validation set
val_t_u = t_u[val_indices]
val_t_c = t_c[val_indices]

Our training loop doesn’t really change. We just want to additionally evaluate the validation loss at every epoch, to have a chance to recognize whether we’re overfitting:

def training_loop(n_epochs, optimizer, params, train_t_u, val_t_u,
                  train_t_c, val_t_c):
    for epoch in range(1, n_epochs + 1):
        train_t_p = model(train_t_u, *params)

        train_loss = loss_fn(train_t_p, train_t_c)

        val_t_p = model(val_t_u, *params)
        val_loss = loss_fn(val_t_p, val_t_c)

        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()

        if epoch <= 3 or epoch % 500 == 0:
            print(f"Epoch {epoch}, Training loss {train_loss.item():.4f},"
                  f" Validation loss {val_loss.item():.4f}")

    return params

Observing the training

Our main goal: both the training loss and the validation loss decreasing. While ideally both losses would be roughly the same value, as long as the validation loss stays reasonably close to the training loss, we know that our model is continuing to learn generalized things about our data.

Switching `autograd` off for validation

We only ever call backward on train_loss and errors will only ever backpropagate based on the training set. The validation set is used to provide an independent evaluation of the accuracy of the model’s output on data that wasn’t used for training.

Since we’re not ever calling backward on val_loss, we could in fact just call model and loss_fn as plain functions, without tracking the computation. PyTorch allows us to switch off autograd when we don’t need it, using the torch.no_grad context manager.

def training_loop(n_epochs, optimizer, params, train_t_u, val_t_u,
                  train_t_c, val_t_c):
    for epoch in range(1, n_epochs + 1):
        train_t_p = model(train_t_u, *params)

        train_loss = loss_fn(train_t_p, train_t_c)

        with torch.no_grad():
            val_t_p = model(val_t_u, *params)
            val_loss = loss_fn(val_t_p, val_t_c)

            # Checks that our output requires_grad args are
            # forced to False inside this block
            assert not val_loss.requires_grad

        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()

        if epoch <= 3 or epoch % 500 == 0:
            print(f"Epoch {epoch}, Training loss {train_loss.item():.4f},"
                  f" Validation loss {val_loss.item():.4f}")

    return params

Run with `autograd` enabled or disabled

Using the related set_grad_enabled context, we can also condition the code to run with autograd enabled or disabled, according to a Boolean expression—typically indicating whether we are running in training or inference mode.

For instance, we could define a calc_forward function that takes data as input and runs model and loss_fn with or without autograd according to a Boolean is_train argument:

def calc_forward(t_u, t_c, is_train):
    with torch.set_grad_enabled(is_train):
        t_p = model(t_u, *params)
        loss = loss_fn(t_p, t_c)

    return loss
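A quick way to confirm the effect of the is_train flag is to inspect requires_grad on the returned loss. The following self-contained sketch inlines the linear model and uses made-up data.

```python
import torch

params = torch.tensor([1.0, 0.0], requires_grad=True)
t_u = torch.tensor([1.0, 2.0])   # made-up data
t_c = torch.tensor([3.0, 5.0])

def calc_forward(t_u, t_c, is_train):
    with torch.set_grad_enabled(is_train):
        t_p = params[0] * t_u + params[1]
        loss = ((t_p - t_c) ** 2).mean()
    return loss

train_loss = calc_forward(t_u, t_c, is_train=True)
val_loss = calc_forward(t_u, t_c, is_train=False)

# Same value, but only the training loss carries an autograd graph
print(train_loss.requires_grad, val_loss.requires_grad)
```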

Using Neural Network to Fit Data

Mon, 26 Oct 2020 00:00:00 +0000

Artificial neurons

At the core of deep learning are neural networks: mathematical entities capable of representing complicated functions through a composition of simpler functions.

The basic building block of these complicated functions is the neuron

At its core, it is nothing but a linear transformation of the input (for example, multiplying the input by a number [the weight] and adding a constant [the bias]) followed by the application of a fixed nonlinear function (referred to as the activation function).
Mathematically, we can write this out as o = f(w * x + b), with x as our input, w our weight or scaling factor, and b as our bias or offset. f is our activation function, set to the hyperbolic tangent, or tanh function here.

Composing a multilayer network

A multilayer neural network is made up of a composition of functions like those we just discussed:

x_1 = f(w_0 * x + b_0)
x_2 = f(w_1 * x_1 + b_1)
...
y = f(w_n * x_n + b_n)

where the output of a layer of neurons is used as an input for the following layer.
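The composition can be written out by hand. This is a minimal sketch with scalar weights, tanh as the activation f, and arbitrary values for the weights and biases (all assumptions for illustration).

```python
import torch

def layer(x, w, b, f=torch.tanh):
    # one layer of neurons: a linear transformation followed by an activation
    return f(w * x + b)

x = torch.tensor([-2.0, 0.0, 2.0])

# Arbitrary weights and biases, chosen only for illustration
w_0, b_0 = 0.5, 0.0
w_1, b_1 = 2.0, -1.0

x_1 = layer(x, w_0, b_0)    # output of the first layer ...
y = layer(x_1, w_1, b_1)    # ... becomes the input of the second
print(y)
```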

The error function

Neural networks do not have the property of a convex error surface.
There’s no single right answer for each parameter we’re attempting to approximate. Instead, we are trying to get all of the parameters, when acting in concert, to produce a useful output.
Since that useful output is only going to approximate the truth, there will be some level of imperfection. Where and how imperfections manifest is somewhat arbitrary, and by implication the parameters that control the output (and, hence, the imperfections) are somewhat arbitrary as well. 🤪

Activation functions

The simplest unit in (deep) neural networks is a linear operation (scaling + offset) followed by an activation function. The activation function plays two important roles:

In the inner parts of the model, it allows the output function to have different slopes at different values—something a linear function by definition cannot do. By trickily composing these differently sloped parts for many outputs, neural networks can approximate arbitrary functions
At the last layer of the network, it has the role of concentrating the outputs of the preceding linear operation into a given range.
- Capping the output range
- Compressing the output range

Some activation functions:

ReLU (Rectified Linear Unit) is currently considered one of the best-performing general activation functions. The LeakyReLU function modifies the standard ReLU to have a small positive slope, rather than being strictly zero for negative inputs (typically this slope is 0.01, but it’s shown here with slope 0.1 for clarity).
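The difference between these functions shows up on negative inputs. A small sketch comparing ReLU, LeakyReLU with slope 0.1, and tanh on a handful of made-up values:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

relu_out = torch.relu(x)                          # zero for negative inputs
leaky_out = F.leaky_relu(x, negative_slope=0.1)   # small slope for negative inputs
tanh_out = torch.tanh(x)                          # squashes values into (-1, 1)

print(relu_out)
print(leaky_out)
print(tanh_out)
```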

Choosing the best activation function

By definition, activation functions are

nonlinear: The nonlinearity allows the overall network to approximate more complex functions.
differentiable: so that gradients can be computed through them.

The following are true for the functions:

They have at least one sensitive range, where nontrivial changes to the input result in a corresponding nontrivial change to the output. This is needed for training.
Many of them have an insensitive (or saturated) range, where changes to the input result in little or no change to the output.

Often (but far from universally so), the activation function will have at least one of these:

A lower bound that is approached (or met) as the input goes to negative infinity
A similar-but-inverse upper bound for positive infinity

🤔 What learning means for a neural network

Building models out of stacks of linear transformations followed by differentiable activations leads to models that can approximate highly nonlinear processes and whose parameters we can estimate surprisingly well through gradient descent, even when dealing with models with millions of parameters. What makes using deep neural networks so attractive is that it saves us from worrying too much about the exact function that represents our data. With a deep neural network model, we have a universal approximator and a method to estimate its parameters. 👏

Training consists of finding acceptable values for these weights and biases so that the resulting network correctly carries out a task. By carrying out a task successfully, we mean obtaining a correct output on unseen data produced by the same data-generating process used for training data. A successfully trained network, through the values of its weights and biases, will capture the inherent structure of the data in the form of meaningful numerical representations that work correctly for previously unseen data.

Deep neural networks give us the ability to approximate highly nonlinear phenomena without having an explicit model for them. Instead, starting from a generic, untrained model, we specialize it on a task by providing it with a set of inputs and outputs and a loss function from which to backpropagate. Specializing a generic model to a task using examples is what we refer to as learning, because the model wasn’t built with that specific task in mind—no rules describing how that task worked were encoded in the model.

The PyTorch `nn` module

torch.nn

submodule dedicated to neural networks
contains the building blocks needed to create all sorts of neural network architectures. Those building blocks are called modules in PyTorch parlance (such building blocks are often referred to as layers in other frameworks).

A module

can have one or more Parameter instances as attributes, which are tensors whose values are optimized during the training process
can also have one or more submodules (instances of nn.Module subclasses) as attributes, and it will be able to track their parameters as well.
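As an illustration, here is a hypothetical module (ScaledLinear is a made-up name, not part of PyTorch) that owns one Parameter directly and one submodule. Both sets of parameters are picked up by named_parameters.

```python
import torch
import torch.nn as nn

class ScaledLinear(nn.Module):   # hypothetical module, for illustration only
    def __init__(self, n_in, n_out):
        super().__init__()
        self.linear = nn.Linear(n_in, n_out)       # submodule, registered automatically
        self.scale = nn.Parameter(torch.ones(1))   # Parameter owned directly by this module

    def forward(self, x):
        return self.scale * self.linear(x)

module = ScaledLinear(1, 3)
names = [name for name, _ in module.named_parameters()]
print(names)
```

Assigning an nn.Parameter or an nn.Module to an attribute is enough for registration; a plain tensor attribute would not be tracked.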

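Using `__call__` rather than `forward`

All PyTorch-provided subclasses of nn.Module have their __call__ method defined. This allows us to instantiate an nn.Linear and call it as if it were a function.
From user code, we should not call forward directly: __call__ also runs the hooks registered on the module before and after forward, and calling forward directly skips them.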

y = model(x) # correct
y = model.forward(x) # Don't do it!
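One observable difference between the two call styles is hook execution. In the sketch below, a forward hook (record_call is a made-up helper) fires when the module is called, but not when forward is invoked directly.

```python
import torch
import torch.nn as nn

linear = nn.Linear(1, 1)
calls = []

def record_call(module, inputs, output):
    # forward hooks receive the module, its positional inputs and its output
    calls.append(output.shape)

handle = linear.register_forward_hook(record_call)
x = torch.ones(4, 1)

linear(x)           # goes through __call__, so the hook runs
linear.forward(x)   # bypasses __call__, so the hook is silently skipped
handle.remove()

print(len(calls))
```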

Dealing with batches

PyTorch nn.Module and its subclasses are designed to operate on multiple samples at the same time.

Modules expect the zeroth dimension of the input to be the number of samples in the batch.
E.g., we can create an input tensor of size B × Nin, where
- B: the size of the batch
- Nin: the number of input features

The reason we want to do this batching is multifaceted:

Make sure the computation we’re asking for is big enough to saturate the computing resources we’re using to perform the computation
- GPUs in particular are highly parallelized, so a single input on a small model will leave most of the computing units idle. By providing batches of inputs, the calculation can be spread across the otherwise-idle units, which means the batched results come back just as quickly as a single result would.
Some advanced models use statistical information from the entire batch, and those statistics get better with larger batch sizes.
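In practice this means a 1D tensor of scalar samples has to be reshaped to B × Nin before it is passed to a module. A minimal sketch with made-up values:

```python
import torch
import torch.nn as nn

linear = nn.Linear(1, 1)   # Nin = 1 input feature, 1 output feature

t_u = torch.tensor([10.0, 20.0, 30.0])   # three made-up scalar samples, shape [3]
batch = t_u.unsqueeze(1)                 # shape [3, 1]: B = 3 samples, Nin = 1

out = linear(batch)                      # one output row per sample
print(batch.shape, out.shape)
```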

Loss functions

Loss functions in nn are still subclasses of nn.Module, so we will create an instance and call it as a function.
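For example, nn.MSELoss can be instantiated and called like the hand-written loss_fn from earlier, and with its default settings it should return the same value. The tensors below are made up for the comparison.

```python
import torch
import torch.nn as nn

t_p = torch.tensor([[1.0], [2.0], [4.0]])   # made-up predictions
t_c = torch.tensor([[1.5], [2.0], [3.0]])   # made-up targets

loss_module = nn.MSELoss()                  # create an instance ...
module_loss = loss_module(t_p, t_c)         # ... and call it as a function

manual_loss = ((t_p - t_c) ** 2).mean()     # the hand-written equivalent
print(module_loss, manual_loss)
```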

Our training loop looks like this:

def training_loop(n_epochs, optimizer, model, loss_fn, t_u_train, t_u_val,
                  t_c_train, t_c_val):
    for epoch in range(1, n_epochs+1):

        # forward pass in training set
        t_p_train = model(t_u_train)
        loss_train = loss_fn(t_p_train, t_c_train)

        # forward pass in validation set
        with torch.no_grad():
            t_p_val = model(t_u_val)
            loss_val = loss_fn(t_p_val, t_c_val)

        optimizer.zero_grad()
        loss_train.backward()
        optimizer.step()

        if epoch == 1 or epoch % 1000 == 0:
            print(f"Epoch {epoch}, Training loss {loss_train.item():.4f},"
                  f" Validation loss {loss_val.item():.4f}")

We want to use mean squared error (MSE) as our loss function:

linear_model = nn.Linear(1, 1)
optimizer = optim.SGD(linear_model.parameters(), lr=1e-2)

training_loop(n_epochs=3000, optimizer=optimizer, model=linear_model,
 loss_fn=nn.MSELoss(), t_u_train=t_un_train, t_u_val=t_un_val,
 t_c_train=t_c_train, t_c_val=t_c_val)

Building neural networks using PyTorch

`nn.Sequential` container

nn provides a simple way to concatenate modules through the nn.Sequential container. For example, let’s build the simplest possible neural network: a linear module, followed by an activation function, feeding into another linear module.

seq_model = nn.Sequential(nn.Linear(1, 13), # 1 input feature to 13 hidden features
 nn.Tanh(), # pass them through a tanh activation
 nn.Linear(13, 1)) # linearly combine the resulting 13 numbers into 1 output feature

seq_model

Sequential(
 (0): Linear(in_features=1, out_features=13, bias=True)
 (1): Tanh()
 (2): Linear(in_features=13, out_features=1, bias=True)
)

Inspecting parameters

Calling model.parameters() will collect weight and bias from both the first and second linear modules. It’s instructive to inspect the parameters in this case by printing their shapes:

[param.shape for param in seq_model.parameters()]

[torch.Size([13, 1]), torch.Size([13]), torch.Size([1, 13]), torch.Size([1])]

We can also use named_parameters to identify parameters by name:

for name, param in seq_model.named_parameters():
 print(f"{name}: {param.shape}")

0.weight: torch.Size([13, 1])
0.bias: torch.Size([13])
2.weight: torch.Size([1, 13])
2.bias: torch.Size([1])

Sequential also accepts an OrderedDict, in which we can name each module passed to Sequential:

from collections import OrderedDict

seq_model = nn.Sequential(OrderedDict([('hidden_linear', nn.Linear(1, 8)),
 ('hidden_activation', nn.Tanh()),
 ('output_linear', nn.Linear(8, 1))]))

seq_model

Sequential(
 (hidden_linear): Linear(in_features=1, out_features=8, bias=True)
 (hidden_activation): Tanh()
 (output_linear): Linear(in_features=8, out_features=1, bias=True)
)

for name, param in seq_model.named_parameters():
 print(f"{name}: {param.shape}")

hidden_linear.weight: torch.Size([8, 1])
hidden_linear.bias: torch.Size([8])
output_linear.weight: torch.Size([1, 8])
output_linear.bias: torch.Size([1])

We can also access a particular Parameter by using submodules as attributes:

seq_model.output_linear.bias

Parameter containing:
tensor([-0.0328], requires_grad=True)

Learning from Images

Mon, 26 Oct 2020 00:00:00 +0000

Dataset of images

torchvision module:

automatically download the dataset
load it as a collection of PyTorch tensors

For example, download CIFAR-10 dataset:

from torchvision import datasets

data_path = '../data-unversioned/p1ch7/' # root directory

# Instantiates a dataset for the training data; 
# TorchVision downloads the data if it is not present.
cifar10 = datasets.CIFAR10(data_path, train=True, download=True)

# With train=False, this gets us a dataset for the validation data
cifar10_val = datasets.CIFAR10(data_path, train=False, download=True)

datasets submodule:

gives us precanned access to the most popular computer vision datasets, such as MNIST, Fashion-MNIST, CIFAR-100, SVHN, Coco, and Omniglot.
In each case, the dataset is returned as a subclass of torch.utils.data.Dataset.

`Dataset` class

torch.utils.data.Dataset:

Concept: does NOT necessarily hold the data, but provides uniform access to it through __len__ and __getitem__
is an object that is required to implement two methods:
- __len__: returns the number of items in the dataset
- __getitem__: returns the item, consisting of a sample and its corresponding label (an integer index)
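To make the interface concrete, here is a toy Dataset (ParityDataset is a made-up example) that does not store any data: samples are generated on demand, and the label is an integer class index.

```python
import torch
from torch.utils.data import Dataset

class ParityDataset(Dataset):   # hypothetical toy dataset
    """Yields (sample, label) pairs where the label is 1 for odd indices."""

    def __init__(self, n_items):
        self.n_items = n_items

    def __len__(self):
        return self.n_items

    def __getitem__(self, index):
        if not 0 <= index < self.n_items:
            raise IndexError(index)
        sample = torch.tensor([float(index)])   # created on access, not stored
        label = index % 2                       # integer class index
        return sample, label

dataset = ParityDataset(5)
sample, label = dataset[3]
print(len(dataset), sample, label)
```

Because only __len__ and __getitem__ are required, the same interface works whether the data lives in memory, on disk, or is computed on the fly.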

Dataset transformations

torchvision.transforms

defines a set of composable, function-like objects that can be passed as an argument to a torchvision dataset

from torchvision import transforms

dir(transforms)

['CenterCrop',
 'ColorJitter',
 'Compose',
 'ConvertImageDtype',
 'FiveCrop',
 'Grayscale',
 'Lambda',
 'LinearTransformation',
 'Normalize',
 'PILToTensor',
 'Pad',
 'RandomAffine',
 'RandomApply',
 'RandomChoice',
 'RandomCrop',
 'RandomErasing',
 'RandomGrayscale',
 'RandomHorizontalFlip',
 'RandomOrder',
 'RandomPerspective',
 'RandomResizedCrop',
 'RandomRotation',
 'RandomSizedCrop',
 'RandomVerticalFlip',
 'Resize',
 'Scale',
 'TenCrop',
 'ToPILImage',
 'ToTensor',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 'functional',
 'functional_pil',
 'functional_tensor',
 'transforms']

perform transformations on the data after it is loaded but before it is returned by __getitem__.

`ToTensor`

turns NumPy arrays and PIL images to tensors.
also takes care to lay out the dimensions of the output tensor as C × H × W
Once instantiated, it can be called like a function with the PIL image as the argument, returning a tensor as output
```
from torchvision import transforms

to_tensor = transforms.ToTensor()
img_t = to_tensor(img)
```

We can pass the transform directly as an argument to the dataset constructor:

tensor_cifar10 = datasets.CIFAR10(data_path, train=True, download=False,
 transform=transforms.ToTensor())

At this point, accessing an element of the dataset will return a tensor, rather than a PIL image
```
img_t, _ = tensor_cifar10[99]
type(img_t)
```
```
torch.Tensor
```
Whereas the values in the original PIL image ranged from 0 to 255 (8 bits per channel), the ToTensor transform turns the data into a 32-bit floating-point value per channel, scaling the values down to the range from 0.0 to 1.0.
```
img_t.min(), img_t.max()
```
```
(tensor(0.), tensor(1.))
```

Normalizing data

We can chain transforms using transforms.Compose, and they can handle normalization and data augmentation transparently, directly in the data loader. It’s good practice to normalize the dataset so that each channel has zero mean and unit standard deviation. Also, normalizing each channel so that it has the same distribution will ensure that channel information can be mixed and updated through gradient descent using the same learning rate.

transforms.Normalize: compute the mean value and the standard deviation of each channel across the dataset and apply the following transform: v_n[c] = (v[c] - mean[c]) / stdev[c].

Note that the values of mean and stdev must be computed offline in advance (they are not computed by the transform).
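The formula can be reproduced with plain tensor operations. This sketch uses a small random batch as a stand-in for real images (an assumption, to stay self-contained) and stacks images along the first dimension, so the layout is N × C × H × W.

```python
import torch

torch.manual_seed(0)
imgs = torch.rand(8, 3, 4, 4)   # made-up batch: 8 images, C=3, H=W=4, values in [0, 1)

# Per-channel statistics over all images and pixels
mean = imgs.mean(dim=(0, 2, 3))
std = imgs.std(dim=(0, 2, 3))

# v_n[c] = (v[c] - mean[c]) / stdev[c], broadcast over N, H and W
normalized = (imgs - mean[None, :, None, None]) / std[None, :, None, None]

# After normalization each channel has roughly zero mean and unit std
print(normalized.mean(dim=(0, 2, 3)), normalized.std(dim=(0, 2, 3)))
```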

Steps for normalization:

Stack all the tensors returned by the dataset along an extra dimension

imgs = torch.stack([img_t for img_t, _ in tensor_cifar10], dim=3)
imgs.shape

torch.Size([3, 32, 32, 50000])

(Channels x Height x Width x #images)

Compute the mean and standard deviation per channel:

Mean

# Recall that view(3, -1) keeps the three channels and
# merges all the remaining dimensions into one, figuring out the appropriate size.
# Here our 3 × 32 × 32 × 50,000 tensor is viewed as a 3 × 51,200,000 tensor,
# and then the mean is taken over all the elements of each channel.
imgs.view(3, -1).mean(dim=1)

tensor([0.4915, 0.4823, 0.4468])

Standard deviation

imgs.view(3, -1).std(dim=1)

tensor([0.2470, 0.2435, 0.2616])

Initialize the Normalize transform and chain it in transforms.Compose

transforms.Normalize((0.4915, 0.4823, 0.4468), (0.2470, 0.2435, 0.2616))

transform = transforms.Compose([transforms.ToTensor(),
 transforms.Normalize((0.4915, 0.4823, 0.4468),
 (0.247, 0.2435, 0.2616))])

transformed_cifar10 = datasets.CIFAR10(data_path, train=True, download=False,
 transform=transform)

Classifier

Assume that we’ll pick out all the birds and airplanes from our CIFAR-10 dataset and build a neural network that can tell birds and airplanes apart. This is a classification problem.

A fully connected model

An image is just a set of numbers laid out in a spatial configuration. In theory if we just take the image pixels and straighten them into a long 1D vector, we could consider those numbers as input features, which can be illustrated with the following figure:

In our case, each image is 32 x 32 x 3, that’s 3072 input features per sample. Let’s build a simple fully connected neural network:

import torch.nn as nn

n_in = 3072
n_hidden = 512  # an arbitrary choice
n_out = 2  # there are 2 classes: bird and airplane

model = nn.Sequential(nn.Linear(n_in, n_hidden),
                      nn.Tanh(),
                      nn.Linear(n_hidden, n_out))

Output of a classifier

We need to recognize that the output is categorical: it’s either a bird or an airplane.

In the ideal case, the network would output torch.tensor([1.0, 0.0]) for an airplane and torch.tensor([0.0, 1.0]) for a bird. Practically speaking, since our classifier will not be perfect, we can expect the network to output something in between. The key realization in this case is that we can interpret our output as probabilities: the first entry is the probability of “airplane,” and the second is the probability of “bird.”

Casting the problem in terms of probabilities imposes a few extra constraints on the outputs of our network:

Each element of the output must be in the [0.0, 1.0] range (a probability of an outcome cannot be less than 0 or greater than 1).
The elements of the output must add up to 1.0 (we’re certain that one of the two outcomes will occur).

A function that enforces both constraints in a differentiable way is softmax: we take the elements of the vector, compute the elementwise exponential, and divide each element by the sum of exponentials.

In code:

def softmax(x):
 return torch.exp(x) / torch.exp(x).sum()
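A quick check of this function on a made-up vector shows that the outputs sum to 1 and agree with PyTorch's built-in torch.softmax. The hand-written version sums over every element, so it only suits a single vector; for a batch, the sum has to run along the class dimension only, which is what the dim argument controls.

```python
import torch

def softmax(x):
    return torch.exp(x) / torch.exp(x).sum()

x = torch.tensor([1.0, 2.0, 3.0])
probs = softmax(x)

# For a batch of shape B x N, normalize along the class dimension (dim=1)
batch = torch.tensor([[1.0, 2.0, 3.0],
                      [1.0, 2.0, 3.0]])
batch_probs = torch.softmax(batch, dim=1)

print(probs, probs.sum())
print(batch_probs.sum(dim=1))
```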

The nn module makes softmax available as a module, which requires us to specify the dimension along which the softmax function is applied. Now we add a softmax at the end of our model,

model = nn.Sequential(nn.Linear(n_in, n_hidden),
 nn.Tanh(),
 nn.Linear(n_hidden, n_out),
 nn.Softmax(dim=1))

After training, we will be able to get the label as an index by computing the argmax of the output probabilities: that is, the index at which we get the maximum probability. Conveniently, when supplied with a dimension, torch.max returns the maximum element along that dimension as well as the index at which that value occurs.

_, index = torch.max(out, dim=1)
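As a small illustration, here is torch.max applied to made-up output probabilities for a batch of three images. The class_names list is an assumption that follows the airplane-first ordering used above.

```python
import torch

# Made-up output probabilities for three images, columns are (airplane, bird)
out = torch.tensor([[0.9, 0.1],
                    [0.2, 0.8],
                    [0.6, 0.4]])

values, index = torch.max(out, dim=1)   # max probability and its position, per row

class_names = ['airplane', 'bird']
predicted = [class_names[i] for i in index.tolist()]
print(values, index, predicted)
```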

Loss for classifying

We want to penalize misclassifications. What we need to maximize is the probability associated with the correct class, which is referred to as the likelihood. That is, we want a loss function that is

high when the likelihood is low: so low that the alternatives have a higher probability.
low when the likelihood is higher than the alternatives, and we’re not really fixated on driving the probability up to 1.

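A loss function that behaves that way is called negative log likelihood (NLL). It has the expression NLL = -sum(log(out_i[c_i])), where the sum is taken over the N samples and c_i is the correct class for sample i.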

PyTorch has an nn.NLLLoss class.

Gotcha ahead!!!

nn.NLLLoss does NOT take probabilities but rather takes a tensor of log probabilities as input. It then computes the NLL of our model given the batch of data.
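The computation can be reproduced by hand: take the log probability assigned to the correct class of each sample, negate it, and average over the batch (nn.NLLLoss averages by default). The probabilities and labels below are made up for illustration.

```python
import torch
import torch.nn as nn

# Made-up probabilities for two samples, columns are (airplane, bird)
probs = torch.tensor([[0.8, 0.2],
                      [0.3, 0.7]])
labels = torch.tensor([0, 1])        # correct class of each sample

log_probs = torch.log(probs)         # NLLLoss expects log probabilities

# By hand: pick the log probability of the correct class, negate, average
manual_nll = -log_probs[torch.arange(len(labels)), labels].mean()
module_nll = nn.NLLLoss()(log_probs, labels)

print(manual_nll, module_nll)
```

Passing probs instead of log_probs would run without an error but produce a meaningless loss, which is why this is easy to get wrong.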

The workaround is to use nn.LogSoftmax instead of nn.Softmax, which takes care to make the calculation numerically stable.

model = nn.Sequential(nn.Linear(n_in, n_hidden),
 nn.Tanh(),
 nn.Linear(n_hidden, 2),
 nn.LogSoftmax(dim=1))

loss = nn.NLLLoss()

# compute the NLL loss for a single sample:
img, label = cifar2[0] # cifar2 is the modified dataset containing only birds and airplanes
out = model(img.view(-1).unsqueeze(0))

loss(out, torch.tensor([label]))

A more convenient way is to use nn.CrossEntropyLoss, which is equivalent to the combination of nn.LogSoftmax and nn.NLLLoss. This cross entropy can be interpreted as a negative log likelihood of the predicted distribution under the target distribution as an outcome.

In this case, we drop the last nn.LogSoftmax layer from the network and use nn.CrossEntropyLoss as a loss:

model = nn.Sequential(nn.Linear(n_in, n_hidden),
 nn.Tanh(),
 nn.Linear(n_hidden, 2))

loss_fn = nn.CrossEntropyLoss()

The number will be exactly the same as with nn.LogSoftmax and nn.NLLLoss. It’s just more convenient to do it all in one pass, with the only gotcha being that the output of our model will NOT be interpretable as probabilities (or log probabilities). We’ll need to explicitly pass the output through a softmax to obtain those.
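
The equivalence can be checked on a toy example. The following is a minimal sketch, assuming random logits and made-up labels rather than the CIFAR data; it also shows the explicit softmax needed to read the outputs as probabilities.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Made-up raw scores (logits) for 4 samples and 2 classes, plus made-up labels
logits = torch.randn(4, 2)
labels = torch.tensor([0, 1, 1, 0])

# Route 1: LogSoftmax followed by NLLLoss
loss_nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), labels)

# Route 2: CrossEntropyLoss applied directly to the logits
loss_ce = nn.CrossEntropyLoss()(logits, labels)

# To read the logits as probabilities, pass them through a softmax explicitly
probs = torch.softmax(logits, dim=1)

print(loss_nll.item(), loss_ce.item())
print(probs.sum(dim=1))  # each row sums to 1
```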

Training the classifier

Training the classifier is similar to the process we’ve learned before:

import torch
import torch.nn as nn
import torch.optim as optim

# n_in, n_hidden and learning_rate are assumed to be defined beforehand

model = nn.Sequential(nn.Linear(n_in, n_hidden),
                      nn.Tanh(),
                      nn.Linear(n_hidden, 2))

# the model outputs raw scores, so we use nn.CrossEntropyLoss
# (nn.NLLLoss would require a final nn.LogSoftmax layer)
loss_fn = nn.CrossEntropyLoss()

optimizer = optim.SGD(model.parameters(), lr=learning_rate)

n_epochs = 100

for epoch in range(n_epochs):
    for img, label in cifar2:
        # forward
        out = model(img.view(-1).unsqueeze(0))
        loss = loss_fn(out, torch.tensor([label]))

        optimizer.zero_grad()

        # backward
        loss.backward()

        # update
        optimizer.step()

    # prints the loss of the last sample seen in this epoch
    print(f'Epoch: {epoch}, Loss: {float(loss):4.3f}')

Data loader

The torch.utils.data module has a class that helps with shuffling and organizing the data in minibatches: DataLoader. The job of a data loader is to sample minibatches from a dataset, giving us the flexibility to choose from different sampling strategies.

A very common strategy is uniform sampling after shuffling the data at each epoch:

The DataLoader constructor takes a Dataset object as input, along with batch_size and a shuffle Boolean that indicates whether the data needs to be shuffled at the beginning of each epoch:

train_loader = torch.utils.data.DataLoader(cifar2, batch_size=64, shuffle=True) # training set
val_loader = torch.utils.data.DataLoader(cifar2_val, batch_size=64, shuffle=False) # validation set

A DataLoader can be iterated over, so we can use it directly in the inner loop of our new training code:

for epoch in range(n_epochs):
    for imgs, labels in train_loader:
        batch_size = imgs.shape[0]
        outputs = model(imgs.view(batch_size, -1))
        loss = loss_fn(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Due to the shuffling, this now prints the loss for a random batch
    print(f"Epoch: {epoch}, Loss: {float(loss):4.3f}")
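
To see what the loader yields without downloading CIFAR, here is a minimal sketch that uses a TensorDataset of random tensors as a stand-in for cifar2 (the stand-in data and the batch size of 4 are assumptions for the example).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset (an assumption): 10 random "images" of shape 3x32x32 with binary labels
imgs = torch.randn(10, 3, 32, 32)
labels = torch.randint(0, 2, (10,))
dataset = TensorDataset(imgs, labels)

loader = DataLoader(dataset, batch_size=4, shuffle=True)

batch_shapes = []
for batch_imgs, batch_labels in loader:
    # Flatten each image into a vector, as the fully connected model expects
    flat = batch_imgs.view(batch_imgs.shape[0], -1)
    batch_shapes.append(tuple(flat.shape))

print(batch_shapes)
```

Note that the last batch is smaller when the dataset size is not a multiple of batch_size, which is why the training loop reads the batch size from imgs.shape[0] instead of hard-coding it.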

Parameters of the model

PyTorch offers a quick way to determine how many parameters a model has through the parameters() method of nn.Module.

To find out how many elements are in each tensor instance, we can call the numel method. Summing those gives us our total count. Depending on our use case, counting parameters might require us to check whether a parameter has requires_grad set to True, as well. We might want to differentiate the number of trainable parameters from the overall model size.

numel_list = [p.numel() for p in model.parameters() if p.requires_grad]
sum(numel_list), numel_list

(1574402, [1572864, 512, 1024, 2])
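
The counts above are consistent with a model that has 3072 inputs and 512 hidden units. Assuming those sizes (they are inferred from the numbers, not stated in this section), the following sketch reproduces the total and uses named_parameters() to show which tensor each count belongs to.

```python
import torch.nn as nn

# Assumed sizes: 3 * 32 * 32 = 3072 inputs and 512 hidden units
model = nn.Sequential(nn.Linear(3072, 512),
                      nn.Tanh(),
                      nn.Linear(512, 2))

# Name and element count of each parameter tensor
counts = {name: p.numel() for name, p in model.named_parameters()}
total = sum(counts.values())

print(counts)
print(total)
```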

The limits of going fully connected

The model we trained above is like taking every single input value—that is, every single component in our RGB image—and computing a linear combination of it with all the other values for every output feature.

On one hand, we are allowing for the combination of any pixel with every other pixel in the image being potentially relevant for our task.
On the other hand, we aren’t utilizing the relative position of neighboring or faraway pixels, since we are treating the image as one big vector of numbers.

The problem of our fully connected network is: it is NOT translation invariant. The solution to our current set of problems is to change our model to use convolutional layers.

Using Convolution to Generalize

Tue, 27 Oct 2020 00:00:00 +0000

Convolutions

Convolutions deliver locality and translation invariance

If we want to recognize patterns corresponding to objects, we will likely need to look at how nearby pixels are arranged, and we will be less interested in how pixels that are far from each other appear in combination.
- In order to translate this intuition into mathematical form, we could compute the weighted sum of a pixel with its immediate neighbors, rather than with all other pixels in the image.

What convolutions do

Translation invariant: we want these localized patterns to have an effect on the output regardless of their location in the image.

Convolution is defined for a 2D image as the scalar product of a weight matrix, the kernel, with every neighborhood in the input. The following figure illustrates applying a 3x3 kernel on a 2D image:

The weights in the kernel are NOT known in advance, but they are initialized randomly and updated through backpropagation.
The SAME kernel, and thus each weight in the kernel, is reused across the whole image.
- Thinking back to autograd, this means the use of each weight has a history spanning the entire image. Thus, the derivative of the loss with respect to a convolution weight includes contributions from the entire image.

In summary, by switching to convolutions, we get

Local operations on neighborhoods 👏
Translation invariance 👏
Models with a lot fewer parameters 👏
- With a convolution layer, the number of parameters depends on
  - the size of the convolution kernel (3x3, 5x5, and so on)
  - how many convolution filters (or output channels) we decide to use in our model.
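
To make the parameter savings concrete, here is a minimal sketch comparing a 3x3 convolution with a fully connected layer. The layer sizes are assumptions picked for illustration.

```python
import torch.nn as nn

def count_params(module):
    """Total number of elements over all parameters of a module."""
    return sum(p.numel() for p in module.parameters())

# 16 filters of size 3x3 over 3 input channels, plus one bias per filter
conv = nn.Conv2d(3, 16, kernel_size=3)

# A fully connected layer from a flattened 3x32x32 image to 512 features
fc = nn.Linear(3 * 32 * 32, 512)

conv_params = count_params(conv)
fc_params = count_params(fc)
print(conv_params, fc_params)
```

The convolution's parameter count does not depend on the image size at all, only on the kernel size and the numbers of input and output channels.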

Convolutions in PyTorch

The torch.nn module provides convolutions for 1, 2, and 3 dimensions:

nn.Conv1d for time series
nn.Conv2d for images
nn.Conv3d for volumes or videos

For image data, we will use nn.Conv2d. The arguments we provide to nn.Conv2d are

the number of input features/channels (since we’re dealing with multichannel images: that is, more than one value per pixel)
the number of output features
the size of the kernel

It is very common to have kernel sizes that are the same in all directions, so PyTorch has a shortcut for this: whenever kernel_size=3 is specified for a 2D convolution, it means 3 × 3 (provided as a tuple (3, 3) in Python).

For example:

in_ch = 3 # 3 input features per pixel (the RGB channels)
out_ch = 16 # arbitrary number of channels in the output

conv = nn.Conv2d(in_ch, out_ch, kernel_size=3)
conv

Conv2d(3, 16, kernel_size=(3, 3), stride=(1, 1))

In addition, we need to add the zeroth batch dimension with unsqueeze if we want to call the conv module with one input image, since nn.Conv2d expects a B × C × H × W shaped tensor as input:

# cifar2 is a modified cifar10 which contains only airplanes and birds

img, _ = cifar2[0]
output = conv(img.unsqueeze(dim=0))
img.unsqueeze(0).shape, output.shape

(torch.Size([1, 3, 32, 32]), torch.Size([1, 16, 30, 30]))

Padding the boundary

By default, PyTorch will slide the convolution kernel within the input picture, getting width - kernel_width + 1 horizontal and vertical positions. PyTorch gives us the possibility of padding the image by creating ghost pixels around the border that have value zero as far as the convolution is concerned.

conv = nn.Conv2d(3, 1, kernel_size=3, padding=1)
output = conv(img.unsqueeze(0))
img.unsqueeze(0).shape, output.shape

(torch.Size([1, 3, 32, 32]), torch.Size([1, 1, 32, 32]))
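
The two output shapes seen so far follow from a simple size formula. The sketch below uses a hypothetical helper, conv_output_size, which is not part of PyTorch and assumes a dilation of 1, and compares it with what nn.Conv2d actually produces.

```python
import torch
import torch.nn as nn

def conv_output_size(size, kernel_size, padding=0, stride=1):
    """Spatial output size of a convolution along one dimension (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1

x = torch.zeros(1, 3, 32, 32)  # dummy batch with one 32x32 RGB image

no_pad = nn.Conv2d(3, 1, kernel_size=3)(x)
padded = nn.Conv2d(3, 1, kernel_size=3, padding=1)(x)

print(conv_output_size(32, 3), tuple(no_pad.shape))
print(conv_output_size(32, 3, padding=1), tuple(padded.shape))
```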

🤔 Reasons to pad convolutions

Doing so helps us separate the matters of convolution and changing image sizes, so we have one less thing to remember.
When we have more elaborate structures such as skip connections or U-Nets, we want the tensors before and after a few convolutions to be of compatible size so that we can add them or take differences.

Detecting features with convolutions

With deep learning, we let kernels be estimated from data in whatever way the discrimination is most effective. The job of a convolutional neural network is to estimate the kernel of a set of filter banks in successive layers that will transform a multichannel image into another multichannel image, where different channels correspond to different features (such as one channel for the average, another channel for vertical edges, and so on).

The following figure shows how the training automatically learns the kernels:

Pooling

From large to small: downsampling

Max pooling: taking non-overlapping 2 x 2 tiles and taking the maximum over each of them as the new pixel at the reduced scale.

![Screenshot 2020-10-28 21.00.33](https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-10-28 21.00.33.png)

💡 Intuition of max pooling:

The output images from a convolution layer, especially since they are followed by an activation just like any other linear layer, tend to have a high magnitude where certain features corresponding to the estimated kernel are detected (such as vertical lines). By keeping the highest value in the 2 × 2 neighborhood as the downsampled output, we ensure that the features that are found survive the downsampling, at the expense of the weaker responses.

Max pooling is provided by the nn.MaxPool2d module. It takes as input the size of the neighborhood over which to operate the pooling operation. If we wish to downsample our image by half, we’ll want to use a size of 2.

pool = nn.MaxPool2d(2)
output = pool(img.unsqueeze(dim=0))

img.unsqueeze(0).shape, output.shape

(torch.Size([1, 3, 32, 32]), torch.Size([1, 3, 16, 16]))
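
On a tensor small enough to inspect by eye, the tiling becomes visible. This is a minimal sketch with a made-up 4 x 4 input:

```python
import torch
import torch.nn as nn

# A made-up 4x4 single-channel "image" with values 0..15, shaped B x C x H x W
x = torch.arange(16, dtype=torch.float32).view(1, 1, 4, 4)

pool = nn.MaxPool2d(2)
y = pool(x)

print(x[0, 0])
print(y[0, 0])  # maximum of each non-overlapping 2x2 tile
```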

Combining convolutions and downsampling

Combining convolutions and downsampling can help us recognize larger structures

We start by applying a set of 3 × 3 kernels on our 8 × 8 image, obtaining a multichannel output image of the same size.
Then we scale down the output image by half, obtaining a 4 × 4 image, and apply another set of 3 × 3 kernels to it.

The second set of kernels
- operates on a 3 × 3 neighborhood of something that has been scaled down by half, so it effectively maps back to 8 × 8 neighborhoods of the input.
- takes the output of the first set of kernels (features like averages, edges, and so on) and extracts additional features on top of those.

Summing up:

the first set of kernels operates on small neighborhoods on first-order, low-level features,
while the second set of kernels effectively operates on wider neighborhoods, producing features that are compositions of the previous features.

This is a very powerful mechanism that provides convolutional neural networks with the ability to see into very complex scenes 💪

Subclassing `nn.Module`

In order to subclass nn.Module

we need to define a forward function that takes the inputs to the module and returns the output. (This is where we define our module’s computation.)
- With PyTorch, if we use standard torch operations, autograd will take care of the backward pass automatically 👏; and indeed, an nn.Module never comes with a backward.
To use other submodules (premade, like convolutions, or customized), we typically define them in the constructor __init__ and assign them to self for use in the forward function. Before we can do that, we need to call super().__init__().

For example, let’s model the following network:

class Net(nn.Module):

    def __init__(self):
        super().__init__()

        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.act1 = nn.Tanh()
        self.pool1 = nn.MaxPool2d(2)

        self.conv2 = nn.Conv2d(16, 8, kernel_size=3, padding=1)
        self.act2 = nn.Tanh()
        self.pool2 = nn.MaxPool2d(2)

        self.fc1 = nn.Linear(8 * 8 * 8, 32)
        self.act3 = nn.Tanh()
        self.fc2 = nn.Linear(32, 2)

    def forward(self, x):
        out = self.pool1(self.act1(self.conv1(x)))
        out = self.pool2(self.act2(self.conv2(out)))

        # we leave the batch dimension as -1 in the call to view,
        # since in principle we don't know how many samples will be in the batch.
        out = out.view(-1, 8 * 8 * 8)

        out = self.act3(self.fc1(out))
        out = self.fc2(out)

        return out

Keep track of parameters and submodules

Assigning an instance of nn.Module to an attribute in an nn.Module automatically registers the module as a submodule. (We can also call arbitrary methods of an nn.Module subclass.)

Registration allows Net to have access to the parameters of its submodules without further action by the user:

model = Net()

numel_list = [p.numel() for p in model.parameters()]
sum(numel_list), numel_list

(18090, [432, 16, 1152, 8, 16384, 32, 64, 2])

The functional API

Looking back at the implementation of the Net class, it appears a bit of a waste that we are also registering submodules that have no parameters, like nn.Tanh and nn.MaxPool2d. It would be easier to call these directly in the forward function, just as we called view.

PyTorch has functional counterparts for every nn module.

By “functional” here we mean “having no internal state”: in other words, “whose output value is solely and fully determined by the values of its input arguments.”

torch.nn.functional provides many functions that work like the modules we find in nn. Instead of working on the input arguments and stored parameters like the module counterparts, they take inputs and parameters as arguments to the function call. For instance, the functional counterpart of nn.Linear is nn.functional.linear, which is a function that has signature linear(input, weight, bias=None). The weight and bias parameters are arguments to the function.
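
The relation between the two forms can be sketched as follows. The layer sizes and the input are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

layer = nn.Linear(4, 3)   # module: stores weight and bias as state
x = torch.randn(2, 4)     # made-up input batch

out_module = layer(x)

# Functional form: the same parameters are passed explicitly as arguments
out_functional = F.linear(x, layer.weight, layer.bias)

print(torch.allclose(out_module, out_functional))
```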

Let’s switch to the functional counterparts of pooling and activation, since they have no parameters:

import torch.nn.functional as F

class Net(nn.Module):

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 8, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(8 * 8 * 8, 32)
        self.fc2 = nn.Linear(32, 2)

    def forward(self, x):
        out = F.max_pool2d(torch.tanh(self.conv1(x)), 2)
        out = F.max_pool2d(torch.tanh(self.conv2(out)), 2)
        out = out.view(-1, 8 * 8 * 8)
        out = torch.tanh(self.fc1(out))
        out = self.fc2(out)
        return out

Training the convnet

Two nested loops:

- an outer one over the epochs, and
- an inner one over the DataLoader that produces batches from our Dataset. In each iteration of the inner loop, we then have to
  1. Feed the inputs through the model (the forward pass).
  2. Compute the loss (also part of the forward pass).
  3. Zero any old gradients.
  4. Call loss.backward() to compute the gradients of the loss with respect to all parameters (the backward pass).
  5. Have the optimizer take a step toward lower loss.

import datetime

def training_loop(n_epochs, optimizer, model, loss_fn, train_loader):
    for epoch in range(1, n_epochs + 1):
        loss_train = 0.0
        for imgs, labels in train_loader:

            # feeds a batch through our model
            outputs = model(imgs)

            # computes the loss we wish to minimize
            loss = loss_fn(outputs, labels)

            # get rid of the gradients from the last round
            optimizer.zero_grad()

            # perform the backward step
            # (we compute the gradients of all parameters we want the network to learn)
            loss.backward()

            # update the model
            optimizer.step()

            # sum the losses over the epoch
            loss_train += loss.item()  # use .item() to escape the gradients

        if epoch == 1 or epoch % 10 == 0:
            print(f'{datetime.datetime.now()}: '
                  f'Epoch {epoch}, Training loss: {loss_train / len(train_loader)}')

train_loader = torch.utils.data.DataLoader(cifar2, batch_size=64, shuffle=True)
model = Net()
optimizer = optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

training_loop(n_epochs=100, optimizer=optimizer, model=model, loss_fn=loss_fn,
 train_loader=train_loader)

Measuring accuracy

Measure the accuracies on the training set and validation set:

train_loader = torch.utils.data.DataLoader(cifar2, batch_size=64, shuffle=False)
val_loader = torch.utils.data.DataLoader(cifar2_val, batch_size=64, shuffle=False)

def validate(model, train_loader, val_loader):
    for name, loader in [("train", train_loader), ("val", val_loader)]:
        correct, total = 0, 0
        with torch.no_grad():
            for imgs, labels in loader:
                outputs = model(imgs)
                _, predicted = torch.max(outputs, dim=1)
                total += labels.shape[0]
                correct += int((predicted == labels).sum())

        print(f'Accuracy {name} : {(correct / total):.2f}')

validate(model, train_loader, val_loader)

Saving and loading model

Save the model to a file:

# assume that the data_path is already specified
# and we want to save our model with the name "birds_vs_airplanes.pt"

torch.save(model.state_dict(), data_path + 'birds_vs_airplanes.pt')

The birds_vs_airplanes.pt file now contains all the parameters of model: weights and biases for the two convolution modules and the two linear modules. No structure, just the weights.

When we deploy the model in production, we’ll need to keep the model class handy, create an instance, and then load the parameters back into it:

loaded_model = Net()
loaded_model.load_state_dict(torch.load(data_path + 'birds_vs_airplanes.pt'))

<All keys matched successfully>
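
The save and load round trip can be tried without touching the filesystem. The sketch below is only an illustration: it assumes a tiny nn.Linear as a stand-in for Net and an in-memory buffer in place of the .pt file.

```python
import io
import torch
import torch.nn as nn

# A tiny stand-in model (an assumption; any nn.Module works the same way)
model = nn.Linear(4, 2)

# Save only the parameters; an in-memory buffer replaces the file on disk here
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# Loading requires an instance of the same architecture
loaded_model = nn.Linear(4, 2)
result = loaded_model.load_state_dict(torch.load(buffer))

print(result)
same = all(torch.equal(a, b) for a, b in
           zip(model.state_dict().values(), loaded_model.state_dict().values()))
print(same)
```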

Training on GPU

nn.Module implements a .to function that moves all of its parameters to the GPU (or casts the type when you pass a dtype argument). There is a somewhat subtle difference between Module.to and Tensor.to.

Module.to is in place: the module instance is modified.
But Tensor.to is out of place (in some ways a computation, just like Tensor.tanh): it returns a new tensor and leaves the original unchanged.
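
The difference can be observed even without a GPU. This minimal sketch uses a dtype cast instead of a device move, an assumption made so that it runs anywhere; it also assumes the default dtype is float32.

```python
import torch
import torch.nn as nn

# Module.to modifies the module in place and returns the same object
layer = nn.Linear(2, 2)
returned = layer.to(torch.float64)
print(returned is layer, layer.weight.dtype)

# Tensor.to returns a new tensor and leaves the original untouched
t = torch.zeros(2)            # float32 by default
t64 = t.to(torch.float64)
print(t.dtype, t64.dtype)
```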

📝 Good practice:

create the Optimizer after moving the parameters to the appropriate device
move things to the GPU if one is available. A good pattern is to set a variable device depending on torch.cuda.is_available:
```
device = (torch.device('cuda') if torch.cuda.is_available()
 else torch.device('cpu'))
```

Let’s amend the training loop by moving the tensors we get from the data loader to the GPU by using the Tensor.to method.

import datetime

def training_loop(n_epochs, optimizer, model, loss_fn, train_loader):
    for epoch in range(1, n_epochs + 1):
        loss_train = 0.0

        for imgs, labels in train_loader:

            # Move imgs and labels tensors to the device we're training on
            imgs = imgs.to(device=device)
            labels = labels.to(device=device)

            outputs = model(imgs)

            loss = loss_fn(outputs, labels)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            loss_train += loss.item()

        if epoch == 1 or epoch % 10 == 0:
            print(f'{datetime.datetime.now()}: '
                  f'Epoch {epoch}, Training loss: {loss_train / len(train_loader)}')

Now we can instantiate our model, move it to device, and run it:

(Note: If you forget to move either the model or the inputs to the GPU, you will get errors about tensors not being on the same device, because the PyTorch operators do not support mixing GPU and CPU inputs.)

train_loader = torch.utils.data.DataLoader(cifar2, batch_size=64,
 shuffle=True)

# moves our model (all parameters) to the GPU
model = Net().to(device=device)

# Good practice:
# create the Optimizer after moving the parameters to the appropriate device
optimizer = optim.SGD(model.parameters(), lr=1e-2)

loss_fn = nn.CrossEntropyLoss()

training_loop(n_epochs=100, optimizer=optimizer, model=model, loss_fn=loss_fn,
 train_loader=train_loader)

When loading network weights, PyTorch will attempt to load the weight to the same device it was saved from—that is, weights on the GPU will be restored to the GPU. As we don’t know whether we want the same device, we have two options:

we could move the network to the CPU before saving it,
or move it back after restoring.

It is a bit more concise to instruct PyTorch to override the device information when loading weights. This is done by passing the map_location keyword argument to torch.load:

loaded_model.load_state_dict(torch.load(data_path + 'birds_vs_airplanes.pt',
 map_location=device))

Model design

Width: memory capacity

Width of the network: the number of neurons per layer, or channels per convolution.

Making a model wider is very easy in PyTorch: just specify a larger number of output channels, taking care to change the forward function to reflect the fact that we’ll now have a longer vector once we switch to fully connected layers.

For example, we change the number of output channels in the first convolution from 16 to 32:

class NetWidth(nn.Module):

    def __init__(self, n_chans1=32):
        super().__init__()
        self.n_chans1 = n_chans1
        self.conv1 = nn.Conv2d(3, n_chans1, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(n_chans1, n_chans1 // 2, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(8 * 8 * n_chans1 // 2, 32)
        self.fc2 = nn.Linear(32, 2)

    def forward(self, x):
        out = F.max_pool2d(torch.tanh(self.conv1(x)), 2)
        out = F.max_pool2d(torch.tanh(self.conv2(out)), 2)
        out = out.view(-1, 8 * 8 * self.n_chans1 // 2)
        out = torch.tanh(self.fc1(out))
        out = self.fc2(out)
        return out

The greater the capacity, the more variability in the inputs the model will be able to manage; but at the same time, the more likely overfitting will be, since the model can use a greater number of parameters to memorize unessential aspects of the input.

Regularization: helping to converge and generalize

Training a model involves two critical steps:

optimization, when we need the loss to decrease on the training set;
generalization, when the model has to work not only on the training set but also on data it has not seen before, like the validation set.

The mathematical tools aimed at easing these two steps are sometimes subsumed under the label regularization.

Weight penalties

The first way to stabilize generalization is to add a regularization term to the loss.

This term is crafted so that the weights of the model tend to stay small, limiting how much training makes them grow. In other words, it is a penalty on larger weight values.
This makes the loss have a smoother topography, and there’s relatively less to gain from fitting individual samples.

The most popular regularization terms are:

L2 regularization: the sum of squares of all weights in the model
L1 regularization: the sum of the absolute values of all weights in the model

Both of them are scaled by a (small) factor, which is a hyperparameter we set prior to training.

Here we’ll focus on L2 regularization.

L2 regularization is also referred to as weight decay.
Adding L2 regularization to the loss function is equivalent to decreasing each weight by an amount proportional to its current value during the optimization step.
Note that weight decay applies to all parameters of the network, including biases.

In PyTorch, we could implement regularization pretty easily by adding a term to the loss.

import datetime

def training_loop_l2reg(n_epochs, optimizer, model, loss_fn, train_loader):
    for epoch in range(1, n_epochs + 1):
        loss_train = 0.0
        for imgs, labels in train_loader:
            imgs = imgs.to(device)
            labels = labels.to(device)

            outputs = model(imgs)
            loss = loss_fn(outputs, labels)

            # L2 regularization: sum of squares of all parameters
            l2_lambda = 0.001
            l2_norm = sum(p.pow(2.0).sum() for p in model.parameters())
            loss = loss + l2_lambda * l2_norm

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            loss_train += loss.item()

        if epoch == 1 or epoch % 10 == 0:
            print(f'{datetime.datetime.now()}, Epoch: {epoch}, '
                  f'Training loss: {loss_train / len(train_loader)}')

The SGD optimizer in PyTorch already has a weight_decay parameter that corresponds to 2 * lambda, and it directly performs weight decay during the update as described previously.
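
The correspondence between the explicit penalty and weight_decay can be checked numerically for a single SGD step. The following is a sketch on made-up regression data with a tiny linear model; the data, the model and the learning rate are assumptions for illustration.

```python
import copy
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

l2_lambda = 0.001
x, y = torch.randn(8, 4), torch.randn(8, 1)   # made-up regression data

model_a = nn.Linear(4, 1)
model_b = copy.deepcopy(model_a)              # identical starting weights

# Variant A: explicit L2 term added to the loss
opt_a = optim.SGD(model_a.parameters(), lr=0.1)
loss_a = nn.functional.mse_loss(model_a(x), y)
loss_a = loss_a + l2_lambda * sum(p.pow(2.0).sum() for p in model_a.parameters())
opt_a.zero_grad()
loss_a.backward()
opt_a.step()

# Variant B: plain loss, weight_decay = 2 * lambda in the optimizer
opt_b = optim.SGD(model_b.parameters(), lr=0.1, weight_decay=2 * l2_lambda)
loss_b = nn.functional.mse_loss(model_b(x), y)
opt_b.zero_grad()
loss_b.backward()
opt_b.step()

match = all(torch.allclose(pa, pb, atol=1e-6)
            for pa, pb in zip(model_a.parameters(), model_b.parameters()))
print(match)
```

The factor of 2 comes from differentiating the squared weights: the gradient of lambda * w^2 with respect to w is 2 * lambda * w.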

Dropout

💡Idea of dropout: zero out a random fraction of outputs from neurons across the network, where the randomization happens at each training iteration.

This procedure effectively generates slightly different models with different neuron topologies at each iteration, giving neurons in the model less chance to coordinate in the memorization process that happens during overfitting.

In PyTorch, we can implement dropout in a model

by adding an nn.Dropout module between the nonlinear activation function and the linear or convolutional module of the subsequent layer. (As an argument, we need to specify the probability with which inputs will be zeroed out.)
In case of convolutions, we’ll use the specialized nn.Dropout2d or nn.Dropout3d, which zero out entire channels of the input

class NetDropout(nn.Module):
    def __init__(self, n_chans1=32):
        super().__init__()
        self.n_chans1 = n_chans1
        self.conv1 = nn.Conv2d(3, n_chans1, kernel_size=3, padding=1)
        self.conv1_dropout = nn.Dropout2d(p=0.4)
        self.conv2 = nn.Conv2d(n_chans1, n_chans1 // 2, kernel_size=3, padding=1)
        self.conv2_dropout = nn.Dropout2d(p=0.4)
        self.fc1 = nn.Linear(8 * 8 * n_chans1 // 2, 32)
        self.fc2 = nn.Linear(32, 2)

    def forward(self, x):
        out = F.max_pool2d(torch.tanh(self.conv1(x)), 2)
        out = self.conv1_dropout(out)
        out = F.max_pool2d(torch.tanh(self.conv2(out)), 2)
        out = self.conv2_dropout(out)
        out = out.view(-1, 8 * 8 * self.n_chans1 // 2)
        out = torch.tanh(self.fc1(out))
        out = self.fc2(out)
        return out

Note

dropout is normally active during training,
during the evaluation of a trained model in production, dropout is bypassed or, equivalently, assigned a probability equal to zero.
- This is controlled through the training attribute of the Dropout module. Recall that PyTorch lets us switch between the two modalities by calling model.train() or model.eval() on any nn.Module subclass.
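
The two modes can be observed directly on a standalone nn.Dropout module. This is a minimal sketch; the input of ones is an assumption that makes the effect easy to read.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dropout = nn.Dropout(p=0.4)
x = torch.ones(1000)

dropout.train()            # training mode: random zeroing is active
out_train = dropout(x)

dropout.eval()             # evaluation mode: dropout is the identity
out_eval = dropout(x)

print(dropout.training)
print((out_train == 0).float().mean().item())  # fraction zeroed, roughly p
print(torch.equal(out_eval, x))
```

In training mode the surviving values are scaled by 1 / (1 - p), so the expected magnitude of the output stays the same as in evaluation mode.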

Batch normalization

Batch normalization has multiple beneficial effects on training:

it allows us to increase the learning rate
it makes training less dependent on initialization and acts as a regularizer, thus representing an alternative to dropout.

💡 Main idea behind batch normalization: rescale the inputs to the activations of the network so that minibatches have a certain desirable distribution.

In practical terms:

batch normalization shifts and scales an intermediate input using the mean and standard deviation collected at that intermediate location over the samples of the minibatch.
The regularization effect is a result of the fact that an individual sample and its downstream activations are always seen by the model as shifted and scaled, depending on the statistics across the randomly extracted minibatch.
using batch normalization eliminates or at least alleviates the need for dropout.

In PyTorch

Batch normalization is provided through the nn.BatchNorm1d, nn.BatchNorm2d, and nn.BatchNorm3d modules, depending on the dimensionality of the input.
Since batch normalization aims to rescale the inputs of the activations, the natural location is after the linear transformation (convolution, in this case) and before the activation:

class NetBatchNorm(nn.Module):
    def __init__(self, n_chans1=32):
        super().__init__()
        self.n_chans1 = n_chans1
        self.conv1 = nn.Conv2d(3, n_chans1, kernel_size=3, padding=1)
        self.conv1_batchnorm = nn.BatchNorm2d(num_features=n_chans1)
        self.conv2 = nn.Conv2d(n_chans1, n_chans1 // 2, kernel_size=3, padding=1)
        self.conv2_batchnorm = nn.BatchNorm2d(num_features=n_chans1 // 2)
        self.fc1 = nn.Linear(8 * 8 * n_chans1 // 2, 32)
        self.fc2 = nn.Linear(32, 2)

    def forward(self, x):
        out = self.conv1_batchnorm(self.conv1(x))
        out = F.max_pool2d(torch.tanh(out), 2)
        out = self.conv2_batchnorm(self.conv2(out))
        out = F.max_pool2d(torch.tanh(out), 2)
        out = out.view(-1, 8 * 8 * self.n_chans1 // 2)
        out = torch.tanh(self.fc1(out))
        out = self.fc2(out)
        return out

Note:

Just as for dropout, batch normalization needs to behave differently during training and inference.

As minibatches are processed, in addition to estimating the mean and standard deviation for the current minibatch, PyTorch also updates the running estimates for mean and standard deviation that are representative of the whole dataset, as an approximation.
This way, when the user specifies model.eval() and the model contains a batch normalization module, the running estimates are frozen and used for normalization. To unfreeze running estimates and return to using the minibatch statistics, we call model.train(), just as we did for dropout.
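
The following sketch illustrates both modes on a standalone nn.BatchNorm2d module. The input is a made-up minibatch whose mean is deliberately far from zero.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm2d(num_features=3)
x = torch.randn(16, 3, 8, 8) * 2.0 + 5.0   # made-up minibatch, mean about 5

bn.train()
out_train = bn(x)                           # uses minibatch statistics
mean_after_train = bn.running_mean.clone()  # running estimate was updated

bn.eval()
out_eval = bn(x)                            # uses the frozen running estimates
mean_after_eval = bn.running_mean.clone()

print(out_train.mean(dim=(0, 2, 3)))        # close to 0 per channel
print(mean_after_train)
print(torch.equal(mean_after_train, mean_after_eval))
```

After a single training step the running mean has moved only part of the way toward the batch mean, so the evaluation output is not yet centered; the estimates become representative only after many minibatches.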

Depth: going deeper to learn more complex structures

The second fundamental dimension to make a model larger and more capable is depth.

With depth, the complexity of the function the network is able to approximate generally increases.
Depth allows a model to deal with hierarchical information when we need to understand the context in order to say something about some input.

Another way to think about depth: increasing depth is related to increasing the length of the sequence of operations that the network will be able to perform when processing input.

Skip connections

Adding depth to a model generally makes training harder to converge. The bottom line is that a long chain of multiplications will tend to make the contribution of the parameter to the gradient vanish, leading to ineffective training of that layer since that parameter and others like it won’t be properly updated.

Residual networks use a simple trick to allow very deep networks to be successfully trained: using a skip connection to short-circuit blocks of layers.

☝️ A skip connection is nothing but the addition of the input to the output of a block of layers.

Let’s add one layer to our simple convolutional model, and let’s use ReLU as the activation for a change. The vanilla module with an extra layer looks like this:

class NetDepth(nn.Module):
    def __init__(self, n_chans1=32):
        super().__init__()
        self.n_chans1 = n_chans1
        self.conv1 = nn.Conv2d(3, n_chans1, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(n_chans1, n_chans1 // 2,
                               kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(n_chans1 // 2, n_chans1 // 2,
                               kernel_size=3, padding=1)
        self.fc1 = nn.Linear(4 * 4 * n_chans1 // 2, 32)
        self.fc2 = nn.Linear(32, 2)

    def forward(self, x):
        out = F.max_pool2d(torch.relu(self.conv1(x)), 2)
        out = F.max_pool2d(torch.relu(self.conv2(out)), 2)
        out = F.max_pool2d(torch.relu(self.conv3(out)), 2)
        out = out.view(-1, 4 * 4 * self.n_chans1 // 2)
        out = torch.relu(self.fc1(out))
        out = self.fc2(out)
        return out

Adding a skip connection a la ResNet to this model amounts to adding the input of the third layer (that is, the output of the second layer) to the third layer's output in the forward function:

class NetRes(nn.Module):
    def __init__(self, n_chans1=32):
        super().__init__()
        self.n_chans1 = n_chans1
        self.conv1 = nn.Conv2d(3, n_chans1, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(n_chans1, n_chans1 // 2,
                               kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(n_chans1 // 2, n_chans1 // 2,
                               kernel_size=3, padding=1)
        self.fc1 = nn.Linear(4 * 4 * n_chans1 // 2, 32)
        self.fc2 = nn.Linear(32, 2)

    def forward(self, x):
        out = F.max_pool2d(torch.relu(self.conv1(x)), 2)
        out = F.max_pool2d(torch.relu(self.conv2(out)), 2)
        out1 = out

        # Skip connection: out1 (the input of the third layer) is added
        # to the third layer's activated output before pooling
        out = F.max_pool2d(torch.relu(self.conv3(out)) + out1, 2)

        out = out.view(-1, 4 * 4 * self.n_chans1 // 2)
        out = torch.relu(self.fc1(out))
        out = self.fc2(out)
        return out

In other words, we’re feeding the output of an earlier activation further down the network, in addition to the standard feed-forward path. This is also referred to as identity mapping.

Generally speaking: just arithmetically add earlier intermediate outputs to downstream intermediate outputs.

How does this alleviate the issues with vanishing gradients?

Thinking about backpropagation, a skip connection, or a sequence of skip connections in a deep network, creates a direct path from the deeper parameters to the loss. This makes their contribution to the gradient of the loss more direct, as partial derivatives of the loss with respect to those parameters have a chance not to be multiplied by a long chain of other operations.

It has been observed that skip connections have a beneficial effect on convergence especially in the initial phases of training. Also, the loss landscape of deep residual networks is a lot smoother than feed-forward networks of the same depth and width.👏

Building very deep models in PyTorch

The standard strategy is:

define a building block, such as a (Conv2d, ReLU, Conv2d) + skip connection block
build the network dynamically in a for loop.

We first create a module subclass whose sole job is to provide the computation for one block—that is, one group of convolutions, activation, and skip connection:

class ResBlock(nn.Module):
    def __init__(self, n_chans):
        super(ResBlock, self).__init__()
        # A 3x3 convolution with padding=1 keeps the spatial size, so the
        # output can be added to the input. The bias is dropped because
        # the batch norm that follows would cancel it anyway.
        self.conv = nn.Conv2d(n_chans, n_chans, kernel_size=3,
                              padding=1, bias=False)
        self.batch_norm = nn.BatchNorm2d(num_features=n_chans)

        # Use custom initializations as in the ResNet paper
        torch.nn.init.kaiming_normal_(self.conv.weight, nonlinearity='relu')

        # The batch norm is initialized to produce output distributions
        # that initially have 0 mean and 0.5 standard deviation
        torch.nn.init.constant_(self.batch_norm.weight, 0.5)
        torch.nn.init.zeros_(self.batch_norm.bias)

    def forward(self, x):
        out = self.conv(x)
        out = self.batch_norm(out)
        out = torch.relu(out)
        # Skip connection: add the block's input back to its output
        return out + x
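A quick shape check makes the stacking requirement concrete. The sketch below assumes a block that adds its input back to its output (the skip connection); the batch size of 4 and the 16x16 feature maps are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn


class ResBlock(nn.Module):
    def __init__(self, n_chans):
        super().__init__()
        # Same channel count in and out, 3x3 kernel with padding=1:
        # the output has exactly the shape of the input.
        self.conv = nn.Conv2d(n_chans, n_chans, kernel_size=3,
                              padding=1, bias=False)
        self.batch_norm = nn.BatchNorm2d(num_features=n_chans)
        nn.init.kaiming_normal_(self.conv.weight, nonlinearity='relu')
        nn.init.constant_(self.batch_norm.weight, 0.5)
        nn.init.zeros_(self.batch_norm.bias)

    def forward(self, x):
        out = torch.relu(self.batch_norm(self.conv(x)))
        return out + x  # skip connection


block = ResBlock(n_chans=32)
x = torch.randn(4, 32, 16, 16)   # (batch, channels, height, width)
y = block(x)
print(y.shape)                   # torch.Size([4, 32, 16, 16])
```

Because the block preserves both the channel count and the spatial size, `out + x` is well defined and any number of these blocks can be chained one after another.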

We’d now like to generate a 100-block network.

- First, in __init__, we create an nn.Sequential containing a list of ResBlock instances. nn.Sequential ensures that the output of one block is used as the input to the next. It also ensures that all the parameters in the blocks are visible to NetResDeep.
- Then, in forward, we just call the sequential to traverse the 100 blocks and generate the output.

class NetResDeep(nn.Module):
    def __init__(self, n_chans1=32, n_blocks=10):
        super().__init__()
        self.n_chans1 = n_chans1
        self.conv1 = nn.Conv2d(3, n_chans1, kernel_size=3, padding=1)

        # Create a list of ResBlocks. The comprehension builds one new
        # instance per block (n_blocks * [ResBlock(...)] would repeat a
        # single instance and share its weights across all blocks).
        self.resblocks = nn.Sequential(
            *[ResBlock(n_chans=n_chans1) for _ in range(n_blocks)])

        self.fc1 = nn.Linear(8 * 8 * n_chans1, 32)
        self.fc2 = nn.Linear(32, 2)

    def forward(self, x):
        out = F.max_pool2d(torch.relu(self.conv1(x)), 2)

        # traverse the list of blocks
        out = self.resblocks(out)

        out = F.max_pool2d(out, 2)
        out = out.view(-1, 8 * 8 * self.n_chans1)
        out = torch.relu(self.fc1(out))
        out = self.fc2(out)
        return out
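One detail is worth checking when a stack of blocks is built programmatically: multiplying a Python list repeats the same module object, whereas a list comprehension creates a fresh module each time. The sketch below uses small nn.Linear layers as stand-ins for the blocks; the helper `n_params` and the sizes are hypothetical and only serve the illustration.

```python
import torch.nn as nn

n_blocks = 3

# List multiplication: one Linear object referenced three times.
shared = nn.Sequential(*(n_blocks * [nn.Linear(4, 4)]))

# List comprehension: three separate Linear objects.
independent = nn.Sequential(*[nn.Linear(4, 4) for _ in range(n_blocks)])


def n_params(module):
    # parameters() yields each distinct parameter tensor once
    return sum(p.numel() for p in module.parameters())


print(shared[0] is shared[1])            # True: same module, shared weights
print(independent[0] is independent[1])  # False: separate modules
print(n_params(shared), n_params(independent))  # 20 60
```

With shared weights the network is still "deep" in the sense that the input passes through the layer several times, but it has only one block's worth of trainable parameters. Which behavior is wanted depends on the experiment, so it is worth making the choice explicit.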
