PyTorch Recipe | Haobin Tan

🧾 PyTorch Recipes

Mon, 07 Sep 2020 00:00:00 +0000

This section provides a lot of useful recipes that make use of specific PyTorch features.

🔥 Transfer Learning for Computer Vision

Tue, 03 Nov 2020 00:00:00 +0000

Handling settings for training and validation phases flexibly

💡 Use Python dictionary

Phase ('train' or 'val') as key

For example:

import os

import torch
from torchvision import datasets, transforms

data_transforms = {
    # For training: data augmentation and normalization
    'train': transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),

    # For validation: only normalization
    'val': transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])
}

data_dir = 'hymenoptera_data'

# one ImageFolder dataset per phase, each with its own transform
image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x),
                                          data_transforms[x])
                  for x in ['train', 'val']}

dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=4,
                                              shuffle=True, num_workers=4)
               for x in ['train', 'val']}

dataset_sizes = {x: len(image_datasets[x]) for x in ['train', 'val']}

General function to train a model

Here we will

schedule the learning rate
save the best model

import copy
import time

import torch

# NOTE: relies on the module-level `dataloaders` and `dataset_sizes` defined above,
# and on a `device`, e.g. torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def train_model(model, criterion, optimizer, scheduler, num_epochs=25):
    """
    scheduler is an LR scheduler object from torch.optim.lr_scheduler
    """

    since = time.time()

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(1, num_epochs + 1):
        print(f'Epoch {epoch}')
        print('-' * 10)

        # each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()  # set model to training mode
            else:
                model.eval()  # set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history only in the training phase
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # backward + optimize only in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.shape[0]
                running_corrects += torch.sum(preds == labels.data)

            # step the LR scheduler once per epoch, after the training phase
            if phase == 'train':
                scheduler.step()

            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects.double() / dataset_sizes[phase]

            print(f'{phase} Loss: {epoch_loss:.4f}, Acc: {epoch_acc:.4f}')

            # deep copy the model whenever validation accuracy improves
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        print()

    time_elapsed = time.time() - since

    print(f'Training complete in {time_elapsed // 60:.0f}m {time_elapsed % 60:.0f}s')
    print(f'Best val Acc: {best_acc:.4f}')

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model

Major Transfer Learning scenarios

In practice, very few people train an entire Convolutional Network from scratch (with random initialization), because it is relatively rare to have a dataset of sufficient size. Instead, it is common to pretrain a ConvNet on a very large dataset (e.g. ImageNet, which contains 1.2 million images with 1000 categories), and then use the ConvNet either as an initialization or a fixed feature extractor for the task of interest.

ConvNet as fixed feature extractor

Take a ConvNet pretrained on ImageNet
Remove the last fully-connected layer (this layer’s outputs are the 1000 class scores for a different task like ImageNet)
Treat the rest of the ConvNet as a fixed feature extractor for the new dataset. (We call these features CNN codes.)

Implementation with PyTorch

We freeze the weights of the whole network except those of the final fully connected layer.
This last fully connected layer is replaced with a new one with random weights, and only this layer is trained.

# Load pretrained model
model_conv = torchvision.models.resnet18(pretrained=True)

# Freeze all the network
for param in model_conv.parameters():
    param.requires_grad = False

# Parameters of newly constructed modules have requires_grad=True by default
# in other words, now we freeze all the network except the final layer
num_features = model_conv.fc.in_features
model_conv.fc = nn.Linear(num_features, 2)

model_conv = model_conv.to(device)

criterion = nn.CrossEntropyLoss()

# Observe that only parameters of final layer are being optimized as
# opposed to before.
optimizer_conv = optim.SGD(model_conv.fc.parameters(), lr=0.001, momentum=0.9)

# Decay LR by a factor of 0.1 every 7 epochs
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_conv, step_size=7, gamma=0.1)

Train and evaluate:

model_conv = train_model(model_conv, criterion, optimizer_conv,
 exp_lr_scheduler, num_epochs=25)

Fine-tuning the ConvNet

The second strategy is to not only replace and retrain the classifier on top of the ConvNet on the new dataset, but to also fine-tune the weights of the pretrained network by continuing the backpropagation. It is possible to fine-tune all the layers of the ConvNet, or it’s possible to keep some of the earlier layers fixed (due to overfitting concerns) and only fine-tune some higher-level portion of the network.

Motivation: the earlier layers of a ConvNet contain more generic features (e.g. edge detectors or color blob detectors) that should be useful for many tasks, but later layers of the ConvNet become progressively more specific to the details of the classes contained in the original dataset.
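To make the "keep some of the earlier layers fixed" idea concrete, here is a minimal sketch. The small `backbone` below is a hypothetical stand-in for a pretrained network (it is not ResNet, and nothing is downloaded), and `N_FROZEN` is an assumed choice of how many leading blocks to freeze.

```python
import torch.nn as nn

# Hypothetical stand-in for a pretrained backbone: two "early" blocks,
# one "late" block, and a freshly created classifier head.
backbone = nn.Sequential(
    nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()),    # early: generic features
    nn.Sequential(nn.Conv2d(8, 16, 3, padding=1), nn.ReLU()),   # early
    nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU()),  # late: more task-specific
    nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 2)),  # new head
)

N_FROZEN = 2  # assumption: number of leading blocks to keep fixed

# Freeze the first N_FROZEN blocks; the rest stays trainable
for block in list(backbone.children())[:N_FROZEN]:
    for param in block.parameters():
        param.requires_grad = False

# Only parameters that still require gradients are handed to the optimizer
trainable_params = [p for p in backbone.parameters() if p.requires_grad]
n_trainable = sum(p.numel() for p in trainable_params)
n_total = sum(p.numel() for p in backbone.parameters())
print(f"trainable parameters: {n_trainable} of {n_total}")
```

With a real torchvision model the same loop can be applied to whichever child modules you decide to treat as "early" layers.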

Implementation with PyTorch

Instead of random initialization, we initialize the network with a pretrained network, such as one trained on the ImageNet dataset with 1000 classes.
Rest of the training looks as usual.

# Load a pretrained model
model_ft = models.resnet18(pretrained=True)

# Reset the final fully connected layer according to specific task
num_features = model_ft.fc.in_features
num_classes = 2 # assuming a binary classification task
model_ft.fc = nn.Linear(num_features, num_classes)

model_ft = model_ft.to(device)

criterion = nn.CrossEntropyLoss()

optimizer_ft = optim.SGD(model_ft.parameters(), lr=0.001, momentum=0.9)

# Decay learning rate by a factor of 0.1 every 7 epochs
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=7, gamma=0.1)

Train and evaluate:

model_ft = train_model(model_ft, criterion=criterion, optimizer=optimizer_ft,
                       scheduler=exp_lr_scheduler, num_epochs=25)

When and how to fine-tune?

The two most important factors are:

size of the new dataset (small or big)
its similarity to the original dataset

Keep in mind that ConvNet features are more generic in early layers and more original-dataset-specific in later layers.

Common rules of thumb for navigating the 4 major scenarios:

New dataset is small and similar to original dataset.

Since the data is small, it is not a good idea to fine-tune the ConvNet due to overfitting concerns. Since the data is similar to the original data, we expect higher-level features in the ConvNet to be relevant to this dataset as well. Hence, the best idea might be to train a linear classifier on the CNN codes.
New dataset is large and similar to the original dataset.

Since we have more data, we can have more confidence that we won’t overfit if we were to try to fine-tune through the full network.
New dataset is small but very different from the original dataset.

Since the data is small, it is likely best to only train a linear classifier. Since the dataset is very different, it might not be best to train the classifier from the top of the network, which contains more dataset-specific features. Instead, it might work better to train an SVM classifier on activations from somewhere earlier in the network.
New dataset is large and very different from the original dataset.

Since the dataset is very large, we may expect that we can afford to train a ConvNet from scratch. However, in practice it is very often still beneficial to initialize with weights from a pretrained model. In this case, we would have enough data and confidence to fine-tune through the entire network.

Practical advice

Constraints from pretrained models.
- Note that if you wish to use a pretrained network, you may be slightly constrained in terms of the architecture you can use for your new dataset. For example, you can’t arbitrarily take out Conv layers from the pretrained network.
- However, some changes are straight-forward: Due to parameter sharing, you can easily run a pretrained network on images of different spatial size. This is clearly evident in the case of Conv/Pool layers because their forward function is independent of the input volume spatial size (as long as the strides “fit”).
- In the case of FC layers, this still holds true because an FC layer can be converted to a Convolutional Layer: for example, in AlexNet, the final pooling volume before the first FC layer is of size [6x6x256]. Therefore, the FC layer looking at this volume is equivalent to a Convolutional Layer that has a receptive field of size 6x6 and is applied with a padding of 0.
Learning rates.
- It’s common to use a smaller learning rate for ConvNet weights that are being fine-tuned, in comparison to the (randomly-initialized) weights for the new linear classifier that computes the class scores of your new dataset.
- This is because we expect that the ConvNet weights are relatively good, so we don’t wish to distort them too quickly and too much (especially while the new Linear Classifier above them is being trained from random initialization).
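The learning-rate advice above can be expressed with optimizer parameter groups. The sketch below uses a hypothetical two-part model (`body` standing in for pretrained weights, `head` for the new classifier); the two learning rates are illustrative assumptions, not recommended values.

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical two-part model: a "pretrained" body and a freshly initialized head
body = nn.Sequential(nn.Linear(16, 8), nn.ReLU())
head = nn.Linear(8, 2)
model = nn.Sequential(body, head)

# One parameter group per part, each with its own learning rate
optimizer = optim.SGD(
    [
        {"params": body.parameters(), "lr": 1e-4},  # small steps for pretrained weights
        {"params": head.parameters(), "lr": 1e-2},  # larger steps for the new classifier
    ],
    momentum=0.9,  # shared by both groups
)

for group in optimizer.param_groups:
    n_params = sum(p.numel() for p in group["params"])
    print(f"lr={group['lr']}: {n_params} parameters")
```

An LR scheduler such as StepLR scales every group by the same factor, so the ratio between the two rates is preserved during training.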


Saving and Loading Checkpoints

Fri, 06 Nov 2020 00:00:00 +0000

Motivation

Saving and loading a general checkpoint model for inference or resuming training can be helpful for picking up where we last left off.

When saving a general checkpoint, you must save more than just the model’s state_dict. It is important to also save the optimizer’s state_dict, as this contains buffers and parameters that are updated as the model trains. Other items that you may want to save are the

epoch you left off on,
the latest recorded training loss,
external torch.nn.Embedding layers,
and more, based on your own algorithm.

How to save and load checkpoints?

To save several items in one checkpoint, we organize them in a dictionary and use torch.save() to serialize the dictionary. A common PyTorch convention is to save these checkpoints using the .tar file extension.

To load the items,

first initialize the model and optimizer,
then load the dictionary locally using torch.load(). From here, we can easily access the saved items by simply querying the dictionary as you would expect.

Example

1. Import necessary libraries for loading our data

import torch
import torch.nn as nn
import torch.nn.functional as F  # needed for F.relu in the network below
import torch.optim as optim

2. Define and initialize the neural network

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()

3. Initialize the optimizer

optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

4. Saving the general checkpoint

Collect all relevant information,
Build our checkpoint dictionary.
Save checkpoint using torch.save()

# Additional information
EPOCH = 5
PATH = "model.pt"
LOSS = 0.4 # just dummy number

torch.save({'epoch': EPOCH,
 'model_state_dict': net.state_dict(),
 'optimizer_state_dict': optimizer.state_dict(),
 'loss': LOSS
 }, PATH)

5. Load the general checkpoint

First initialize the model and optimizer
Then load the checkpoint dictionary locally

# initialize the model and optimizer
model = Net()
# the optimizer must reference the parameters of the newly created model
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# load checkpoint
checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']

Call eval() for inference or train() for training
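As a compact round trip of steps 4 and 5, the following sketch saves a checkpoint after one training step and restores it into freshly built objects. The tiny `nn.Linear` model, the helper `make_model_and_optimizer`, and the temporary file path are assumptions made only to keep the example fast and self-contained.

```python
import os
import tempfile

import torch
import torch.nn as nn
import torch.optim as optim

def make_model_and_optimizer():
    # Hypothetical tiny model, used only to keep the sketch fast
    model = nn.Linear(4, 2)
    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    return model, optimizer

# "First run": take one training step, then write a checkpoint
model, optimizer = make_model_and_optimizer()
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optimizer.step()

ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
torch.save({"epoch": 1,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "loss": loss.item()}, ckpt_path)

# "Second run": rebuild the objects, restore their state, continue
resumed_model, resumed_optimizer = make_model_and_optimizer()
checkpoint = torch.load(ckpt_path)
resumed_model.load_state_dict(checkpoint["model_state_dict"])
resumed_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1

resumed_model.train()  # to resume training; call resumed_model.eval() for inference
```

Restoring the optimizer state matters here because SGD with momentum keeps a momentum buffer per parameter; without it, the first resumed steps would behave differently from an uninterrupted run.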


Reference

SAVING AND LOADING A GENERAL CHECKPOINT IN PYTORCH

nn ModuleList vs. Sequential

Mon, 09 Nov 2020 00:00:00 +0000

import torch
import torch.nn as nn
import torch.nn.functional as F

`nn.Module`

The base class for all neural network modules
Our own models must subclass it

Example

class Net(nn.Module):
    def __init__(self, in_c, n_classes):
        super().__init__()
        self.conv1 = nn.Conv2d(in_c, 32, kernel_size=3, stride=1, padding=1)
        self.bn1 = nn.BatchNorm2d(32)

        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(64)  # must match the 64 output channels of conv2

        # conv2 outputs 64 channels of size 28x28 (for 28x28 inputs)
        self.fc1 = nn.Linear(64 * 28 * 28, 1024)
        self.fc2 = nn.Linear(1024, n_classes)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = F.relu(x)

        x = self.conv2(x)
        x = self.bn2(x)
        x = F.relu(x)

        x = x.view(-1, 64 * 28 * 28)  # flatten

        x = self.fc1(x)
        x = torch.sigmoid(x)
        x = self.fc2(x)

        return x

model = Net(1, 10)
model

Net(
  (conv1): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (bn1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (conv2): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (fc1): Linear(in_features=50176, out_features=1024, bias=True)
  (fc2): Linear(in_features=1024, out_features=10, bias=True)
)

`nn.Sequential`

Sequential is a container of Modules that are stacked together and run one after another, in the order in which they were added.

The nn.Module’s stored in nn.Sequential are connected in a cascaded way
nn.Sequential has a forward() method
- Have to make sure that the output size of a block matches the input size of the following block.
Basically, it behaves just like a nn.Module

Example

class NetSequential(nn.Module):
    def __init__(self, in_c, n_classes):
        super().__init__()
        self.conv_block1 = nn.Sequential(
            nn.Conv2d(in_c, 32, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU()
        )

        self.conv_block2 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU()
        )

        # conv_block2 outputs 64 channels of size 28x28 (for 28x28 inputs)
        self.decoder = nn.Sequential(
            nn.Linear(64 * 28 * 28, 1024),
            nn.Sigmoid(),
            nn.Linear(1024, n_classes)
        )

    def forward(self, x):
        x = self.conv_block1(x)
        x = self.conv_block2(x)
        x = x.view(-1, 64 * 28 * 28)  # flatten
        x = self.decoder(x)
        return x

model = NetSequential(1, 10)
model

NetSequential(
  (conv_block1): Sequential(
    (0): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
  )
  (conv_block2): Sequential(
    (0): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
  )
  (decoder): Sequential(
    (0): Linear(in_features=50176, out_features=1024, bias=True)
    (1): Sigmoid()
    (2): Linear(in_features=1024, out_features=10, bias=True)
  )
)

`nn.ModuleList`

Documentation:

Holds submodules in a list.

ModuleList can be indexed like a regular Python list, but the modules it contains are properly registered and will be visible to all Module methods.

Does NOT have a forward() method, because it does not define any neural network, that is, there is no connection between each of the nn.Module’s that it stores.
We may use it to store nn.Module’s, just like we use Python lists to store other types of objects (integers, strings, etc.). PyTorch is “aware” of the existence of the nn.Module’s inside an nn.ModuleList.
Execution order of nn.Modules stored in nn.ModuleList is defined in forward(), which we have to implement explicitly by ourselves.

Example

class NetModuleList(nn.Module):
    def __init__(self, in_c, n_classes):
        super().__init__()
        self.module_list = nn.ModuleList([
            nn.Conv2d(in_c, 32, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Flatten(),
            # the second conv outputs 64 channels of size 28x28 (for 28x28 inputs)
            nn.Linear(64 * 28 * 28, 1024),
            nn.Sigmoid(),
            nn.Linear(1024, n_classes)
        ])

    def forward(self, x):
        # the execution order is defined here, not by the ModuleList itself
        for module in self.module_list:
            x = module(x)
        return x

model = NetModuleList(1, 10)
model

NetModuleList(
  (module_list): ModuleList(
    (0): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU()
    (6): Flatten(start_dim=1, end_dim=-1)
    (7): Linear(in_features=50176, out_features=1024, bias=True)
    (8): Sigmoid()
    (9): Linear(in_features=1024, out_features=10, bias=True)
  )
)

`nn.Sequential` vs. `nn.ModuleList`

	`nn.Sequential`	`nn.ModuleList`
Has `forward()` ?	✅	❌
Connection between `nn.Modules` stored inside?	✅	❌
Execution order = stored order?	✅	❌
Advantages	succinct	flexible

When to use which?

Use Module when we have a big block composed of multiple smaller blocks
Use Sequential when we want to create a small block from layers
Use ModuleList when we need to iterate through some layers or building blocks and do something
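The three rules of thumb can be combined in one small model. In the sketch below, the class `MLP` and its `sizes` argument are hypothetical names chosen for illustration: the outer class is a Module, each small block is a Sequential, and the blocks are kept in a ModuleList because `forward()` iterates over them and collects intermediate activations.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Hypothetical example: a variable-depth MLP built from a list of layer sizes."""

    def __init__(self, sizes):
        super().__init__()
        # One small Sequential block per consecutive pair of hidden sizes
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Linear(n_in, n_out), nn.ReLU())
            for n_in, n_out in zip(sizes[:-2], sizes[1:-1])
        ])
        self.head = nn.Linear(sizes[-2], sizes[-1])

    def forward(self, x):
        features = []
        for block in self.blocks:  # we decide the execution order ourselves
            x = block(x)
            features.append(x)     # e.g. keep the intermediate activations
        return self.head(x), features

model = MLP([784, 256, 64, 10])
logits, features = model(torch.randn(5, 784))
print(logits.shape, [f.shape for f in features])
```

Had the blocks been stored in a plain Python list instead of nn.ModuleList, their parameters would not show up in `model.parameters()` and would therefore never be trained.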


🔥 Custom Datasets and Transforms

Thu, 26 Nov 2020 00:00:00 +0000

Custom Dataset

In order to use our custom dataset, we need to

inherit from torch.utils.data.Dataset, an abstract class representing a dataset.
override
- __len__ so that len(dataset) returns the size of the dataset.
- __getitem__ to support the indexing such that dataset[i] can be used to get i-th sample.

The skeleton is as follows:

from torch.utils.data import Dataset

class MyCustomDataset(Dataset):
    def __init__(self, *args, **kwargs):
        # initial logic, e.g.
        #   read csv
        #   assign data transformation
        #   ...
        pass

    def __getitem__(self, index):
        """Get the index-th sample"""
        # Note: the return value can be customized depending on the application,
        # e.g. return (img, label)
        raise NotImplementedError

    def __len__(self):
        # return the number of examples (e.g. images) in the dataset
        raise NotImplementedError

Example

Let’s take the MNIST dataset as an example. Assume we have the csv file located at CSV_PATH. The structure of our csv file is

One instance/sample per line
- The first column is the digit label (0 - 9)
- The remaining 784 columns hold the values of each pixel in the image of size 28x28 ($28 \times 28 = 784$)
- I.e. each sample consists of an image of a digit and the label of the digit
There are 5000 lines in total, i.e. 5000 samples
- We want to use the first 4000 samples for training and validation,
- and the remaining 1000 samples for testing.

Let’s implement our custom MNIST dataset:

import numpy as np
import torch
from torch.utils.data import Dataset

class MyMNIST(Dataset):

    TRAIN, VALID, TEST = 0, 1, 2

    def __init__(self, csv_file, usage=TRAIN, transform=None, label_transform=None):
        """
        Args:
            csv_file (string): Path to the csv file
            usage (int): usage of the dataset (train/validation/test)
            transform (callable, optional): Optional transform to be applied on the image.
            label_transform (callable, optional): Optional transform to be applied on the label.
        """

        self.transform = transform  # image preprocessing
        self.label_transform = label_transform  # label preprocessing

        # load from csv file
        all_data = np.genfromtxt(csv_file, delimiter=',', dtype='uint8')

        # 5000 lines in csv file --> 5000 instances
        # training set: first 3000 lines
        # validation set: lines 3000 - 4000
        # test set: last 1000 lines
        train, test = all_data[:4000], all_data[4000:]
        train, val = train[:3000], train[3000:]

        # choose lines based on specified usage
        if usage == self.TRAIN:
            self.images = train[:, 1:]
            self.labels = train[:, 0]  # first column is the label of the digit
        elif usage == self.VALID:
            self.images = val[:, 1:]
            self.labels = val[:, 0]
        else:
            self.images = test[:, 1:]
            self.labels = test[:, 0]

    def __getitem__(self, index):
        # reshape the flat 784-pixel row into a 28x28 image, so that image
        # transforms such as transforms.ToTensor() (which expects 2D/3D arrays) accept it
        image = self.images[index].reshape(28, 28)
        label = self.labels[index]

        if self.transform is not None:
            image = self.transform(image)

        if self.label_transform is not None:
            label = self.label_transform(label)
        # convert label to Tensor of dtype long
        label = torch.as_tensor(label, dtype=torch.long)

        return image, label

    def __len__(self):
        return len(self.labels)

Use our custom MNIST dataset:

from torchvision import transforms

# apply conversion to Tensor and normalization before using the dataset
preprocess_transform = transforms.Compose([transforms.ToTensor(),
                                           transforms.Normalize((0.1,), (0.4,))])

# let's say we use the dataset for testing
my_mnist = MyMNIST(csv_file=CSV_PATH,
 usage=MyMNIST.TEST,
 transform=preprocess_transform)

Custom transform and augmentation

The example code above makes use of the transforms provided by torchvision.transforms. We can also implement custom transforms ourselves.

To do this, we need to write them as callable classes:

inherit from the object class
implement __init__ if needed
define the desired transformations in the __call__(self, image) method

Example

For example, let’s implement two custom transforms:

import torch

class MyNormalizer(object):
    """Normalize image"""

    def __call__(self, image):
        """
        Only works for our custom MNIST dataset: divide the pixel values by 255.
        Generally, normalization should work as follows:
            data_normalized = (data - data.mean) / data.std
        """
        image = image * 1.0 / 255
        return image


class MyToTensor(object):
    """Convert image (a NumPy array) to a float PyTorch Tensor"""

    def __call__(self, image):
        image = torch.from_numpy(image).float()
        return image

Use custom transform in our custom MNIST dataset

preprocess_transform = transforms.Compose([MyToTensor(),
 MyNormalizer()])

my_mnist = MyMNIST(csv_file=CSV_PATH,
 usage=MyMNIST.TEST,
 transform=preprocess_transform)
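Once a custom dataset exists, it is typically wrapped in a DataLoader for batching. The sketch below cannot assume the csv file is available, so it uses a hypothetical in-memory `ArrayDataset` with the same `__getitem__`/`__len__` interface and random stand-in data (not real MNIST); `to_float_tensor` mirrors the idea of MyToTensor followed by MyNormalizer.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class ArrayDataset(Dataset):
    """Hypothetical in-memory dataset with the same interface as MyMNIST."""

    def __init__(self, images, labels, transform=None):
        self.images, self.labels, self.transform = images, labels, transform

    def __getitem__(self, index):
        image = self.images[index]
        if self.transform is not None:
            image = self.transform(image)
        return image, torch.as_tensor(self.labels[index], dtype=torch.long)

    def __len__(self):
        return len(self.labels)

def to_float_tensor(array):
    # same idea as MyToTensor followed by MyNormalizer
    return torch.from_numpy(array).float() / 255

# Random stand-in data: 10 "images" of 784 pixels each
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(10, 784), dtype=np.uint8)
labels = rng.integers(0, 10, size=10, dtype=np.uint8)

dataset = ArrayDataset(images, labels, transform=to_float_tensor)
loader = DataLoader(dataset, batch_size=4, shuffle=False)

# The DataLoader calls dataset[i] for each index and stacks the samples into batches
batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in loader]
print(batch_shapes)
```

The last batch is smaller than `batch_size` because 10 is not a multiple of 4; passing `drop_last=True` to the DataLoader would discard it instead.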


🔥🧾 General Training Steps Using PyTorch

Thu, 26 Nov 2020 00:00:00 +0000


General steps:

Set device
Set Dataset and DataLoader
Define network model
Build network model
Define loss function and optimizer
Define training process
Train the model
Store/Load weights
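The eight steps above can be sketched end to end as follows. Everything concrete here is an assumption made for illustration: the random stand-in data, the hypothetical `TinyNet` model, and the hyperparameters.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# 1. Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 2. Set Dataset and DataLoader (random stand-in data, not a real dataset)
torch.manual_seed(0)
X = torch.randn(64, 10)
y = (X.sum(dim=1) > 0).long()
loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)

# 3. Define network model
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))

    def forward(self, x):
        return self.layers(x)

# 4. Build network model
model = TinyNet().to(device)

# 5. Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# 6. Define training process
def train_one_epoch():
    model.train()
    total_loss = 0.0
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * inputs.size(0)
    return total_loss / len(loader.dataset)  # mean loss per sample

# 7. Train the model
losses = [train_one_epoch() for _ in range(5)]

# 8. Store/Load weights (pass `state` to torch.save to persist it on disk)
state = model.state_dict()
restored = TinyNet()
restored.load_state_dict(state)
```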

Saving and Loading Models

Sun, 17 Jan 2021 00:00:00 +0000

Three core functions for saving and loading models:

torch.save

Saves a serialized object to disk. This function uses Python’s pickle utility for serialization. Models, tensors, and dictionaries of all kinds of objects can be saved using this function.
torch.load

Uses pickle’s unpickling facilities to deserialize pickled object files to memory.
torch.nn.Module.load_state_dict

Loads a model’s parameter dictionary using a deserialized state_dict.

`state_dict`

In PyTorch,

the learnable parameters (i.e. weights and biases) of a torch.nn.Module model are contained in the model’s parameters (accessed with model.parameters()). A state_dict is simply a Python dictionary object that maps each layer to its parameter tensor.
- Note that only layers with learnable parameters (convolutional layers, linear layers, etc.) and registered buffers (batchnorm’s running_mean) have entries in the model’s state_dict.
Optimizer objects (torch.optim) also have a state_dict, which contains information about the optimizer’s state, as well as the hyperparameters used.

Because state_dict objects are Python dictionaries, they can be easily saved, updated, altered, and restored, adding a great deal of modularity to PyTorch models and optimizers.

Example

import torch
import torch.nn as nn
import torch.nn.functional as F  # F.relu lives in torch.nn.functional
import torch.optim as optim

class TheModelClass(nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Initialize model
model = TheModelClass()

# Initialize optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Print model's state_dict
print("Model's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

Model's state_dict:
conv1.weight torch.Size([6, 3, 5, 5])
conv1.bias torch.Size([6])
conv2.weight torch.Size([16, 6, 5, 5])
conv2.bias torch.Size([16])
fc1.weight torch.Size([120, 400])
fc1.bias torch.Size([120])
fc2.weight torch.Size([84, 120])
fc2.bias torch.Size([84])
fc3.weight torch.Size([10, 84])
fc3.bias torch.Size([10])

# Print optimizer's state_dict
print("Optimizer's state_dict:")
for var_name in optimizer.state_dict():
    print(var_name, "\t", optimizer.state_dict()[var_name])

Optimizer's state_dict:
state {}
param_groups [{'lr': 0.001, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}]

Saving & Loading Model for Inference

Save/Load `state_dict` (Recommended)

Save:

torch.save(model.state_dict(), PATH)

Load:

model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load(PATH))
model.eval()

When saving a model for inference, it is only necessary to save the trained model’s learned parameters. Saving the model’s state_dict with the torch.save() function will give you the most flexibility for restoring the model later.

A common PyTorch convention is to save models using either a .pt or .pth file extension.

Save/Load entire model

Save:

torch.save(model, PATH)

Load:

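# Model class must be defined somewhere
# (on PyTorch 2.6 and later, torch.load defaults to weights_only=True,
# so loading a whole pickled model needs torch.load(PATH, weights_only=False))
model = torch.load(PATH)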
model.eval()

This save/load process uses the most intuitive syntax and involves the least amount of code. Saving a model in this way will save the entire module using Python’s pickle module.

🔴 The disadvantage of this approach is that the serialized data is bound to the specific classes and the exact directory structure used when the model is saved. This is because pickle does not save the model class itself. Rather, it saves a path to the file containing the class, which is used at load time. Because of this, your code can break in various ways when used in other projects or after refactoring.

A common PyTorch convention is to save models using either a .pt or .pth file extension.

Saving & Loading a General Checkpoint for Inference and/or Resuming Training

See: Saving and Loading Checkpoints


Data Augmentation

Sun, 17 Jan 2021 00:00:00 +0000

What is data augmentation?

It is often hard to get enough data for training neural networks. Image augmentation addresses this problem by creating new training examples from the existing ones: to make a new sample, you slightly change the original image.

For instance, you could make a new image a little brighter; you could cut a piece from the original image; you could make a new image by mirroring the original one, etc. Here are some examples of transformations of the original image that will create a new training sample:

By applying those transformations to the original training dataset, you could create an almost infinite amount of new training samples.

Premise of data augmentation

A convolutional neural network that can robustly classify objects even if they are placed in different orientations is said to have the property called invariance. More specifically, a CNN can be invariant to translation, viewpoint, size, or illumination (or a combination of these).

When to apply augmentation?

The answer may seem quite obvious; we do augmentation before we feed the data to the model.

However, we have two options here:

Offline augmentation
- Preferred for relatively smaller datasets
- Increasing the size of the dataset by a factor equal to the number of transformations we perform
  - For example, by flipping all my images, I would increase the size of my dataset by a factor of 2
Online augmentation / Augmentation on the fly
- Preferred for larger datasets, as we can’t afford the explosive increase in size.
- Perform transformations on the mini-batches that we would feed to our model.
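The difference between the two options can be sketched with plain tensors. The random `images` tensor is a stand-in for a real dataset, and `random_hflip` is a hypothetical helper written for this illustration (in practice, torchvision.transforms.RandomHorizontalFlip plays this role).

```python
import torch

torch.manual_seed(0)
images = torch.rand(6, 1, 8, 8)  # stand-in "dataset" of 6 tiny grayscale images

# Offline: materialize the flipped copies once, before training.
# One transformation -> the dataset doubles in size.
offline_dataset = torch.cat([images, torch.flip(images, dims=[-1])])

# Online: flip each image of a mini-batch with probability p, every time it is drawn.
# Nothing is stored, so the dataset keeps its original size.
def random_hflip(batch, p=0.5):
    flip_mask = p > torch.rand(batch.shape[0])  # one coin toss per image
    out = batch.clone()
    out[flip_mask] = torch.flip(batch[flip_mask], dims=[-1])
    return out

augmented_batch = random_hflip(images[:4])
print(offline_dataset.shape, augmented_batch.shape)
```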

Use data augmentation in the right way

‼️ Do NOT increase irrelevant data!!!

Sometimes not all augmentation techniques make sense for a dataset. Consider the following car example:

The first image (from the left) is the original, the second one is flipped horizontally, the third one is rotated by 180 degrees, and the last one is rotated by 90 degrees (clockwise).

They are pictures of the same car, but our target application may NEVER see cars presented in these orientations. For example, if we’re gonna classify random cars on the road, only the second image would make sense to be in the dataset.

How to conduct data augmentation in PyTorch?

Use `torchvision.transforms`

Provides common image transformations
Can be chained together using transforms.Compose

🔥 Use `albumentations`

Demo

Demo for viewing different augmentation transformations

When will data augmentation be applied in PyTorch?

In any epoch the dataloader will apply a fresh set of random operations “on the fly”. I.e. the augmentation happens inside of this line:

for (data, target) in dataloader:

Instead of showing the exact same items at every epoch, you are showing a variant that has been changed in a different way. So after three epochs, you would have seen three random variants of each item in a dataset.

Note that each image will be transformed randomly on-the-fly, thus NO images will be generated and the length of Dataset stays the SAME.

If you want to perform more augmentation and bring more variability into the dataset, just increase the number of epochs.
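A small experiment makes this behaviour visible. The `NoisyDataset` below is a hypothetical dataset whose "augmentation" is simply fresh random noise added in `__getitem__`; it stands in for any random transform passed to a real dataset.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class NoisyDataset(Dataset):
    """Hypothetical dataset that applies a random transform on every access."""

    def __init__(self, data):
        self.data = data

    def __getitem__(self, index):
        sample = self.data[index]
        return sample + 0.1 * torch.randn_like(sample)  # fresh noise each time

    def __len__(self):
        return len(self.data)

dataset = NoisyDataset(torch.zeros(8, 3))
loader = DataLoader(dataset, batch_size=4, shuffle=False)

# Iterate over the loader three times, i.e. three "epochs"
epochs = [torch.cat(list(loader)) for _ in range(3)]

print(len(dataset))                       # the dataset length never changes
print(torch.equal(epochs[0], epochs[1]))  # but the samples differ between epochs
```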

Reference:

Data augmentation in PyTorch

Transform and Image Data Augmentation

Basic question about torchvision.transforms

Reference

Data Augmentation | How to use Deep Learning when you have Limited Data — Part 2

TorchScript

Wed, 21 Apr 2021 00:00:00 +0000

TorchScript

A PyTorch model’s journey from Python to C++ is enabled by Torch Script, a representation of a PyTorch model that can be understood, compiled and serialized by the Torch Script compiler.
Any TorchScript program can be saved from a Python process and loaded in a process where there is NO Python dependency. In other words, a TorchScript program can be run independently from Python, such as in a standalone C++ program.
This makes it possible to train models in PyTorch using familiar tools in Python and then export the model via TorchScript to a production environment where Python programs may be disadvantageous for performance and multi-threading reasons.
👍 Advantage
- TorchScript code can be invoked in its own interpreter, which is basically a restricted Python interpreter. This interpreter does not acquire the Global Interpreter Lock, and so many requests can be processed on the same instance simultaneously.
- This format allows us to save the whole model to disk and load it into another environment, such as in a server written in a language other than Python
- TorchScript gives us a representation in which we can do compiler optimizations on the code to provide more efficient execution
- TorchScript allows us to interface with many backend/device runtimes that require a broader view of the program than individual operators.

Steps for Loading a PyTorch Model in C++

Convert PyTorch Model to TorchScript
Serialize script module to a file
Load script module in C++
Execute script module in C++

Convert PyTorch Model to Torch Script

There are two ways to convert a PyTorch model to Torch Script

Tracing
Scripting

Tracing

A mechanism in which
- the structure of the model is captured by evaluating it once using example inputs and
- recording the flow of those inputs through the model.
Suitable for models that make limited use of control flow
Function: torch.jit.trace

Example

import torch

class MyCell(torch.nn.Module):
    def __init__(self):
        super(MyCell, self).__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x, h):
        new_h = torch.tanh(self.linear(x) + h)
        return new_h, new_h

my_cell = MyCell()
x, h = torch.rand(3, 4), torch.rand(3, 4)
traced_cell = torch.jit.trace(my_cell, (x, h))

What happens under the hood when we call torch.jit.trace, passing in the Module and an example input?

It invoked the Module
It recorded the operations that occurred when the Module was run
It created an instance of torch.jit.ScriptModule

TorchScript records its definitions in an Intermediate Representation (or IR), commonly referred to in deep learning as a graph. We can examine the graph with the .graph property.

However, the graph is a very low-level representation. A more readable way is to use the .code property, which gives a Python-syntax interpretation of the code:

print(traced_cell.code)

Out:

def forward(self,
    input: Tensor,
    h: Tensor) -> Tuple[Tensor, Tensor]:
  _0 = torch.add((self.linear).forward(input, ), h, alpha=1)
  _1 = torch.tanh(_0)
  return (_1, _1)

Scripting

If our code uses control flow (if-else, loops, …), then tracing is unsuitable, because it records only the operations executed for the example input. In this case, we use the script compiler, which analyzes our Python source code directly and transforms it into TorchScript. The function for compiling the module is torch.jit.script.

Example

import torch

class MyModule(torch.nn.Module):
    def __init__(self, N, M):
        super(MyModule, self).__init__()
        self.weight = torch.nn.Parameter(torch.rand(N, M))

    def forward(self, input):
        if input.sum() > 0:
            output = self.weight.mv(input)
        else:
            output = self.weight + input
        return output

my_module = MyModule(10, 20)
sm = torch.jit.script(my_module)

sm is an instance of ScriptModule that is ready for serialization.
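To see why tracing is unsuitable for data-dependent control flow, the following minimal sketch compares the two approaches on a toy module. The `Gate` module and its inputs are made up for illustration and are not part of the example above. Tracing records only the branch taken by the example input, whereas scripting keeps both branches.

```python
import warnings

import torch


class Gate(torch.nn.Module):
    """Toy module (hypothetical) with data-dependent control flow."""

    def forward(self, x):
        if x.sum() > 0:
            return x * 2
        return x - 1


gate = Gate()
positive = torch.ones(3)           # takes the `x * 2` branch
negative = torch.full((3,), -3.0)  # takes the `x - 1` branch

# Tracing with a positive example records only the `x * 2` branch.
# PyTorch emits a TracerWarning about this, which is silenced here.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    traced = torch.jit.trace(gate, positive)

# Scripting compiles the source code, so both branches survive.
scripted = torch.jit.script(gate)

traced_matches = torch.equal(traced(negative), gate(negative))
scripted_matches = torch.equal(scripted(negative), gate(negative))
print(f"traced matches eager:   {traced_matches}")
print(f"scripted matches eager: {scripted_matches}")
```

For the negative input, the traced module still multiplies by two, so only the scripted module agrees with the original Python module.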

Mixing Scripting and Tracing

In many cases either tracing or scripting is an easier approach for converting a model to TorchScript. Tracing and scripting can be composed to suit the particular requirements of a part of a model.

Scripted functions can call traced functions.

Useful when we need to use control-flow around a simple feed-forward model

Example

import torch

def foo(x, y):
    return 2 * x + y

traced_foo = torch.jit.trace(foo, (torch.rand(3), torch.rand(3)))

@torch.jit.script
def bar(x):
    return traced_foo(x, x)

Traced functions can call script functions.

Useful when a small part of a model requires some control-flow even though most of the model is just a feed-forward network.
Control-flow inside of a script function called by a traced function is preserved correctly.

Example

import torch

@torch.jit.script
def foo(x, y):
    if x.max() > y.max():
        r = x
    else:
        r = y
    return r


def bar(x, y, z):
    return foo(x, y) + z

traced_bar = torch.jit.trace(bar, (torch.rand(3), torch.rand(3), torch.rand(3)))
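As a quick check that the control flow inside `foo` really survives tracing, the sketch below re-creates the example and calls `traced_bar` with inputs that force each branch. The concrete tensors `small`, `large` and `zeros` are illustrative assumptions.

```python
import torch


@torch.jit.script
def foo(x, y):
    if x.max() > y.max():
        r = x
    else:
        r = y
    return r


def bar(x, y, z):
    return foo(x, y) + z


traced_bar = torch.jit.trace(bar, (torch.rand(3), torch.rand(3), torch.rand(3)))

small = torch.tensor([1.0, 2.0, 3.0])
large = torch.tensor([4.0, 5.0, 6.0])
zeros = torch.zeros(3)

# foo returns whichever argument has the larger maximum, so both calls
# should yield `large` regardless of the argument order.
first = traced_bar(small, large, zeros)   # takes the `else` branch
second = traced_bar(large, small, zeros)  # takes the `if` branch
print(first, second)
```

If the trace had frozen a single branch, one of the two calls would return `small` instead.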

Saving and Loading Script Module

Save: the save() method of the script module
Load: torch.jit.load()

Example:

import torch
import torchvision

# An instance of your model.
model = torchvision.models.resnet18()

# An example input you would normally provide to your model's forward() method.
example = torch.rand(1, 3, 224, 224)

# Use torch.jit.trace to generate a torch.jit.ScriptModule via tracing.
traced_script_module = torch.jit.trace(model, example)

Save:

traced_script_module.save("traced_resnet_model.pt")

Load:

traced_resnet = torch.jit.load("traced_resnet_model.pt")
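A minimal round-trip sketch, assuming a tiny stand-in model instead of ResNet-18 so that it runs without torchvision or downloaded weights. The file name `tiny_model.pt` and the temporary directory are arbitrary choices for illustration. It checks that the reloaded module reproduces the output of the original model.

```python
import os
import tempfile

import torch

# Tiny stand-in model (an assumption made for this sketch).
model = torch.nn.Sequential(torch.nn.Linear(4, 2), torch.nn.ReLU()).eval()
example = torch.rand(1, 4)

# Convert to TorchScript via tracing.
traced = torch.jit.trace(model, example)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tiny_model.pt")
    traced.save(path)               # serialize the script module
    restored = torch.jit.load(path)  # load it back without the Python class

# The restored module should behave like the original model.
same_output = torch.allclose(restored(example), model(example))
print(f"outputs match: {same_output}")
```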

Performance Measurement

Mon, 24 May 2021 00:00:00 +0000

Main Issues of Time Measurement

GPU Execution Mechanism: Asynchronous Execution

In multithreaded or multi-device programming, two blocks of code that are independent can be executed in parallel. This means that the second block may be executed before the first is finished. This process is referred to as asynchronous execution.

In the deep learning context, we rely on asynchronous execution all the time, because GPU operations are asynchronous by default.

More specifically, when we call a function that uses a GPU, its operations are enqueued on that particular device and control returns to the CPU before they have finished. This allows us to execute other computations in parallel on the CPU or on another GPU.

Asynchronous execution offers huge advantages for deep learning, such as the ability to decrease run-time by a large factor.

For example, at the inference of multiple batches, the second batch can be preprocessed on the CPU while the first batch is fed forward through the network on the GPU. Clearly, it would be beneficial to use asynchronism whenever possible at inference time.

However, asynchronous execution can be the cause of many headaches when it comes to time measurements.

When you calculate time with the time library in Python, the measurements are performed on the CPU device. Due to the asynchronous nature of the GPU, the line of code that stops the timing will be executed before the GPU process finishes. As a result, the timing will be inaccurate or irrelevant to the actual inference time.
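The effect can be sketched with a small helper that times a forward pass with the CPU clock, once without and once with synchronization. The helper name `cpu_timed_forward` and the toy model are assumptions made for illustration. On a machine without CUDA the sketch falls back to the CPU, where both variants measure the same thing.

```python
import time

import torch


def cpu_timed_forward(model, inputs, synchronize):
    """Time one forward pass with the CPU clock and return seconds.

    If `synchronize` is True and the inputs live on a GPU, wait for all
    queued GPU work before starting and before stopping the clock.
    """
    needs_sync = synchronize and inputs.is_cuda
    if needs_sync:
        torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        model(inputs)
    if needs_sync:
        torch.cuda.synchronize()
    return time.perf_counter() - start


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 512)
).to(device).eval()
inputs = torch.randn(64, 512, device=device)

cpu_timed_forward(model, inputs, synchronize=True)  # warm-up run

unsynced = cpu_timed_forward(model, inputs, synchronize=False)
synced = cpu_timed_forward(model, inputs, synchronize=True)
print(f"without synchronize: {unsynced * 1e3:.3f} ms")
print(f"with synchronize:    {synced * 1e3:.3f} ms")
```

On a GPU, the unsynchronized figure mostly reflects the time needed to enqueue the work, so it tends to understate the real inference time.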

GPU Warm-up

A modern GPU device can exist in one of several different power states.

When the GPU is NOT being used for any purpose and persistence mode (which keeps the GPU initialized) is not enabled, the GPU automatically reduces its power state to a very low level, sometimes even shutting down completely. In a lower power state, the GPU shuts down different pieces of hardware, including memory subsystems, internal subsystems, or even compute cores and caches.

In a low power state, the invocation of any program that attempts to interact with the GPU will cause the driver to load and/or initialize the GPU. This driver load behavior is noteworthy! Applications that trigger GPU initialization can incur up to 3 seconds of latency, due to the scrubbing behavior of the error correcting code.

For instance, if we measure time for a network that takes 10 milliseconds for one example, running over 1000 examples may result in most of our running time being wasted on initializing the GPU.

The Correct Way to Measure Inference Time

Before we make any time measurements, we run some dummy examples through the network to do a ‘GPU warm-up.’ This will automatically initialize the GPU and prevent it from going into power-saving mode when we measure time.
Next, we use torch.cuda.Event to measure time on the GPU.
- It is crucial here to use torch.cuda.synchronize(). This line of code performs synchronization between the host and device (i.e., CPU and GPU), so the elapsed time is read only after the work running on the GPU has finished. This overcomes the issue of unsynchronized execution.

Code Snippet

import torch
import torchvision.models as models
import numpy as np
from tqdm import tqdm


device = torch.device("cuda")
# Note: newer torchvision versions prefer the `weights=` argument over `pretrained=True`
model = models.resnet18(pretrained=True).to(device)
model.eval()  # put batch norm / dropout layers into inference mode
# Move the input to the GPU up front so that no host-to-device copy is timed
dummy_input = torch.randn([1, 3, 1024, 2048], dtype=torch.float).to(device)

# Init loggers
WARMUP_REPETITION = 100
MEASURE_REPETITION = 300
starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
infer_times = np.zeros((MEASURE_REPETITION, 1))  # per-run latency in milliseconds

with torch.no_grad():
    # GPU warm-up
    for _ in tqdm(range(WARMUP_REPETITION), desc="GPU warm-up", total=WARMUP_REPETITION):
        _ = model(dummy_input)

    # Measure performance
    for rep in tqdm(range(MEASURE_REPETITION), desc="Measuring inference time", total=MEASURE_REPETITION):
        starter.record()
        _ = model(dummy_input)
        ender.record()

        # Wait for GPU sync
        torch.cuda.synchronize()

        # elapsed_time() returns milliseconds
        infer_times[rep] = starter.elapsed_time(ender)

mean_time = np.mean(infer_times)
std_time = np.std(infer_times)

print()
print(f"Mean: {mean_time:.3f} ms, Std: {std_time:.3f} ms")
print(f"FPS: {1000 / mean_time:.3f}")  # 1000 ms per second

Out:

GPU warm-up: 100%|██████████| 100/100 [00:04<00:00, 24.66it/s]
Measuring inference time: 100%|██████████| 300/300 [00:13<00:00, 21.52it/s]
Mean: 44.390 ms, Std: 0.890 ms
FPS: 22.528

Common Mistakes when Measuring Time

When we measure the latency of a network, our goal is to measure only the feed-forward of the network (i.e. the inference), not more and not less.

Some common mistakes are listed below:

Transferring data between the host and the device

One of the most common mistakes involves the transfer of data between the CPU and GPU while taking time measurements. This usually happens unintentionally when a tensor is created on the CPU and moved to the GPU inside the timed region. The memory allocation and copy take a considerable amount of time, which inflates the measured inference time.

Not using GPU warm-up

The first run on the GPU prompts its initialization. GPU initialization can take up to 3 seconds, which makes a huge difference when the timing is in terms of milliseconds.

Using standard CPU timing

The most common mistake is to measure time with a CPU timer (e.g. Python's time library) without synchronization, so the clock stops before the GPU has finished its work.

Taking only one sample

A common mistake is to use ONLY one sample and refer to it as the run-time.

Like many processes in computer science, feed forward of the neural network has a (small) stochastic component. The variance of the run-time can be significant, especially when measuring a low latency network. To this end, it is essential to run the network over several examples and then average the results (300 examples can be a good number).

Measuring FPS

Once we have measured the inference time per image (in seconds), Frames Per Second (FPS) can be easily computed:

$$ FPS = \frac{1}{\text{inference time per image}} $$
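As a worked example, assume a mean inference time of 44.39 ms per image. The helper name `fps_from_latency_ms` is made up for illustration. Because the formula expects seconds, a latency measured in milliseconds has to be converted first.

```python
def fps_from_latency_ms(latency_ms):
    """Convert a per-image latency in milliseconds to frames per second."""
    latency_s = latency_ms / 1000.0  # the formula expects seconds
    return 1.0 / latency_s


fps = fps_from_latency_ms(44.39)
print(f"FPS: {fps:.3f}")  # prints "FPS: 22.528"
```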

Measuring Throughput

The throughput of a neural network is defined as the maximal number of input instances the network can process in a unit of time (e.g., a second). To achieve maximal throughput, we would like to process as many instances as possible in parallel. The effective parallelism is data-, model-, and device-dependent.

Thus, to correctly measure throughput we perform the following two steps:

1. We estimate the optimal batch size that allows for maximum parallelism
- Rule of thumb: reach the memory limit of our GPU for the given data type
- Using a for loop, we increase the batch size by one until a runtime (out-of-memory) error is raised. This identifies the largest batch size the GPU can process for our neural network model and its input data.
2. Given this optimal batch size, we measure the number of instances the network can process in one second.
- We process many batches (100 batches is sufficient) and then use the following formula: $$ \frac{\text{\#batches} \times \text{batch size}}{\text{total time in seconds}} $$ This formula gives the number of examples our network can process in one second.
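The batch-size search from the first step can be sketched as follows. `try_batch` is a hypothetical callable that runs one forward pass for a given batch size and raises `RuntimeError` when the batch does not fit (PyTorch reports CUDA out-of-memory conditions as a `RuntimeError` or a subclass of it). `fake_forward` and the `upper_limit` cap are assumptions that merely simulate such a device so the sketch runs anywhere.

```python
def find_max_batch_size(try_batch, upper_limit=1024):
    """Return the largest batch size for which `try_batch` succeeds.

    The batch size grows by one per step and the search stops at the
    first RuntimeError or when `upper_limit` is reached.
    """
    largest_ok = 0
    for batch_size in range(1, upper_limit + 1):
        try:
            try_batch(batch_size)
        except RuntimeError:
            break
        largest_ok = batch_size
    return largest_ok


def fake_forward(batch_size, capacity=5):
    """Stand-in for a real forward pass: fails above `capacity`."""
    if batch_size > capacity:
        raise RuntimeError("simulated out-of-memory error")


optimal_batch_size = find_max_batch_size(fake_forward)
print(optimal_batch_size)  # prints 5

# Throughput formula from the second step, with made-up numbers:
# 100 batches of 5 instances processed in 2.0 seconds.
throughput = (100 * optimal_batch_size) / 2.0
print(throughput)  # prints 250.0
```

With a real model, `try_batch` would build a dummy input of the given batch size on the GPU and run the model under `torch.no_grad()`.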

Code Snippet

import torch
import torchvision.models as models
from tqdm import tqdm

# Assume that we have estimated the optimal batch size.
# The value below is only a placeholder: replace it with your own estimate.
optimal_batch_size = 8

device = torch.device("cuda")
model = models.resnet18(pretrained=True).to(device)
model.eval()  # put batch norm / dropout layers into inference mode
dummy_input = torch.randn([optimal_batch_size, 3, 1024, 2048], dtype=torch.float).to(device)

# Init loggers
MEASURE_REPETITION = 300
starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)

total_time = 0.0  # accumulated GPU time in seconds

# Measure performance
with torch.no_grad():
    for rep in tqdm(range(MEASURE_REPETITION), desc="Measuring throughput", total=MEASURE_REPETITION):
        starter.record()
        _ = model(dummy_input)
        ender.record()

        # Wait for GPU sync
        torch.cuda.synchronize()

        curr_time = starter.elapsed_time(ender) / 1000  # ms -> s
        total_time += curr_time

# Number of instances processed per second
throughput = (MEASURE_REPETITION * optimal_batch_size) / total_time
print(f"Final Throughput: {throughput}")

Compute FLOPs

Firstly, we have to clearly distinguish between FLOPS and FLOPs

FLOPS: floating point operations per second, is a measure of computer (hardware) performance, useful in fields of scientific computations that require floating-point calculations.
FLOPs: floating point operations, is the number of floating point operations, which is a metric for the complexity of a model or an algorithm.

To compute FLOPs, we can use fvcore. (For more details, see: Flop Counter for PyTorch Models)

Code example:

from fvcore.nn import FlopCountAnalysis

def get_FLOPs(model, dummy_input):
    flops = FlopCountAnalysis(model, dummy_input)
    return flops.total()

📈 Training

Mon, 07 Sep 2020 00:00:00 +0000

This section includes some practical tips and tools for training of neural networks with PyTorch.

‼️ Issues & Gotchas

Mon, 07 Sep 2020 00:00:00 +0000

This section summarizes some issues and gotchas which may occur in practice.
