Training Issues

Here I document some of the problems I ran into while training neural networks.

Validation accuracy higher than training accuracy

Possible reasons

CUDA error (59): Device-side assert triggered

This error usually occurs for one of two reasons:

  1. Inconsistency between the number of labels/classes and the number of output units.
  2. Incorrect input to the loss function.

Inconsistency between the number of labels/classes and the number of output units

I came across this error while working on the Sign Language MNIST dataset.

In this dataset, every image is labelled 0-25, mapping one-to-one onto the alphabetic letters A-Z (with no samples for 9=J or 25=Z, since those letters involve gesture motion). In other words, the greatest label is 24 and the labels are non-contiguous (0, 1, …, 8, 10, 11, …, 24). There are 24 label classes in total.

So I just naively designed the last FC layer as follows:

self.fc3 = nn.Linear(48, 24)

Then this error occurred. Why?

The error is usually identified at the line where you do the backpropagation. Your loss function compares the output of your model with the label of that observation in your dataset. In my case, the output dimension of the last FC layer is 24, which means the greatest class index the model can predict is 23 (counting from zero). However, in this dataset some labels have the value 24, which is out of range (24 > 23). That is what triggers the error!
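
Here is a minimal sketch of the mismatch (the layer sizes and the example batch are illustrative assumptions, not the actual Sign Language MNIST model):

import torch
import torch.nn as nn

# Hypothetical final layer with 24 output units: valid class indices are 0..23
fc = nn.Linear(48, 24)
logits = fc(torch.randn(4, 48))

# A batch whose labels include the value 24, which is out of range for 24 outputs
labels = torch.tensor([3, 10, 24, 7])

# On the CPU this raises "IndexError: Target 24 is out of bounds";
# on the GPU it triggers the device-side assert instead
loss = nn.CrossEntropyLoss()(logits, labels)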

How to fix it?

Make sure the number of output units covers all of your class labels (at least the greatest label value plus one).

In my case, I changed the output dimension of the last FC layer from 24 to 25:

self.fc3 = nn.Linear(48, 25)

Then everything works well!
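
As a side note, an alternative I did not take here would be to remap the non-contiguous labels onto a contiguous 0-23 range and keep the layer at 24 outputs. A rough sketch (the mapping below is an illustrative assumption):

import torch

# Original labels are 0..24 with 9 unused; map them onto a contiguous 0..23 range
present = [label for label in range(25) if label != 9]
to_contiguous = {orig: new for new, orig in enumerate(present)}

labels = torch.tensor([3, 10, 24, 7])
remapped = torch.tensor([to_contiguous[int(label)] for label in labels])  # values now in 0..23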

Wrong input for the loss function

Loss functions differ in the range of inputs they can accept, so choosing an incompatible activation function for your output layer can trigger this error. For example, BCELoss requires its input to be between 0 and 1, which usually means passing the model's output through a sigmoid first. If the input (the output of your model) falls outside the range that particular loss function accepts, the error is triggered.
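
As a minimal sketch of the BCELoss case (the shapes and tensors are illustrative assumptions):

import torch
import torch.nn as nn

logits = torch.randn(8, 1)          # raw model outputs, any real number
targets = torch.rand(8, 1).round()  # binary targets, 0.0 or 1.0

# BCELoss expects probabilities in [0, 1], so pass the outputs through a sigmoid first
loss = nn.BCELoss()(torch.sigmoid(logits), targets)

# Feeding raw logits straight into BCELoss can trigger the device-side assert on the GPU;
# nn.BCEWithLogitsLoss applies the sigmoid internally, which avoids the problem
loss_alt = nn.BCEWithLogitsLoss()(logits, targets)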

Small extra tip

The error messages you get when running into this error may not be very descriptive. To make sure you get the complete and useful stack trace, have this at the very beginning of your code and run it before anything else:

import os
# Force synchronous CUDA kernel launches so the stack trace points at the actual failing operation
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
