<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Object Detection | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/object-detection/</link><atom:link href="https://haobin-tan.netlify.app/tags/object-detection/index.xml" rel="self" type="application/rss+xml"/><description>Object Detection</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sat, 20 Feb 2021 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>Object Detection</title><link>https://haobin-tan.netlify.app/tags/object-detection/</link></image><item><title>Object Detection</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/</link><pubDate>Thu, 12 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/</guid><description/></item><item><title>Evaluation Metrics for Object Detection</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/evaluation-metrics-object-detection/</link><pubDate>Thu, 12 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/evaluation-metrics-object-detection/</guid><description>&lt;h2 id="precision--recall">Precision &amp;amp; Recall&lt;/h2>
&lt;p>Confusion matrix:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*CPnO_bcdbE8FXTejQiV2dg.png" alt="Image result for true positive false positive">&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Precision&lt;/strong>: measures how accurate your predictions are, i.e. the percentage of your predictions that are correct.
&lt;/p>
$$
\text{precision} = \frac{TP}{TP + FP}
$$
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Recall&lt;/strong>: measures how well you find all the positives.
&lt;/p>
$$
\text{recall} = \frac{TP}{TP + FN}
$$
&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>For more details, see: &lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/ml-fundamentals/evaluation/">Evaluation&lt;/a>&lt;/p>
&lt;/blockquote>
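&lt;p>As a minimal sketch with made-up counts, precision and recall can be computed directly from the confusion-matrix entries:&lt;/p>
&lt;pre>&lt;code class="language-python"># Made-up counts for illustration only
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)  # fraction of predictions that are correct
recall = tp / (tp + fn)     # fraction of ground-truth positives that are found

print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.80, recall=0.67
&lt;/code>&lt;/pre>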
&lt;h2 id="iou-intersection-over-union">IoU (Intersection over union)&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>IoU measures the overlap between 2 boundaries.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/0*VnvOCo9NkWG705F3.png" alt="Image for post" style="zoom:67%;" />
&lt;/li>
&lt;li>
&lt;p>We use that to measure how much our predicted boundary overlaps with the ground truth (the real object boundary).&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*FrmKLxCtkokDC3Yr1wc70w.png" alt="Image for post" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
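&lt;p>A minimal sketch of the IoU computation for two axis-aligned boxes; the [x1, y1, x2, y2] corner format used here is only an assumption for illustration:&lt;/p>
&lt;pre>&lt;code class="language-python">def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    # Corners of the intersection rectangle
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])

    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175 = 0.142...
&lt;/code>&lt;/pre>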
&lt;h2 id="ap-average-precision">AP (Average Precision)&lt;/h2>
&lt;p>Let’s create an over-simplified example to demonstrate the calculation of average precision. In this example, the whole dataset contains 5 apples only.&lt;/p>
&lt;p>We collect all the predictions made for apples in all the images and rank them in descending order according to the predicted confidence level. The second column indicates whether the prediction is correct or not. In this example, the prediction is correct if IoU ≥ 0.5.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*9ordwhXD68cKCGzuJaH2Rg.png" alt="Image for post">&lt;/p>
&lt;p>Let&amp;rsquo;s look at the 3rd row:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Precision&lt;/strong>: proportion of TP (= 2/3 = 0.67)&lt;/li>
&lt;li>&lt;strong>Recall&lt;/strong>: proportion of TP out of the possible positives (= 2/5 = 0.4)&lt;/li>
&lt;/ul>
&lt;p>Recall values increase as we go down the prediction ranking. However, precision has a &lt;em>zigzag&lt;/em> pattern — it goes down with false positives and goes up again with true positives.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*ODZ6eZMrie3XVTOMDnXTNQ.jpeg" alt="Image for post">&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*VenTq4IgxjmIpOXWdFb-jg.png" alt="Image for post" />
&lt;p>The general definition for the &lt;strong>Average Precision (AP)&lt;/strong> is finding the &lt;strong>area under the precision-recall curve&lt;/strong> above.
&lt;/p>
$$
\mathrm{AP}=\int\_{0}^{1} p(r) d r
$$
&lt;h3 id="smoothing-the-precision-recall-curve">Smoothing the Precision-Recall-Curve&lt;/h3>
&lt;p>Before calculating AP for the object detection, we often &lt;strong>smooth&lt;/strong> out the zigzag pattern first: at each recall level, we replace each precision value with the maximum precision value to the right of that recall level.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*pmSxeb4EfdGnzT6Xa68GEQ.jpeg" alt="Image for post">&lt;/p>
&lt;p>The orange line is transformed into the green lines and the curve will decrease monotonically instead of the zigzag pattern.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Before smoothing&lt;/th>
&lt;th>After smoothing&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*VenTq4IgxjmIpOXWdFb-jg.png" alt="Image for post" style="zoom: 67%;" />&lt;/td>
&lt;td>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*zqTL1KW1gwzion9jY8SjHA-20201112121009694.png" alt="Image for post" style="zoom:67%;" />&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Mathematically, we replace the precision value at recall $r$ with the maximum precision for any recall $\tilde{r} \geq r$.
&lt;/p>
$$
p\_{\text {interp}}(r)=\max\_{\tilde{r} \geq r} p(\tilde{r})
$$
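&lt;p>A minimal sketch of this smoothing step, assuming the precision values are ordered by increasing recall:&lt;/p>
&lt;pre>&lt;code class="language-python">def smooth_precision(precisions):
    """Replace each precision with the maximum precision at any higher recall."""
    smoothed = list(precisions)
    for i in range(len(smoothed) - 2, -1, -1):  # sweep from right to left
        smoothed[i] = max(smoothed[i], smoothed[i + 1])
    return smoothed

# Made-up zigzag precisions ordered by increasing recall
print(smooth_precision([1.0, 0.5, 0.67, 0.5, 0.4, 0.5]))
# [1.0, 0.67, 0.67, 0.5, 0.5, 0.5]  -- monotonically decreasing
&lt;/code>&lt;/pre>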
&lt;h3 id="interpolated-ap">&lt;strong>Interpolated AP&lt;/strong>&lt;/h3>
&lt;p>PASCAL VOC is a popular dataset for object detection. In PASCAL VOC2008, the average of the 11-point interpolated precision is calculated.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*naz02wO-XMywlwAdFzF-GA.jpeg" alt="Image for post">&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Divide the recall value from 0 to 1.0 into 11 points — 0, 0.1, 0.2, …, 0.9 and 1.0.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Compute the average of the maximum (interpolated) precision values at these 11 recall points (a code sketch follows this list).
&lt;/p>
$$
\begin{aligned}
A P &amp;=\frac{1}{11} \sum\_{r \in\\{0.0, \ldots, 1.0\\}} A P\_{r} \\\\
&amp;=\frac{1}{11} \sum\_{r \in\\{0.0, \ldots, 1.0\\}} p\_{\text {interp}}(r)
\end{aligned}
$$
&lt;ul>
&lt;li>
&lt;p>In our example:&lt;/p>
&lt;p>$AP = \frac{1}{11} \times (5 \times 1.0 + 4 \times 0.57 + 2 \times 0.5)$&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
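&lt;p>A minimal sketch of the 11-point interpolation (not the official VOC evaluation code), given paired recall and precision lists:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def ap_11_point(recalls, precisions):
    """Average of the interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    recalls, precisions = np.asarray(recalls), np.asarray(precisions)
    ap = 0.0
    for t in np.linspace(0.0, 1.0, 11):
        mask = recalls >= t
        p_interp = precisions[mask].max() if mask.any() else 0.0
        ap += p_interp / 11.0
    return ap
&lt;/code>&lt;/pre>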
&lt;p>However, this interpolated method is an approximation that suffers from two issues.&lt;/p>
&lt;ul>
&lt;li>It is less precise.&lt;/li>
&lt;li>It loses the ability to measure the difference between methods with low AP.&lt;/li>
&lt;/ul>
&lt;p>Therefore, a different AP calculation is adopted after 2008 for PASCAL VOC.&lt;/p>
&lt;h3 id="ap-area-under-curve-auc">AP (Area under curve AUC)&lt;/h3>
&lt;p>For later Pascal VOC competitions, VOC2010–2012 samples the curve at all unique recall values (&lt;em>r₁, r₂, …&lt;/em>), whenever the maximum precision value drops. With this change, we are &lt;strong>measuring the exact area under the precision-recall curve&lt;/strong> after the zigzags are removed.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*TAuQ3UOA8xh_5wI5hwLHcg.jpeg" alt="Image for post">&lt;/p>
&lt;p>No approximation or interpolation is needed &amp;#x1f44f;. Instead of sampling 11 points, &lt;strong>we sample $p(r\_i)$ whenever it drops and compute AP as the sum of the rectangular blocks&lt;/strong>.
&lt;/p>
$$
\begin{array}{l}
p\_{\text {interp}}\left(r\_{n+1}\right)=\displaystyle{\max\_{\tilde{r} \geq r\_{n+1}}} p(\tilde{r}) \\\\
\mathrm{AP}=\sum\left(r\_{n+1}-r\_{n}\right) p\_{\text {interp}}\left(r\_{n+1}\right)
\end{array}
$$
&lt;p>
This definition is called the &lt;strong>Area Under Curve (AUC)&lt;/strong>.&lt;/p>
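&lt;p>A minimal sketch of this area-under-curve computation (mirroring the usual VOC-style procedure, not the official code), assuming recalls are sorted in increasing order with matching precisions:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def ap_area_under_curve(recalls, precisions):
    """AP as the exact area under the smoothed precision-recall curve."""
    r = np.concatenate(([0.0], recalls, [1.0]))     # sentinel recall values
    p = np.concatenate(([0.0], precisions, [0.0]))  # sentinel precision values

    # Smooth: precision at each point = max precision at any higher recall
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])

    # Sum rectangular blocks wherever recall increases
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
&lt;/code>&lt;/pre>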
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://jonathan-hui.medium.com/map-mean-average-precision-for-object-detection-45c121a31173">&lt;strong>mAP (mean Average Precision) for Object Detection&lt;/strong>&lt;/a> &amp;#x1f44d;&lt;/li>
&lt;li>&lt;a href="https://medium.com/@yanfengliux/the-confusing-metrics-of-ap-and-map-for-object-detection-3113ba0386ef">The Confusing Metrics of AP and mAP for Object Detection / Instance Segmentation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://zhuanlan.zhihu.com/p/56961620">详解object detection中的mAP&lt;/a>​ &amp;#x1f44d;&lt;/li>
&lt;/ul></description></item><item><title>COCO JSON Format for Object Detection</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/coco-dataset-format/</link><pubDate>Wed, 02 Dec 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/coco-dataset-format/</guid><description>&lt;p>The COCO dataset is formatted in &lt;a href="https://www.w3schools.com/js/js_json_syntax.asp">JSON&lt;/a> and is a collection of “info”, “licenses”, “images”, “annotations”, “categories” (in most cases), and “segment info” (in one case).&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-json" data-lang="json">&lt;span class="line">&lt;span class="cl">&lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;info&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">{&lt;/span>&lt;span class="err">...&lt;/span>&lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;licenses&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="err">...&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;images&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="err">...&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;annotations&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="err">...&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;categories&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="err">...&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="err">&amp;lt;--&lt;/span> &lt;span class="err">Not&lt;/span> &lt;span class="err">in&lt;/span> &lt;span class="err">Captions&lt;/span> &lt;span class="err">annotations&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;segment_info&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="err">...&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="err">&amp;lt;--&lt;/span> &lt;span class="err">Only&lt;/span> &lt;span class="err">in&lt;/span> &lt;span class="err">Panoptic&lt;/span> &lt;span class="err">annotations&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Note:&lt;/p>
&lt;ul>
&lt;li>&lt;code>categories&lt;/code> field is NOT in Captions annotations&lt;/li>
&lt;li>&lt;code>segment_info&lt;/code> field is ONLY in Panoptic annotations&lt;/li>
&lt;/ul>
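&lt;p>As a minimal sketch (the file path below is only a placeholder), an annotation file in this format can be loaded with the standard &lt;code>json&lt;/code> module and inspected:&lt;/p>
&lt;pre>&lt;code class="language-python">import json

# "instances_val2017.json" is a placeholder path to a COCO-style annotation file
with open("instances_val2017.json") as f:
    coco = json.load(f)

print(list(coco.keys()))  # ['info', 'licenses', 'images', 'annotations', 'categories']
print(len(coco["images"]), "images,", len(coco["annotations"]), "annotations")
&lt;/code>&lt;/pre>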
&lt;h2 id="info">Info&lt;/h2>
&lt;p>The “info” section contains &lt;strong>high level information&lt;/strong> about the dataset. If you are creating your own dataset, you can fill in whatever is appropriate.&lt;/p>
&lt;p>Example:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-json" data-lang="json">&lt;span class="line">&lt;span class="cl">&lt;span class="s2">&amp;#34;info&amp;#34;&lt;/span>&lt;span class="err">:&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;description&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;COCO 2017 Dataset&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;url&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;http://cocodataset.org&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;version&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;1.0&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;year&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">2017&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;contributor&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;COCO Consortium&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;date_created&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;2017/09/01&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="lincenses">Lincenses&lt;/h2>
&lt;p>The “licenses” section contains a &lt;strong>list&lt;/strong> of image licenses that apply to images in the dataset.&lt;/p>
&lt;p>Example:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-json" data-lang="json">&lt;span class="line">&lt;span class="cl">&lt;span class="s2">&amp;#34;licenses&amp;#34;&lt;/span>&lt;span class="err">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;url&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;http://creativecommons.org/licenses/by-nc-sa/2.0/&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Attribution-NonCommercial-ShareAlike License&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;url&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;http://creativecommons.org/licenses/by-nc/2.0/&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Attribution-NonCommercial License&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="err">...&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="images">Images&lt;/h2>
&lt;ul>
&lt;li>Contains the &lt;strong>complete list&lt;/strong> of images in your dataset&lt;/li>
&lt;li>No labels, bounding boxes, or segmentations are specified in this part; it&amp;rsquo;s simply a list of images and information about each one.&lt;/li>
&lt;li>&lt;code>coco_url&lt;/code>, &lt;code>flickr_url&lt;/code>, and &lt;code>date_captured&lt;/code> are just for reference. Your deep learning application probably will only need the &lt;strong>&lt;code>file_name&lt;/code>&lt;/strong>.&lt;/li>
&lt;li>Image ids need to be &lt;strong>unique&lt;/strong> (among other images)&lt;/li>
&lt;li>They do not necessarily need to match the file name (unless the deep learning code you are using makes an assumption that they’ll be the same)&lt;/li>
&lt;/ul>
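&lt;p>As a minimal sketch (assuming &lt;code>coco&lt;/code> is the annotation dict loaded earlier), the images list can be indexed by id to look up the file name that training code needs:&lt;/p>
&lt;pre>&lt;code class="language-python"># Build a lookup table from image id to image record
images_by_id = {img["id"]: img for img in coco["images"]}

img = images_by_id[397133]
print(img["file_name"], img["width"], img["height"])
&lt;/code>&lt;/pre>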
&lt;p>Example:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-json" data-lang="json">&lt;span class="line">&lt;span class="cl">&lt;span class="s2">&amp;#34;images&amp;#34;&lt;/span>&lt;span class="err">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;license&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">4&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;file_name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;000000397133.jpg&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;coco_url&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;http://images.cocodataset.org/val2017/000000397133.jpg&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;height&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">427&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;width&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">640&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;date_captured&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;2013-11-14 17:02:52&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;flickr_url&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;http://farm7.staticflickr.com/6116/6255196340_da26cf2c9e_z.jpg&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">397133&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;license&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;file_name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;000000037777.jpg&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;coco_url&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;http://images.cocodataset.org/val2017/000000037777.jpg&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;height&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">230&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;width&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">352&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;date_captured&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;2013-11-14 20:55:31&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;flickr_url&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;http://farm9.staticflickr.com/8429/7839199426_f6d48aa585_z.jpg&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">37777&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="err">...&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="annotations">Annotations&lt;/h2>
&lt;p>COCO has five annotation types: for &lt;a href="http://cocodataset.org/#detection-2018">object detection&lt;/a>, &lt;a href="http://cocodataset.org/#keypoints-2018">keypoint detection&lt;/a>, &lt;a href="http://cocodataset.org/#stuff-2018">stuff segmentation&lt;/a>, &lt;a href="http://cocodataset.org/#panoptic-2018">panoptic segmentation&lt;/a>, and &lt;a href="http://cocodataset.org/#captions-2015">image captioning&lt;/a>. The annotations are stored using &lt;a href="http://json.org/">JSON&lt;/a>.&lt;/p>
&lt;h3 id="object-detection">Object detection&lt;/h3>
&lt;p>An object detection annotation draws shapes around objects in an image. It has a list of &lt;strong>categories&lt;/strong> and &lt;strong>annotations&lt;/strong>.&lt;/p>
&lt;h4 id="categories">Categories&lt;/h4>
&lt;ul>
&lt;li>Contains a list of &lt;strong>categories&lt;/strong> (e.g. dog, boat)
&lt;ul>
&lt;li>each of those belongs to a &lt;strong>supercategory&lt;/strong> (e.g. animal, vehicle).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>The original COCO dataset contains 80 object categories (with category ids ranging from 1 to 90).&lt;/li>
&lt;li>You can use the existing COCO categories or create an entirely new list of your own.&lt;/li>
&lt;li>&lt;strong>Each category id must be unique (among the rest of the categories).&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>Example:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-json" data-lang="json">&lt;span class="line">&lt;span class="cl">&lt;span class="s2">&amp;#34;categories&amp;#34;&lt;/span>&lt;span class="err">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;supercategory&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;person&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;person&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;supercategory&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;vehicle&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;bicycle&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;supercategory&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;vehicle&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">3&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;car&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="err">...&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="annotations-1">Annotations&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>&lt;code>segmentation&lt;/code> : list of points (represented as $(x, y)$ coordinates) that define the shape of the object&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>area&lt;/code> : measured in pixels (e.g. a 10px by 20px box would have an area of 200)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>iscrowd&lt;/code> : specifies whether the segmentation is for a single object (&lt;code>iscrowd=0&lt;/code>) or for a group/cluster of objects (&lt;code>iscrowd=1&lt;/code>)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>image_id&lt;/code>: corresponds to a specific image in the dataset&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>bbox&lt;/code> : bounding box, format is &lt;code>[top left x position, top left y position, width, height]&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>category_id&lt;/code>: corresponds to a single category specified in the categories section&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>id&lt;/code>: Each annotation also has an id (unique to all other annotations in the dataset)&lt;/p>
&lt;/li>
&lt;/ul>
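&lt;p>A minimal sketch (again assuming &lt;code>coco&lt;/code> is the loaded annotation dict) that groups annotations by &lt;code>image_id&lt;/code> and converts the COCO &lt;code>[x, y, width, height]&lt;/code> bbox into corner format:&lt;/p>
&lt;pre>&lt;code class="language-python">from collections import defaultdict

def xywh_to_xyxy(bbox):
    """COCO bbox [top-left x, top-left y, width, height] to [x1, y1, x2, y2]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]

anns_by_image = defaultdict(list)
for ann in coco["annotations"]:
    anns_by_image[ann["image_id"]].append(ann)

for ann in anns_by_image[289343]:
    print(ann["category_id"], xywh_to_xyxy(ann["bbox"]))
&lt;/code>&lt;/pre>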
&lt;p>Example:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-json" data-lang="json">&lt;span class="line">&lt;span class="cl">&lt;span class="s2">&amp;#34;annotations&amp;#34;&lt;/span>&lt;span class="err">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;segmentation&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[[&lt;/span>&lt;span class="mf">510.66&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mf">423.01&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mf">511.72&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mf">420.03&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="err">...&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mf">510.45&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mf">423.01&lt;/span>&lt;span class="p">]],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;area&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mf">702.1057499999998&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;iscrowd&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;image_id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">289343&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;bbox&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="mf">473.07&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mf">395.93&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mf">38.65&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mf">28.67&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;category_id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">18&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1768&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="err">...&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>Has a segmentation list of vertices (x, y pixel positions)&lt;/li>
&lt;li>Has an area of 702 pixels (pretty small) and a bounding box of [473.07,395.93,38.65,28.67]&lt;/li>
&lt;li>Is not a crowd (meaning it’s a single object)&lt;/li>
&lt;li>Has category id 18 (which is a dog)&lt;/li>
&lt;li>Corresponds with an image with id 289343 (which is a person on a strange bicycle and a tiny dog)&lt;/li>
&lt;/ul>
&lt;h2 id="example">Example&lt;/h2>
&lt;p>Source: &lt;a href="https://roboflow.com/formats/coco-json">https://roboflow.com/formats/coco-json&lt;/a>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-json" data-lang="json">&lt;span class="line">&lt;span class="cl">&lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;info&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;year&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;2020&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;version&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;1&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;description&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Exported from roboflow.ai&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;contributor&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Roboflow&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;url&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;https://app.roboflow.ai/datasets/hard-hat-sample/1&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;date_created&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;2000-01-01T00:00:00+00:00&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;licenses&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;url&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;https://creativecommons.org/publicdomain/zero/1.0/&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Public Domain&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;categories&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Workers&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;supercategory&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;none&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;head&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;supercategory&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Workers&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;helmet&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;supercategory&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Workers&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">3&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;person&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;supercategory&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Workers&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;images&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;license&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;file_name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;0001.jpg&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;height&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">275&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;width&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">490&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;date_captured&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;2020-07-20T19:39:26+00:00&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;annotations&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;image_id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;category_id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;bbox&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">45&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">85&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">85&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;area&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">7225&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;segmentation&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;iscrowd&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;image_id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;category_id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;bbox&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">324&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">29&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">72&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">81&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;area&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">5832&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;segmentation&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;iscrowd&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="https://cocodataset.org/#format-data">COCO Data format&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://www.immersivelimit.com/tutorials/create-coco-annotations-from-scratch/#coco-dataset-format">Create COCO Annotations From Scratch&lt;/a>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Video tutorial&lt;/p>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/h6s61a_pqfM?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>You Only Look Once (YOLO)</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/yolo/</link><pubDate>Wed, 04 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/yolo/</guid><description>&lt;p>The problem with the sliding-windows method is that it does not output the most accurate bounding boxes. A good way to get more accurate bounding boxes is the &lt;strong>YOLO (You Only Look Once)&lt;/strong> algorithm.&lt;/p>
&lt;h2 id="overview-how-does-yolo-work">Overview: How does YOLO work?&lt;/h2>
&lt;p>Let&amp;rsquo;s say we have an input image (e.g. 100x100). We&amp;rsquo;re going to place a grid on this image. For simplicity of illustration, we&amp;rsquo;re going to use a 3x3 grid as an example.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-05%2011.39.03.png" alt="截屏2020-11-05 11.39.03" style="zoom:80%;" />
&lt;p>(In an actual implementation, we&amp;rsquo;ll use a finer grid, e.g. 19x19.)&lt;/p>
&lt;h3 id="labels-for-training">Labels for training&lt;/h3>
&lt;p>For &lt;strong>each&lt;/strong> grid cell, we specify a target label $\mathbf{y}$:
&lt;/p>
$$
\mathbf{y} = \left(
\begin{array}{c}
P\_c \\\\
b\_x \\\\
b\_y \\\\
b\_h \\\\
b\_w \\\\
c\_1 \\\\
c\_2 \\\\
\vdots \\\\
c\_n
\end{array}
\right)
\in \mathbb{R}^{5 + n}
$$
&lt;ul>
&lt;li>
&lt;p>$P\_c$: objectness&lt;/p>
&lt;ul>
&lt;li>depends on whether there&amp;rsquo;s an object in that grid cell.&lt;/li>
&lt;li>If yes, then $P\_c = 1$; otherwise $P\_c = 0$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Bounding box coordinates&lt;/p>
&lt;ul>
&lt;li>$b\_x, b\_y \in (0, 1)$: describe the center point of the object &lt;strong>relative&lt;/strong> to the grid cell
&lt;ul>
&lt;li>If $>1$, then the center point is outside of the current grid cell and it should be assigned to another grid cell&lt;/li>
&lt;li>Some parameterizations also use Sigmoid function to ensure $b\_x, b\_y \in (0, 1)$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>$b\_h, b\_w$: height and width of the bounding box,
&lt;ul>
&lt;li>specified as fractions of the grid cell&amp;rsquo;s height and width, respectively (can be $\geq 1$)&lt;/li>
&lt;li>Some parameterizations also use exponential function to ensure non-negativity&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>$c\_1, c\_2, \dots, c\_n$: class probabilities for the $n$ object classes we want to detect&lt;/p>
&lt;ul>
&lt;li>
&lt;p>E.g. we want to detect 3 classes of object:&lt;/p>
&lt;ul>
&lt;li>pedestrian ($c\_1$),&lt;/li>
&lt;li>car ($c\_2$),&lt;/li>
&lt;li>motorcycle ($c\_3$),&lt;/li>
&lt;/ul>
&lt;p>so our target $\mathbf{y}$ will be:
&lt;/p>
$$
\mathbf{y} = \left(
\begin{array}{c}
P\_c \\\\
b\_x \\\\
b\_y \\\\
b\_h \\\\
b\_w \\\\
c\_1 \\\\
c\_2 \\\\
c\_3
\end{array}
\right)
\in \mathbb{R}^{8}
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="example">Example&lt;/h4>
&lt;p>If we consider the upper left grid cell (at position $(0, 0)$)&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-05%2011.58.40.png" alt="截屏2020-11-05 11.58.40" style="zoom:80%;" />
&lt;p>There&amp;rsquo;s no object in this grid cell, so $P\_c = 0$, and we don&amp;rsquo;t care about the rest of the elements of $\mathbf{y}$:
&lt;/p>
$$
\mathbf{y} = \left(
\begin{array}{c}
0 \\\\
? \\\\
? \\\\
? \\\\
? \\\\
? \\\\
? \\\\
?
\end{array}
\right)
\in \mathbb{R}^{8}
$$
&lt;blockquote>
&lt;p>Here we use the symbol &lt;code>?&lt;/code>​ to mark &amp;ldquo;don&amp;rsquo;t care&amp;rdquo;.&lt;/p>
&lt;p>However, a neural network can&amp;rsquo;t output a question mark or a &amp;ldquo;don&amp;rsquo;t care&amp;rdquo;, so we&amp;rsquo;ll put some numbers for the rest. These numbers will basically be ignored, because the neural network is telling you that there&amp;rsquo;s no object there, so it doesn&amp;rsquo;t really matter whether the output describes a bounding box or a car. They are basically just some set of numbers, more or less noise.&lt;/p>
&lt;/blockquote>
&lt;p>Now, how about the grid cells in the second row?&lt;/p>
&lt;p>To give a bit more detail, this image has two objects. And what the YOLO algorithm does is&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>it takes the midpoint of each of the two objects and then assigns each object to the grid cell containing its midpoint.&lt;/strong> So the left car is assigned to the grid cell marked with green, and the car on the right is assigned to the grid cell marked with yellow.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-05%2012.18.07.png" alt="截屏2020-11-05 12.18.07" style="zoom:80%;" />
&lt;ul>
&lt;li>For the left grid cell marked with green, the target label $\mathbf{y}$ would be as follows:
$$
\mathbf{y} = \left(
\begin{array}{c}
1 \\\\
b\_x \\\\
b\_y \\\\
b\_h \\\\
b\_w \\\\
0 \\\\
1 \\\\
0
\end{array}
\right)
$$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Even though the central grid cell has some parts of both cars, we&amp;rsquo;ll pretend the central grid cell has &lt;strong>no&lt;/strong> interesting object. So the class label of the central grid cell is
&lt;/p>
$$
\mathbf{y} = \left(
\begin{array}{c}
0 \\\\
? \\\\
? \\\\
? \\\\
? \\\\
? \\\\
? \\\\
?
\end{array}
\right)
$$
&lt;/li>
&lt;/ul>
&lt;p>For each of these 9 grid cells, we end up with an 8-dimensional output vector. So the total target output volume is $(3 \times 3) \times 8$.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/yolo1.png" alt="yolo1">&lt;/p>
&lt;p>&lt;strong>Generally speaking, assuming that we have $n \times n$ grid cells, and we want to detect $C$ classes of objects, then the target output volume will be $(n \times n) \times (5 + C)$.&lt;/strong>&lt;/p>
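&lt;p>As a minimal sketch (not the official YOLO implementation), a $3 \times 3 \times 8$ target volume for a toy image with two made-up car boxes could be built like this:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

GRID, NUM_CLASSES = 3, 3
# Made-up ground truth: (center_x, center_y, width, height, class_index),
# all coordinates normalized to [0, 1] relative to the image; class 1 = car
objects = [
    (0.25, 0.60, 0.30, 0.25, 1),  # car on the left
    (0.80, 0.55, 0.25, 0.20, 1),  # car on the right
]

target = np.zeros((GRID, GRID, 5 + NUM_CLASSES))
for cx, cy, w, h, cls in objects:
    col, row = int(cx * GRID), int(cy * GRID)     # grid cell containing the midpoint
    bx, by = cx * GRID - col, cy * GRID - row     # midpoint relative to that cell
    bh, bw = h * GRID, w * GRID                   # size as a fraction of the cell
    target[row, col, :5] = [1.0, bx, by, bh, bw]  # P_c = 1 plus the box
    target[row, col, 5 + cls] = 1.0               # one-hot class label

print(target.shape)  # (3, 3, 8)
&lt;/code>&lt;/pre>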
&lt;h3 id="training">Training&lt;/h3>
&lt;p>To train our neural network, the input is $100 \times 100 \times 3$. We then use a usual ConvNet with conv layers, max-pool layers, and so on, so that it eventually maps to a $3 \times 3 \times 8$ output volume. Given an input image $X$ and its target labels $\mathbf{y}$ of shape $3 \times 3 \times 8$, we use backpropagation to train the network to map any input $X$ to this type of output volume $\mathbf{y}$.&lt;/p>
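&lt;p>A minimal sketch of such a network in PyTorch; the layer sizes are arbitrary and this is not the original YOLO architecture:&lt;/p>
&lt;pre>&lt;code class="language-python">import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(3),          # collapse spatial dims to the 3x3 grid
    nn.Conv2d(32, 8, kernel_size=1),  # 8 channels: P_c, b_x, b_y, b_h, b_w, c_1..c_3
)

x = torch.randn(1, 3, 100, 100)     # one 100x100 RGB image
y = model(x)                        # shape: (1, 8, 3, 3)
print(y.permute(0, 2, 3, 1).shape)  # (1, 3, 3, 8), comparable with the target labels
&lt;/code>&lt;/pre>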
&lt;h3 id="thumbsup-advantages">&amp;#x1f44d; Advantages&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>The neural network outputs precise bounding boxes &amp;#x1f44f;&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Efficient and fast thanks to convolution operations &amp;#x1f44f;&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="intersection-over-union-iou">Intersection over Union (IoU)&lt;/h2>
&lt;p>How can we tell whether our object detection algorithm is working well?&lt;/p>
&lt;p>The &lt;strong>Intersection-over-Union (IoU)&lt;/strong>, aka Jaccard Index or Jaccard Overlap, measures the degree or extent to which two boxes overlap.&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/IoU.jpg">&lt;figcaption>
&lt;h4>Intersection over Union (IoU). Src: [a-PyTorch-Tutorial-to-Object-Detection](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection)&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>In object detection:
&lt;/p>
$$
\text{IoU} = \frac{\text{Overlapping region between ground truth and prediction bounding box}}{\text{Combined region of ground truth and prediction bounding box}}
$$
&lt;p>
If $\text{IoU} \geq \text{threshold}$, we would say the prediction is correct.&lt;/p>
&lt;p>By convention, $\text{threshold} = 0.5$. We can also choose other values greater than 0.5.&lt;/p>
&lt;p>Example:&lt;/p>
&lt;figure>&lt;img src="https://media5.datahacker.rs/2018/11/IoU.png">&lt;figcaption>
&lt;h4>IoU example. Src: [026 CNN Intersection over Union | Master Data Science](https://www.google.com/url?sa=i&amp;amp;url=http%3A%2F%2Fdatahacker.rs%2Fdeep-learning-intersection-over-union%2F&amp;amp;psig=AOvVaw2K4pvRAkwPw3FZYIelxngf&amp;amp;ust=1604671149058000&amp;amp;source=images&amp;amp;cd=vfe&amp;amp;ved=0CA0QjhxqFwoTCIjNgoLI6-wCFQAAAAAdAAAAABAg)&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;h2 id="non-max-suppresion">Non-max suppresion&lt;/h2>
&lt;p>One problem we haven&amp;rsquo;t addressed so far is that YOLO can detect the same object multiple times.&lt;/p>
&lt;p>For example:&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/non-max-suppression.png">&lt;figcaption>
&lt;h4>Each car has two or more detections with different probabilities. The reason is that several grid cells think they contain the center point of the object. Src: [a-PyTorch-Tutorial-to-Object-Detection](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection)&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>&lt;strong>Non-max Suppression&lt;/strong> is a way to make sure that YOLO detects the object just once. It cleans up redundant detections, so we end up with just one detection per object rather than multiple detections per object.&lt;/p>
&lt;ol>
&lt;li>Takes the detection with the largest $P\_c$ (the probability of a detection) &lt;em>(&amp;ldquo;That&amp;rsquo;s my most confident detection, so let&amp;rsquo;s highlight that and just say I found the car there.&amp;rdquo;)&lt;/em>&lt;/li>
&lt;li>Looks at all of the remaining boxes; any with a high overlap (i.e. a high IoU) with the selected detection are suppressed/discarded (a minimal sketch follows this list)&lt;/li>
&lt;/ol>
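&lt;p>A minimal sketch of per-class non-max suppression; the corner box format and the 0.5 threshold are illustrative assumptions:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def box_iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the most confident detection, drop overlapping ones, repeat."""
    order = np.argsort(scores)[::-1]   # most confident detection first
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        overlaps = np.array([box_iou(boxes[best], boxes[i]) for i in rest])
        order = rest[overlaps &lt;= iou_threshold]  # discard high-overlap boxes
    return keep
&lt;/code>&lt;/pre>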
&lt;p>Example:&lt;/p>
&lt;figure>&lt;img src="https://www.jeremyjordan.me/content/images/2018/07/Screen-Shot-2018-07-10-at-9.46.29-PM.png">&lt;figcaption>
&lt;h4>Non-max suppression example. Src: [An overview of object detection: one-stage methods.](https://www.jeremyjordan.me/object-detection-one-stage/)&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>For multi-class detection, non-max suppression should be carried out &lt;strong>on each class separately&lt;/strong>.&lt;/p>
&lt;h2 id="anchor-box">Anchor box&lt;/h2>
&lt;p>One of the problems with object detection as we have seen it so far is that &lt;strong>each of the grid cells can detect only one object&lt;/strong>. What if a grid cell wants to detect multiple objects?&lt;/p>
&lt;p>For example: we want to detect 3 classes (pedestrians, cars, motorcycles), and our input image looks like this:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-05%2015.24.05.png" alt="截屏2020-11-05 15.24.05">&lt;/p>
&lt;p>The midpoint of the pedestrian and the midpoint of the car are in almost the same place, and both of them fall into the same grid cell. With the output vector
&lt;/p>
$$
\mathbf{y} = \left(
\begin{array}{c}
P\_c \\\\
b\_x \\\\
b\_y \\\\
b\_h \\\\
b\_w \\\\
c\_1 \\\\
c\_2 \\\\
c\_3
\end{array}
\right)
$$
&lt;p>
we have seen before, the cell won&amp;rsquo;t be able to output two detections &amp;#x1f622;.&lt;/p>
&lt;p>With the idea of &lt;strong>anchor boxes&lt;/strong>, we are going to&lt;/p>
&lt;ul>
&lt;li>pre-define a number of different shapes of anchor boxes (in this example, just 2)&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/anchor-box.png" alt="anchor-box">&lt;/p>
&lt;ul>
&lt;li>
&lt;p>and associate them with the class labels
&lt;/p>
$$
\mathbf{y} = \left(\underbrace{P\_c, b\_x, b\_y, b\_h, b\_w, c\_1, c\_2, c\_3}\_{\text{anchor box 1}} , \underbrace{P\_c, b\_x, b\_y, b\_h, b\_w, c\_1, c\_2, c\_3}\_{\text{anchor box 2}}\right)^T \in \mathbb{R}^{16}
$$
&lt;ul>
&lt;li>Because the shape of the pedestrian is more similar to the shape of anchor box 1 than anchor box 2, we can use the first eight numbers to encode pedestrian.&lt;/li>
&lt;li>Because the box around the car is more similar to the shape of anchor box 2 than anchor box 1, we can then use the second 8 numbers to encode that the second object here is the car&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>To summarise, with a number of pre-defined anchor boxes: Each object in training image is assigned to&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>the grid cell that contains the object&amp;rsquo;s midpoint, and&lt;/strong>&lt;/li>
&lt;li>&lt;strong>the anchor box of that grid cell with the highest IoU with the ground-truth bounding box&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>In other words, now the object is assigned to a $(\text{grid cell}, \text{anchor box})$ pair.&lt;/strong>&lt;/p>
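&lt;p>A minimal sketch of this assignment rule, assuming relative $(x, y, w, h)$ ground-truth boxes, a square grid, and a shape-only IoU (all of which are illustration choices, not part of the original description):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">def shape_iou(wh1, wh2):
    """IoU of two boxes that share the same center, i.e. compared by shape only."""
    w1, h1 = wh1
    w2, h2 = wh2
    inter = min(w1, w2) * min(h1, h2)
    return inter / (w1 * h1 + w2 * h2 - inter)


def assign_object(gt_box, grid_size, anchors):
    """
    gt_box:  (x_center, y_center, width, height), all relative to the image (0..1)
    anchors: list of (width, height) anchor shapes, also relative to the image
    Returns the (grid_row, grid_col, anchor_index) pair the object is assigned to.
    """
    x, y, w, h = gt_box
    # 1. the grid cell that contains the object's midpoint
    row, col = int(y * grid_size), int(x * grid_size)
    # 2. the anchor box whose shape has the highest IoU with the ground-truth box
    best_anchor = max(range(len(anchors)), key=lambda i: shape_iou((w, h), anchors[i]))
    return row, col, best_anchor
&lt;/code>&lt;/pre>&lt;/div>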
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">&lt;p>If&lt;/p>
&lt;ul>
&lt;li>we have pre-defined $B$ anchor boxes of different shapes&lt;/li>
&lt;li>the image is divided into an $n \times n$ grid of cells&lt;/li>
&lt;li>we want to detect $C$ classes of objects&lt;/li>
&lt;/ul>
&lt;p>Then the output volume will be
&lt;/p>
$$
(n \times n) \times B(5 + C)
$$
&lt;/span>
&lt;/div>
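&lt;p>For example, with a $3 \times 3$ grid ($n = 3$), $B = 2$ anchor boxes, and $C = 3$ classes, the output volume is $3 \times 3 \times 2 \times (5 + 3) = 3 \times 3 \times 16$.&lt;/p>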
&lt;h3 id="how-to-choose-the-anchor-boxes">How to choose the anchor boxes?&lt;/h3>
&lt;ul>
&lt;li>People used to choose them &lt;strong>by hand&lt;/strong>, picking maybe 5 or 10 anchor box shapes that span a variety of shapes and seem to cover the types of objects to detect&lt;/li>
&lt;li>A better way is to use a &lt;strong>K-means&lt;/strong> algorithm to cluster the object shapes that occur in the training data and use the cluster centers as anchor boxes (this is done in the later YOLO research papers); see the sketch after this list&lt;/li>
&lt;/ul>
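&lt;p>A minimal sketch of this clustering idea, assuming the ground-truth boxes are given as relative $(w, h)$ pairs and using $1 - \text{IoU}$ as the distance (as in the YOLO papers) instead of the Euclidean distance; the function names are placeholders:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">import numpy as np


def shape_iou(boxes_wh, anchor_wh):
    """IoU between each (w, h) box and one anchor shape, with all boxes centered at the origin."""
    inter = np.minimum(boxes_wh[:, 0], anchor_wh[0]) * np.minimum(boxes_wh[:, 1], anchor_wh[1])
    union = boxes_wh[:, 0] * boxes_wh[:, 1] + anchor_wh[0] * anchor_wh[1] - inter
    return inter / union


def kmeans_anchors(boxes_wh, k=5, iterations=100):
    """Cluster ground-truth box shapes into k anchor boxes using 1 - IoU as the distance."""
    boxes_wh = np.asarray(boxes_wh, dtype=float)
    anchors = boxes_wh[np.random.choice(len(boxes_wh), k, replace=False)]
    for _ in range(iterations):
        # distance of every box shape to every anchor shape
        dists = np.stack([1 - shape_iou(boxes_wh, a) for a in anchors], axis=1)
        assignment = dists.argmin(axis=1)
        # move each anchor to the median shape of its cluster
        for i in range(k):
            if np.any(assignment == i):
                anchors[i] = np.median(boxes_wh[assignment == i], axis=0)
    return anchors
&lt;/code>&lt;/pre>&lt;/div>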
&lt;h2 id="putting-them-all-together">Putting them all together&lt;/h2>
&lt;p>Suppose we&amp;rsquo;re trying to train a model to detect three classes of objects:&lt;/p>
&lt;ul>
&lt;li>pedestrians&lt;/li>
&lt;li>cars&lt;/li>
&lt;li>motorcycles&lt;/li>
&lt;/ul>
&lt;p>And the input image looks like this:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-05%2016.07.04.png" alt="截屏2020-11-05 16.07.04" style="zoom:80%;" />
&lt;p>Suppose we have pre-defined two anchor boxes of different shapes:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/anchor-box.png" alt="anchor-box">&lt;/p>
&lt;p>Anchor box 2 has a higher IoU with the ground-truth bounding box of the car, so the car is encoded in the second half of the target vector:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/YOLO.png" alt="YOLO" style="zoom:80%;" />
&lt;p>The final output volume is $3 \times 3 \times 2 \times 8$&lt;/p>
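&lt;p>To make the shape concrete, here is a small sketch (with hypothetical variable names and random numbers standing in for the real network output) of how a flat $3 \times 3 \times 16$ prediction can be viewed as the $3 \times 3 \times 2 \times 8$ volume and read back per grid cell and anchor box:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">import numpy as np

grid, num_anchors, num_values = 3, 2, 8   # 8 = P_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3

# stand-in for the network output (in practice this comes from the conv net)
raw_output = np.random.rand(grid, grid, num_anchors * num_values)

# view it as one 8-vector per (grid cell, anchor box) pair
volume = raw_output.reshape(grid, grid, num_anchors, num_values)

row, col, anchor = 1, 2, 0
p_c = volume[row, col, anchor, 0]        # objectness / confidence
box = volume[row, col, anchor, 1:5]      # b_x, b_y, b_h, b_w
probs = volume[row, col, anchor, 5:]     # class scores c_1, c_2, c_3
&lt;/code>&lt;/pre>&lt;/div>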
&lt;h3 id="making-predictions">Making predictions&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-05%2016.13.12.png" alt="截屏2020-11-05 16.13.12">&lt;/p>
&lt;h3 id="outputing-the-non-max-supressed-outputs">Outputing the non-max supressed outputs&lt;/h3>
&lt;p>Let&amp;rsquo;s look at a new input image,&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-05%2016.17.32-20201105162152696.png" alt="截屏2020-11-05 16.17.32" style="zoom:67%;" />
&lt;p>and suppose that we still use 2 pre-defined anchor boxes for detecting pedestrians, cars, and motorcycles.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>For each grid cell, get 2 predicted bounding boxes. Notice that some of the bounding boxes can go outside the height and width of the grid cell that they came from&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-05%2016.17.41.png" alt="截屏2020-11-05 16.17.41" style="zoom:67%;" />
&lt;/li>
&lt;li>
&lt;p>Get rid of low probability predictions&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-05%2016.18.48.png" alt="截屏2020-11-05 16.18.48" style="zoom: 67%;" />
&lt;/li>
&lt;li>
&lt;p>For each class, run non-max suppression to generate the final predictions (a combined sketch of steps 2 and 3 follows this list). The output should then hopefully contain all the cars and all the pedestrians in this image, each detected exactly once.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-05%2016.24.29-20201105162519547.png" alt="截屏2020-11-05 16.24.29" style="zoom:67%;" />
&lt;/li>
&lt;/ol>
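&lt;p>A minimal sketch of this filtering + per-class suppression step, assuming a &lt;code>non_max_suppression&lt;/code> helper such as the one sketched earlier and a hypothetical score threshold of 0.6:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">import numpy as np


def filter_and_suppress(boxes, scores, class_ids, score_threshold=0.6, iou_threshold=0.5):
    """Drop low-confidence boxes, then run NMS independently for each class."""
    confident = scores &amp;gt;= score_threshold          # step 2: get rid of low-probability predictions
    boxes, scores, class_ids = boxes[confident], scores[confident], class_ids[confident]

    final = []
    for c in np.unique(class_ids):                 # step 3: per-class non-max suppression
        idx = np.where(class_ids == c)[0]
        keep = non_max_suppression(boxes[idx], scores[idx], iou_threshold)  # helper from the NMS sketch above
        final.extend(int(idx[k]) for k in keep)
    return boxes[final], scores[final], class_ids[final]
&lt;/code>&lt;/pre>&lt;/div>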
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="https://www.coursera.org/learn/convolutional-neural-networks">Convolutional Neural Network, &lt;em>Andrew Ng&lt;/em>&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://www.jeremyjordan.me/object-detection-one-stage/">An overview of object detection: one-stage methods.&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul></description></item><item><title>YOLOv4: Run Pretrained YOLOv4 on COCO Dataset</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/train-yolo-v4/</link><pubDate>Wed, 04 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/train-yolo-v4/</guid><description>&lt;p>Here we will learn how to get YOLOv4 Object Detection running in the Cloud with Google Colab step by step.&lt;/p>
&lt;p>Check out the &lt;a href="https://colab.research.google.com/drive/1o-xfVm7A-kgtFZRrehJvnibuBwzNPs1-?authuser=1#scrollTo=P5WqSvgwqmLT">Google Colab Notebook&lt;/a>&lt;/p>
&lt;h2 id="clone-and-build-darknet">Clone and build DarkNet&lt;/h2>
&lt;p>Clone darknet from AlexeyAB&amp;rsquo;s &lt;a href="https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects">repository&lt;/a>,&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">!git clone https://github.com/AlexeyAB/darknet
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Adjust the Makefile to enable OPENCV and GPU for darknet&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># change makefile to have GPU and OPENCV enabled&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">%cd darknet
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">!sed -i &lt;span class="s1">&amp;#39;s/OPENCV=0/OPENCV=1/&amp;#39;&lt;/span> Makefile
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">!sed -i &lt;span class="s1">&amp;#39;s/GPU=0/GPU=1/&amp;#39;&lt;/span> Makefile
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">!sed -i &lt;span class="s1">&amp;#39;s/CUDNN=0/CUDNN=1/&amp;#39;&lt;/span> Makefile
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">!sed -i &lt;span class="s1">&amp;#39;s/CUDNN_HALF=0/CUDNN_HALF=1/&amp;#39;&lt;/span> Makefile
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Verify CUDA&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># verify CUDA&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">!/usr/local/cuda/bin/nvcc --version
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Build darknet&lt;/p>
&lt;blockquote>
&lt;p>Note: Do not worry about any warnings when running the &lt;code>!make&lt;/code> cell!&lt;/p>
&lt;/blockquote>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># make darknet &lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># (builds darknet so that you can then use the darknet executable file &lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># to run or train object detectors)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">!make
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="download-pretrained-yolo-v4-weights">Download pretrained YOLO v4 weights&lt;/h2>
&lt;p>YOLOv4 has already been trained on the COCO dataset, which has 80 classes that it can predict. We will grab these pretrained weights so that we can run YOLOv4 on these classes and get detections.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">!wget https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v3_optimal/yolov4.weights
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="define-helper-functions">Define helper functions&lt;/h2>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">cv2&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">matplotlib.pyplot&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">plt&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">%&lt;/span>&lt;span class="n">matplotlib&lt;/span> &lt;span class="n">inline&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">imShow&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> Show image
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">image&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">cv2&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imread&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">height&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">width&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">image&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">[:&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">resized_image&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">cv2&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">resize&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">3&lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="n">width&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">3&lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="n">height&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">interpolation&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">cv2&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">INTER_CUBIC&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">fig&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">gcf&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">fig&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_size_inches&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">18&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">10&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;off&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imshow&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">cv2&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">cvtColor&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">resized_image&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">cv2&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">COLOR_BGR2RGB&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">show&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">upload&lt;/span>&lt;span class="p">():&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> upload files to Google Colab
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="kn">from&lt;/span> &lt;span class="nn">google.colab&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">files&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">uploaded&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">files&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">upload&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">name&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">data&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">uploaded&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">items&lt;/span>&lt;span class="p">():&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">with&lt;/span> &lt;span class="nb">open&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;wb&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">f&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">write&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">data&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;saved file &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">download&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> Download from Google Colab
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="kn">from&lt;/span> &lt;span class="nn">google.colab&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">files&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">files&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">download&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="run-detections-with-darknet-and-yolov4">Run detections with Darknet and YOLOv4&lt;/h2>
&lt;p>The object detector can be run using the following command&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">!./darknet detector &lt;span class="nb">test&lt;/span> &amp;lt;path to .data file&amp;gt; &amp;lt;path to config&amp;gt; &amp;lt;path to weights&amp;gt; &amp;lt;path to image&amp;gt;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This will output the image with the detections shown. The most recent detections are always saved to &amp;lsquo;&lt;strong>predictions.jpg&lt;/strong>&amp;rsquo;&lt;/p>
&lt;p>&lt;strong>Note:&lt;/strong> After running detections OpenCV can&amp;rsquo;t open the image instantly in the cloud so we must run:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">imShow&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;predictions.jpg&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Darknet comes with a few images already installed in the &lt;code>darknet/data/&lt;/code> folder. Let&amp;rsquo;s test one of the images inside:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># run darknet detection on test images&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">!./darknet detector &lt;span class="nb">test&lt;/span> cfg/coco.data cfg/yolov4.cfg yolov4.weights data/person.jpg
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">imShow&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;predictions.jpg&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/predictions.png" alt="predictions">&lt;/p>
&lt;h3 id="run-detections-using-uploaded-image">Run detections using uploaded image&lt;/h3>
&lt;p>We can also mount Google Drive into the cloud VM:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">google.colab&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">drive&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">drive&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">mount&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/gdrive&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># this creates a symbolic link &lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># so that now the path /content/gdrive/My\ Drive/ is equal to /mydrive&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">!ln -s /content/gdrive/My&lt;span class="se">\ &lt;/span>Drive/ /mydrive
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">!ls /mydrive
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>and then run YOLOv4 with images from Google Drive using the following command:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">!./darknet detector &lt;span class="nb">test&lt;/span> cfg/coco.data cfg/yolov4.cfg yolov4.weights /mydrive/&amp;lt;path to image&amp;gt;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>For example, I uploaded an image called &amp;ldquo;pedestrian.jpg&amp;rdquo; in &lt;code>images/&lt;/code> folder:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/pedestrian.jpg" alt="pedestrian">&lt;/p>
&lt;p>and run detection on it:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="err">!&lt;/span>&lt;span class="o">./&lt;/span>&lt;span class="n">darknet&lt;/span> &lt;span class="n">detector&lt;/span> &lt;span class="n">test&lt;/span> &lt;span class="n">cfg&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">coco&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">data&lt;/span> &lt;span class="n">cfg&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">yolov4&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">cfg&lt;/span> &lt;span class="n">yolov4&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">weights&lt;/span> &lt;span class="o">/&lt;/span>&lt;span class="n">mydrive&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">images&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">pedestrian&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">jpg&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">imShow&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;predictions.jpg&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/pedestrian_predictions.png" alt="pedestrian_predictions">&lt;/p>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>YOLOv4 in the CLOUD: Install and Run Object Detector (FREE GPU)&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="https://colab.research.google.com/drive/1_GdoqCJWXsChrOiY8sZMr_zbr_fH-0Fg?usp=sharing#scrollTo=iZULaGX7_H1u">Google Colab Notebook&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://github.com/theAIGuysCode/YOLOv4-Cloud-Tutorial">https://github.com/theAIGuysCode/YOLOv4-Cloud-Tutorial&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Video Tutorial&lt;/p>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/mKAEGSxwOAY?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>YOLOv4: Train on Custom Dataset</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/train-yolo-v4-custom-dataset/</link><pubDate>Wed, 04 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/train-yolo-v4-custom-dataset/</guid><description>&lt;h2 id="clone-and-build-darknet">Clone and build Darknet&lt;/h2>
&lt;p>Clone darknet repo&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">git clone https://github.com/AlexeyAB/darknet
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Change makefile to have GPU and OPENCV enabled&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="nb">cd&lt;/span> darknet
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">sed -i &lt;span class="s1">&amp;#39;s/OPENCV=0/OPENCV=1/&amp;#39;&lt;/span> Makefile
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">sed -i &lt;span class="s1">&amp;#39;s/GPU=0/GPU=1/&amp;#39;&lt;/span> Makefile
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">sed -i &lt;span class="s1">&amp;#39;s/CUDNN=0/CUDNN=1/&amp;#39;&lt;/span> Makefile
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">sed -i &lt;span class="s1">&amp;#39;s/CUDNN_HALF=0/CUDNN_HALF=1/&amp;#39;&lt;/span> Makefile
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Verify CUDA&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">/usr/local/cuda/bin/nvcc --version
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="compile-on-linux-using-make">Compile on Linux using &lt;code>make&lt;/code>&lt;/h2>
&lt;p>Make darknet&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">make
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>&lt;code>GPU=1&lt;/code> : build with CUDA to accelerate by using GPU&lt;/li>
&lt;li>&lt;code>CUDNN=1&lt;/code> : build with cuDNN v5-v7 to accelerate training by using GPU&lt;/li>
&lt;li>&lt;code>CUDNN_HALF=1&lt;/code> to build for Tensor Cores (on Titan V / Tesla V100 / DGX-2 and later) speedup Detection 3x, Training 2x&lt;/li>
&lt;li>&lt;code>OPENCV=1&lt;/code> to build with OpenCV 4.x/3.x/2.4.x - allows detection on video files and video streams from network cameras or web-cams&lt;/li>
&lt;li>&lt;code>DEBUG=1&lt;/code> to build a debug version of YOLO&lt;/li>
&lt;li>&lt;code>OPENMP=1&lt;/code> to build with OpenMP support to accelerate Yolo by using multi-core CPU&lt;/li>
&lt;/ul>
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">Do not worry about any warnings when running &lt;code>make&lt;/code> command.&lt;/span>
&lt;/div>
&lt;h2 id="prepare-custom-dataset">Prepare custom dataset&lt;/h2>
&lt;p>The custom dataset should be in &lt;strong>YOLOv4&lt;/strong> or &lt;strong>darknet&lt;/strong> format:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>For each &lt;code>.jpg&lt;/code> image file, there should be a corresponding &lt;code>.txt&lt;/code> file&lt;/p>
&lt;ul>
&lt;li>
&lt;p>In the same directory, with the same name, but with &lt;code>.txt&lt;/code>-extension&lt;/p>
&lt;p>For example, if there&amp;rsquo;s a &lt;code>.jpg&lt;/code> image named &lt;code>BloodImage_00001.jpg&lt;/code>, there should also be a corresponding &lt;code>.txt&lt;/code> file named &lt;code>BloodImage_00001.txt&lt;/code>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>This &lt;code>.txt&lt;/code> file contains the object class number and the object coordinates, one line per object in the image (a small conversion sketch follows this list).&lt;/p>
&lt;p>Format:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">&amp;lt;object-class&amp;gt; &amp;lt;x_center&amp;gt; &amp;lt;y_center&amp;gt; &amp;lt;width&amp;gt; &amp;lt;height&amp;gt;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>&lt;code>&amp;lt;object-class&amp;gt;&lt;/code> : integer object number from &lt;code>0&lt;/code> to &lt;code>(classes-1)&lt;/code>&lt;/li>
&lt;li>&lt;code>&amp;lt;x_center&amp;gt; &amp;lt;y_center&amp;gt; &amp;lt;width&amp;gt; &amp;lt;height&amp;gt;&lt;/code> : float values &lt;strong>relative&lt;/strong> to the width and height of the image, in the range &lt;code>(0.0, 1.0]&lt;/code>
&lt;ul>
&lt;li>&lt;code>&amp;lt;x_center&amp;gt; &amp;lt;y_center&amp;gt;&lt;/code> are the center of the rectangle (not the top-left corner)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
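&lt;p>For illustration, a short sketch that builds one such label line from an absolute pixel box. The input box format (top-left corner plus width and height in pixels) and the function name are assumptions about your raw annotations, not part of the darknet format itself:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">def to_darknet_line(class_id, x_left, y_top, box_w, box_h, img_w, img_h):
    """Build one 'class x_center y_center width height' label line (all values relative)."""
    x_center = (x_left + box_w / 2) / img_w
    y_center = (y_top + box_h / 2) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {box_w / img_w:.6f} {box_h / img_h:.6f}"


# e.g. a class-0 object at (320, 180) with size 128x96 pixels in a 1280x720 image
print(to_darknet_line(0, 320, 180, 128, 96, 1280, 720))
# 0 0.300000 0.316667 0.100000 0.133333
&lt;/code>&lt;/pre>&lt;/div>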
&lt;h2 id="configure-files-for-training">Configure files for training&lt;/h2>
&lt;ol start="0">
&lt;li>
&lt;p>For training &lt;code>cfg/yolov4-custom.cfg&lt;/code> download the pre-trained weights-file &lt;a href="https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v3_optimal/yolov4.conv.137">yolov4.conv.137&lt;/a>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="nb">cd&lt;/span> darknet
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">wget https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v3_optimal/yolov4.conv.137
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>
&lt;p>In the folder &lt;code>./cfg&lt;/code>, create a custom config file (let&amp;rsquo;s call it &lt;code>custom-yolov4-detector.cfg&lt;/code>) with the same content as &lt;code>yolov4-custom.cfg&lt;/code>, and&lt;/p>
&lt;ul>
&lt;li>
&lt;p>change line &lt;strong>batch&lt;/strong> to &lt;a href="https://github.com/AlexeyAB/darknet/blob/0039fd26786ab5f71d5af725fc18b3f521e7acfd/cfg/yolov3.cfg#L3">&lt;code>batch=64&lt;/code>&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>change line &lt;strong>subdivisions&lt;/strong> to &lt;a href="https://github.com/AlexeyAB/darknet/blob/0039fd26786ab5f71d5af725fc18b3f521e7acfd/cfg/yolov3.cfg#L4">&lt;code>subdivisions=16&lt;/code>&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>change line &lt;strong>max_batches&lt;/strong> to &lt;code>classes*2000&lt;/code> but&lt;/p>
&lt;ul>
&lt;li>NOT less than number of training images&lt;/li>
&lt;li>NOT less than 6000&lt;/li>
&lt;/ul>
&lt;p>&lt;em>e.g. &lt;code>max_batches=6000&lt;/code> if you train for 3 classes (a small sketch that computes these numbers appears after this numbered list)&lt;/em>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>change line &lt;strong>steps&lt;/strong> to 80% and 90% of &lt;strong>max_batches&lt;/strong> (&lt;em>e.g. &lt;code>steps=4800, 5400&lt;/code>&lt;/em>)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>set network size &lt;code>width=416 height=416&lt;/code> or any value multiple of 32&lt;/p>
&lt;/li>
&lt;li>
&lt;p>change line &lt;code>classes=80&lt;/code> to your number of object classes in &lt;strong>each&lt;/strong> of the 3 &lt;code>[yolo]&lt;/code> layers&lt;/p>
&lt;/li>
&lt;li>
&lt;p>change [&lt;code>filters=255&lt;/code>] to $ \text{filters}=(\text{classes} + 5) \times 3$ in the 3 &lt;code>[convolutional]&lt;/code> before each &lt;code>[yolo]&lt;/code> layer, keep in mind that it only has to be the last &lt;code>[convolutional]&lt;/code> before each of the &lt;code>[yolo]&lt;/code> layers.&lt;/p>
&lt;blockquote>
&lt;p>Note: &lt;strong>Do not write in the cfg-file: &lt;code>filters=(classes + 5) x 3&lt;/code>&lt;/strong>!!!&lt;/p>
&lt;p>It has to be the specific number!&lt;/p>
&lt;p>E.g. &lt;code>classes=1&lt;/code> then should be &lt;code>filters=18&lt;/code>; &lt;code>classes=2&lt;/code> then should be &lt;code>filters=21&lt;/code>&lt;/p>
&lt;p>So for example, for 2 objects, your custom config file should differ from &lt;code>yolov4-custom.cfg&lt;/code> in such lines in &lt;strong>each&lt;/strong> of &lt;strong>3&lt;/strong> [yolo]-layers:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">[convolutional]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">filters=21
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">[region]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">classes=2
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/blockquote>
&lt;/li>
&lt;li>
&lt;p>when using &lt;a href="https://github.com/AlexeyAB/darknet/blob/6e5bdf1282ad6b06ed0e962c3f5be67cf63d96dc/cfg/Gaussian_yolov3_BDD.cfg#L608">&lt;code>[Gaussian_yolo]&lt;/code>&lt;/a> layers, change [&lt;code>filters=57&lt;/code>] to $ \text{filters}=(\text{classes} + 9) \times 3$ in the 3 &lt;code>[convolutional]&lt;/code> before each &lt;code>[Gaussian_yolo]&lt;/code> layer&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Create file &lt;code>obj.names&lt;/code> in the directory &lt;code>data/&lt;/code>, with the object names - each on a new line&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Create file &lt;code>obj.data&lt;/code> in the directory &lt;code>data/&lt;/code>, containing (where &lt;strong>classes = number of object classes&lt;/strong>):&lt;/p>
&lt;p>For example, if we have two object classes:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">classes = 2
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">train = data/train.txt
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">valid = data/test.txt
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">names = data/obj.names
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">backup = backup/
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>
&lt;p>Put image files (&lt;code>.jpg&lt;/code>) of your objects in the directory &lt;code>data/obj/&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Create &lt;code>train.txt&lt;/code> in the directory &lt;code>data/&lt;/code> with the filenames of your images, each filename on a new line, with paths relative to &lt;code>darknet&lt;/code>.&lt;/p>
&lt;p>For example containing:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">data/obj/img1.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">data/obj/img2.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">data/obj/img3.jpg
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>
&lt;p>Download pre-trained weights for the convolutional layers and put to the directory &lt;code>darknet&lt;/code> (root directory of the project)&lt;/p>
&lt;ul>
&lt;li>for &lt;code>yolov4.cfg&lt;/code>, &lt;code>yolov4-custom.cfg&lt;/code> (162 MB): &lt;a href="https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v3_optimal/yolov4.conv.137">yolov4.conv.137&lt;/a>&lt;/li>
&lt;li>for &lt;code>yolov4-tiny.cfg&lt;/code>, &lt;code>yolov4-tiny-3l.cfg&lt;/code>, &lt;code>yolov4-tiny-custom.cfg&lt;/code>(19 MB): &lt;a href="https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v4_pre/yolov4-tiny.conv.29">yolov4-tiny.conv.29&lt;/a>&lt;/li>
&lt;li>for &lt;code>csresnext50-panet-spp.cfg&lt;/code> (133 MB): &lt;a href="https://drive.google.com/file/d/16yMYCLQTY_oDlCIZPfn_sab6KD3zgzGq/view?usp=sharing">csresnext50-panet-spp.conv.112&lt;/a>&lt;/li>
&lt;li>for &lt;code>yolov3.cfg, yolov3-spp.cfg&lt;/code> (154 MB): &lt;a href="https://pjreddie.com/media/files/darknet53.conv.74">darknet53.conv.74&lt;/a>&lt;/li>
&lt;li>for &lt;code>yolov3-tiny-prn.cfg , yolov3-tiny.cfg&lt;/code> (6 MB): &lt;a href="https://drive.google.com/file/d/18v36esoXCh-PsOKwyP2GWrpYDptDY8Zf/view?usp=sharing">yolov3-tiny.conv.11&lt;/a>&lt;/li>
&lt;li>for &lt;code>enet-coco.cfg (EfficientNetB0-Yolov3)&lt;/code> (14 MB): &lt;a href="https://drive.google.com/file/d/1uhh3D6RSn0ekgmsaTcl-ZW53WBaUDo6j/view?usp=sharing">enetb0-coco.conv.132&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
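&lt;p>The arithmetic above (step 1) is easy to get wrong by hand, so here is a small sketch that computes the numbers for a custom config; the variable names and example values are placeholders:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">num_classes = 3          # e.g. 3 object classes
num_train_images = 1500  # size of your training set

max_batches = max(num_classes * 2000, num_train_images, 6000)
steps = (int(0.8 * max_batches), int(0.9 * max_batches))
filters = (num_classes + 5) * 3   # for the last [convolutional] before each [yolo] layer

print(f"max_batches={max_batches}")    # max_batches=6000
print(f"steps={steps[0]},{steps[1]}")  # steps=4800,5400
print(f"filters={filters}")            # filters=24
&lt;/code>&lt;/pre>&lt;/div>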
&lt;h2 id="start-training">Start training&lt;/h2>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">./darknet detector train data/obj.data custom-yolov4-detector.cfg yolov4.conv.137 -dont_show
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>
&lt;p>the file &lt;code>yolo-obj_last.weights&lt;/code> will be saved to the &lt;code>backup/&lt;/code> directory every 100 iterations&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>-dont_show&lt;/code>: disables the loss window; useful if you train on a computer without a monitor (e.g. a remote server)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>To see the mAP &amp;amp; loss chart during training on a remote server:&lt;/p>
&lt;ul>
&lt;li>use command &lt;code>./darknet detector train data/obj.data yolo-obj.cfg yolov4.conv.137 -dont_show -mjpeg_port 8090 -map&lt;/code>&lt;/li>
&lt;li>then open the URL &lt;code>http://ip-address:8090&lt;/code> in a Chrome/Firefox browser&lt;/li>
&lt;/ul>
&lt;p>After training is complete, you can get weights from &lt;code>backup/&lt;/code>&lt;/p>
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">&lt;p>If you want the training to output only main information (e.g loss, mAP, remaining training time) instead of full logging, you can use this command&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">./darknet detector train data/obj.data custom-yolov4-detector.cfg yolov4.conv.137 -dont_show -map 2&amp;gt;&lt;span class="p">&amp;amp;&lt;/span>&lt;span class="m">1&lt;/span> &lt;span class="p">|&lt;/span> tee log/train.log &lt;span class="p">|&lt;/span> grep -E &lt;span class="s2">&amp;#34;hours left|mean_average&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Then the output will look like the following:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl"> 1189: 1.874030, 2.934438 avg loss, 0.002610 rate, 2.930427 seconds, 76096 images, 3.905244 hours left
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/span>
&lt;/div>
&lt;h3 id="notes">Notes&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>If during training you see &lt;code>nan&lt;/code> values in the &lt;code>avg&lt;/code> (loss) field, then training is going wrong! 🤦‍♂️&lt;/p>
&lt;p>But if &lt;code>nan&lt;/code> appears in some other lines, training is going well.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If an &lt;code>Out of memory&lt;/code> error occurs, increase &lt;code>subdivisions&lt;/code> in the &lt;code>.cfg&lt;/code> file to 16, 32 or 64&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="train-tiny-yolo">Train tiny-YOLO&lt;/h2>
&lt;p>Do all the same steps as for the full YOLO model described above, with the following exceptions:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Download file with the first 29-convolutional layers of yolov4-tiny:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">wget https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v4_pre/yolov4-tiny.conv.29
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>(Or get this file from yolov4-tiny.weights file by using command: &lt;code>./darknet partial cfg/yolov4-tiny-custom.cfg yolov4-tiny.weights yolov4-tiny.conv.29 29&lt;/code>)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Make your custom model &lt;code>yolov4-tiny-obj.cfg&lt;/code> based on &lt;code>cfg/yolov4-tiny-custom.cfg&lt;/code> instead of &lt;code>yolov4.cfg&lt;/code>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">re&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># num_classes: number of object classes&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">max_batches&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">max&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">num_classes&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="mi">2000&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">num_train_images&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">6000&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">steps1&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mf">.8&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">max_batches&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">steps2&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mf">.9&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">max_batches&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">num_filters&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">num_classes&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="mi">5&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="mi">3&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Assuming that we have already defined the following hyperparameters:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># - TINY_CONFIG_FILE: config file we&amp;#39;re gonna use for training&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># - WIDTH, HEIGHT: width and height of image&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">with&lt;/span> &lt;span class="nb">open&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;cfg/yolov4-tiny-custom.cfg&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;r&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">reader&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nb">open&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">TINY_CONFIG_FILE&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;w&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">writer&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">content&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">reader&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">read&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">content&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">re&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sub&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;subdivisions=\d*&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;subdivisions=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">SUBDIVISION&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">content&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">content&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">re&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sub&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;width=\d*&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;width=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">WIDTH&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">content&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">content&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">re&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sub&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;height=\d*&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;height=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">HEIGHT&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">content&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">content&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">re&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sub&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;max_batches = \d*&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;max_batches = &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">max_batches&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">content&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">content&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">re&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sub&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;steps=\d*,\d*&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;steps=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">steps1&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">,&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">steps2&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">content&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">content&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">re&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sub&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;classes=\d*&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;classes=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">num_classes&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">content&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">content&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">re&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sub&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;pad=1&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="s2">filters=\d*&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;pad=1&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="s2">filters=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">num_filters&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">content&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">writer&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">write&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">content&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>
&lt;p>Start training:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">./darknet detector train data/obj.data yolov4-tiny-obj.cfg yolov4-tiny.conv.29
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;/ul>
&lt;h2 id="google-colab-notebook">Google Colab Notebook&lt;/h2>
&lt;p>&lt;a href="https://colab.research.google.com/drive/1aIc5xS8vVukVg-FiUA3aw0PUqYrXs8aO?authuser=1#scrollTo=Zz8v67_2kgWh">Colab Notebook&lt;/a>&lt;/p>
&lt;h3 id="small-hacks-to-keep-colab-notebook-training">Small hacks to keep colab notebook training&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>Open up the inspector view on Chrome&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Switch to the console window&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Paste the following code&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-javascript" data-lang="javascript">&lt;span class="line">&lt;span class="cl">&lt;span class="kd">function&lt;/span> &lt;span class="nx">ClickConnect&lt;/span>&lt;span class="p">(){&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nx">console&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="nx">log&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Working&amp;#34;&lt;/span>&lt;span class="p">);&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">document&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">.&lt;/span>&lt;span class="nx">querySelector&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;#top-toolbar &amp;gt; colab-connect-button&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">.&lt;/span>&lt;span class="nx">shadowRoot&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="nx">querySelector&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;#connect&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">.&lt;/span>&lt;span class="nx">click&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nx">setInterval&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nx">ClickConnect&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">60000&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>and hit &lt;strong>Enter&lt;/strong>.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>It will click the connect button every 60 seconds (the &lt;code>setInterval&lt;/code> value above) so that you don&amp;rsquo;t get kicked off for being idle!&lt;/p>
&lt;h2 id="convert-yolov4-to-tensorrt-through-onnx">Convert YOLOv4 to TensorRT through ONNX&lt;/h2>
&lt;p>To convert YOLOv4 to TensorRT engine through ONNX, I used the code from &lt;a href="https://github.com/jkjung-avt/tensorrt_demos">TensorRT_demos&lt;/a> following its &lt;a href="https://github.com/jkjung-avt/tensorrt_demos#demo-5-yolov4">step-by-step instructions&lt;/a>. For more details about the code, check out this &lt;a href="https://jkjung-avt.github.io/tensorrt-yolov4/">blog post&lt;/a>.&lt;/p>
&lt;p>Note that the Code in this repo was designed to run on &lt;a href="https://developer.nvidia.com/embedded-computing">Jetson platforms&lt;/a>. In my case, conversion from YOLOv4 to TensorRT engine was conducted on Jetson Nano.&lt;/p>
&lt;h3 id="convert-yolov4-for-custom-trained-models">Convert YOLOv4 for custom trained models&lt;/h3>
&lt;p>To apply the conversion for custom trained models, see &lt;a href="https://jkjung-avt.github.io/trt-yolov3-custom/">TensorRT YOLOv3 For Custom Trained Models&lt;/a>. You need to stick to the naming convention &lt;code>{yolo_version}-{custom_name}-{image_size}&lt;/code>. Otherwise you&amp;rsquo;ll get errors during conversion.&lt;/p>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Guide from &lt;a href="https://github.com/AlexeyAB">AlexeyAB&lt;/a>/&lt;strong>&lt;a href="https://github.com/AlexeyAB/darknet">darknet&lt;/a>&lt;/strong> repo: &lt;a href="https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects">How to train (to detect your custom objects)&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Tutorials&lt;/p>
&lt;ul>
&lt;li>
&lt;p>👨‍🏫 How to Train YOLOv4 on a Custom Dataset in Darknet&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="https://colab.research.google.com/drive/1mzL6WyY9BRx4xX476eQdhKDnd_eixBlG?authuser=0#scrollTo=QyMBDkaL-Aep">Colab Notebook&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Blog post: &lt;a href="https://blog.roboflow.com/training-yolov4-on-a-custom-dataset/">https://blog.roboflow.com/training-yolov4-on-a-custom-dataset/&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Video tutorial:&lt;/p>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/N-GS8cmDPog?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://blog.roboflow.com/yolov4-tactics/">YOLOv4 - Ten Tactics to Build a Better Model&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Train YOLOv4-tiny on custom dataset: &lt;a href="https://blog.roboflow.com/train-yolov4-tiny-on-custom-data-lighting-fast-detection/">Train YOLOv4-tiny on Custom Data - Lightning Fast Object Detection&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>YOLOv4 in the CLOUD: Build and Train Custom Object Detector (FREE GPU)&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="https://colab.research.google.com/drive/1_GdoqCJWXsChrOiY8sZMr_zbr_fH-0Fg#scrollTo=O2w9w1Ye_nk1">Colab Notebook&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Video tutorial:&lt;/p>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/mmj3nxGT2YQ?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://jkjung-avt.github.io/colab-yolov4/">Custom YOLOv4 Model on Google Colab&lt;/a>&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://colab.research.google.com/drive/1eoa2_v6wVlcJiDBh3Tb_umhm7a09lpIE?usp=sharing#scrollTo=J1oTF_YRoGSZ">Colab Notebook&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://jkjung-avt.github.io/tensorrt-yolov4/">TensorRT YOLOv4&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://jkjung-avt.github.io/yolov4/">YOLOv4 on Jetson Nano&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Annotation Conversion: COCO JSON to YOLO Txt</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/coco-json-to-yolo-txt/</link><pubDate>Wed, 02 Dec 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/coco-json-to-yolo-txt/</guid><description>&lt;h2 id="bounding-box-formats-comparison-and-conversion">Bounding box formats comparison and conversion&lt;/h2>
&lt;p>In COCO Json, the format of bounding box is:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-json" data-lang="json">&lt;span class="line">&lt;span class="cl">&lt;span class="s2">&amp;#34;bbox&amp;#34;&lt;/span>&lt;span class="err">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="err">&amp;lt;absolute_x_top_left&amp;gt;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="err">&amp;lt;absolute_y_top_left&amp;gt;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="err">&amp;lt;absolute_width&amp;gt;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="err">&amp;lt;absolute_height&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>However, the annotation is different in YOLO. For each &lt;code>.jpg&lt;/code> image, there&amp;rsquo;s a &lt;code>.txt&lt;/code> file (in the same directory and with the same name, but with &lt;code>.txt&lt;/code>-extension). This &lt;code>.txt&lt;/code> file holds the objects and their bounding boxes in this image (one line for each object), in the following format &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">&amp;lt;object-class&amp;gt; &amp;lt;relative_x_center&amp;gt; &amp;lt;relative_y_center&amp;gt; &amp;lt;relative_width&amp;gt; &amp;lt;relative_height&amp;gt;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>
&lt;p>&lt;code>&amp;lt;object-class&amp;gt;&lt;/code> : integer number of object from &lt;strong>&lt;code>0&lt;/code> to &lt;code>(classes-1)&lt;/code>&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>&amp;lt;relative_x_center&amp;gt; &amp;lt;relative_y_center&amp;gt; &amp;lt;relative_width&amp;gt; &amp;lt;relative_height&amp;gt;&lt;/code>&lt;/p>
&lt;p>float values relative to width and height of image (equal from (0.0 to 1.0])&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>For example, for &lt;code>img1.jpg&lt;/code> there should be &lt;code>img1.txt&lt;/code> containing something looks like followings:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">1 0.716797 0.395833 0.216406 0.147222
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">0 0.687109 0.379167 0.255469 0.158333
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">2 0.420312 0.395833 0.140625 0.166667
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The following figure illustrates the difference of bounding box annotation between COCO and YOLO:&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/annotation-convertion-COCO-and-YOLO.png">&lt;figcaption>
&lt;h4>Bounding box format: COCO vs YOLO&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>Convert the bounding box annotation format from COCO to YOLO:
&lt;/p>
$$
\begin{array}{ll}
x\_{yolo} &amp;= (x\_{coco} + \frac{w\_{coco}}{2}) / w\_{img} \\\\
y\_{yolo} &amp;= (y\_{coco} + \frac{h\_{coco}}{2}) / h\_{img} \\\\
w\_{yolo} &amp;= w\_{coco} / w\_{img} \\\\
h\_{yolo} &amp;= h\_{coco} / h\_{img}
\end{array}
$$
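&lt;p&gt;For example, for the 490 x 275 image &lt;code&gt;0001.jpg&lt;/code&gt; used later on this page, the COCO box &lt;code&gt;[45, 2, 85, 85]&lt;/code&gt; gives &lt;code&gt;x_yolo = (45 + 85/2) / 490 ≈ 0.178571&lt;/code&gt;, &lt;code&gt;y_yolo = (2 + 85/2) / 275 ≈ 0.161818&lt;/code&gt;, &lt;code&gt;w_yolo = 85 / 490 ≈ 0.173469&lt;/code&gt; and &lt;code&gt;h_yolo = 85 / 275 ≈ 0.309091&lt;/code&gt; - exactly the values in the first line of &lt;code&gt;0001.txt&lt;/code&gt; shown below.&lt;/p&gt;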
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">convert_bbox_coco2yolo&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">img_width&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">img_height&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">bbox&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> Convert bounding box from COCO format to YOLO format
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> Parameters
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> ----------
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> img_width : int
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> width of image
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> img_height : int
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> height of image
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> bbox : list[int]
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> bounding box annotation in COCO format:
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> [top left x position, top left y position, width, height]
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> Returns
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> -------
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> list[float]
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> bounding box annotation in YOLO format:
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> [x_center_rel, y_center_rel, width_rel, height_rel]
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># YOLO bounding box format: [x_center, y_center, width, height]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># (float values relative to width and height of image)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x_tl&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y_tl&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">w&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">h&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">bbox&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dw&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mf">1.0&lt;/span> &lt;span class="o">/&lt;/span> &lt;span class="n">img_width&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dh&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mf">1.0&lt;/span> &lt;span class="o">/&lt;/span> &lt;span class="n">img_height&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x_center&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">x_tl&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">w&lt;/span> &lt;span class="o">/&lt;/span> &lt;span class="mf">2.0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">y_center&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">y_tl&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">h&lt;/span> &lt;span class="o">/&lt;/span> &lt;span class="mf">2.0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">x_center&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">dw&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">y&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">y_center&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">dh&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">w&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">w&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">dw&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">h&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">h&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">dh&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">w&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">h&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="convert-coco-json-to-yolo-txt">Convert COCO JSON to YOLO txt&lt;/h2>
&lt;p>The structure of training set in COCO format is:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">- train
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- _annotations.coco.json
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- img_001.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- img_002.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- img_003.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ...
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>_annotations.coco.json&lt;/code> contains all information about the dataset, images, and annotations. (More see: &lt;a href="https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/coco-dataset-format/">COCO JSON Format for Object Detection&lt;/a>)&lt;/p>
&lt;p>The structure of training set in YOLO format is:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">- train
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- _darknet.labels
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- img_001.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- img_001.txt
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- img_002.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- img_002.txt
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- img_003.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- img_003.txt
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ...
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>&lt;code>_darknet.labels&lt;/code> contains objects names, each in new line&lt;/li>
&lt;li>For each &lt;code>.jpg&lt;/code> image there&amp;rsquo;s a corresponding &lt;code>.txt&lt;/code> file with the same name&lt;/li>
&lt;/ul>
&lt;p>Now we create &lt;code>.txt&lt;/code> file for each image based on &lt;code>_annotations.coco.json&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">os&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">json&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">tqdm&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">tqdm&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">shutil&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">make_folders&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;output&amp;#34;&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">exists&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">shutil&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">rmtree&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">makedirs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">path&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">convert_coco_json_to_yolo_txt&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">output_path&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">json_file&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">make_folders&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">output_path&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">with&lt;/span> &lt;span class="nb">open&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">json_file&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">json_data&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">json&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">load&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">f&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># write _darknet.labels, which holds names of all classes (one class per line)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">label_file&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">output_path&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;_darknet.labels&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">with&lt;/span> &lt;span class="nb">open&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">label_file&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;w&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">category&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">tqdm&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">json_data&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;categories&amp;#34;&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">desc&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;Categories&amp;#34;&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">category_name&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">category&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">f&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">write&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">category_name&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">image&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">tqdm&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">json_data&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;images&amp;#34;&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">desc&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;Annotation txt for each iamge&amp;#34;&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">img_id&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">image&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">img_name&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">image&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;file_name&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">img_width&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">image&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;width&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">img_height&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">image&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;height&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">anno_in_image&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="n">anno&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">anno&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">json_data&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;annotations&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="k">if&lt;/span> &lt;span class="n">anno&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;image_id&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="n">img_id&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">anno_txt&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">output_path&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">img_name&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">split&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;.&amp;#34;&lt;/span>&lt;span class="p">)[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="s2">&amp;#34;.txt&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">with&lt;/span> &lt;span class="nb">open&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">anno_txt&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;w&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">anno&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">anno_in_image&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">category&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">anno&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;category_id&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">bbox_COCO&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">anno&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;bbox&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">w&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">h&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">convert_bbox_coco2yolo&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">img_width&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">img_height&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">bbox_COCO&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">f&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">write&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">category&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">.6f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">.6f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">w&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">.6f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">h&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">.6f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Converting COCO Json to YOLO txt finished!&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="example">Example&lt;/h3>
&lt;p>Assuming we have a COCO Json file &lt;code>_annotations.coco.json&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-json" data-lang="json">&lt;span class="line">&lt;span class="cl">&lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;info&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;year&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;2020&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;version&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;1&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;description&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Exported from roboflow.ai&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;contributor&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Roboflow&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;url&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;https://app.roboflow.ai/datasets/hard-hat-sample/1&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;date_created&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;2000-01-01T00:00:00+00:00&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;licenses&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;url&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;https://creativecommons.org/publicdomain/zero/1.0/&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Public Domain&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;categories&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Workers&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;supercategory&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;none&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;head&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;supercategory&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Workers&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;helmet&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;supercategory&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Workers&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">3&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;person&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;supercategory&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Workers&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;images&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;license&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;file_name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;0001.jpg&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;height&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">275&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;width&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">490&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;date_captured&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;2020-07-20T19:39:26+00:00&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;annotations&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;image_id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;category_id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;bbox&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">45&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">85&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">85&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;area&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">7225&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;segmentation&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;iscrowd&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;image_id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;category_id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;bbox&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">324&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">29&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">72&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">81&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;area&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">5832&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;segmentation&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;iscrowd&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">convert_coco_json_to_yolo_txt&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;output&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;_annotations.coco.json&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">Categories: 100%|██████████| 4/4 [00:00&amp;lt;00:00, 2471.24it/s]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Annotation txt for each iamge: 100%|██████████| 1/1 [00:00&amp;lt;00:00, 1800.13it/s]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Converting COCO Json to YOLO txt finished!
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>An folder named &lt;code>output&lt;/code> is created and has the structure:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">- output
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- 0001.txt
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- _darknet.labels
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Content of &lt;code>_darknet.labels&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">Workers
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">head
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">helmet
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">person
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Content of &lt;code>0001.txt&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">2 0.178571 0.161818 0.173469 0.309091
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">2 0.734694 0.252727 0.146939 0.294545
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Instruction from YOLO v4 repo: &lt;a href="https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects">https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://github.com/AlexeyAB/Yolo_mark/issues/60#issuecomment-401854885">Specific format of annotation&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://www.cnblogs.com/hejunlin1992/p/9925293.html">darknet训练yolov3时的一些注意事项&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://manivannan-ai.medium.com/how-to-train-yolov2-to-detect-custom-objects-9010df784f36">How to train YOLOv2 to detect custom objects&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://roboflow.com/formats">Computer Vision Annotation Formats&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>Reference: &lt;a href="https://github.com/AlexeyAB/Yolo_mark/issues/60">https://github.com/AlexeyAB/Yolo_mark/issues/60&lt;/a>&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>YOLOv4: Training Tips</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/yolov4-training-tips/</link><pubDate>Sat, 19 Dec 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/yolov4-training-tips/</guid><description>&lt;h2 id="model-zoo">Model zoo&lt;/h2>
&lt;p>&lt;a href="https://github.com/AlexeyAB/darknet/wiki/YOLOv4-model-zoo#yolov4-model-zoo">YOLOv4 model zoo&lt;/a>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Pretrained models&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Proper configuration based on GPU&lt;/p>
&lt;blockquote>
&lt;p>We do NOT suggest you train the model with subdivisions equal or larger than 32, it will takes very long training time.&lt;/p>
&lt;/blockquote>
&lt;/li>
&lt;/ul>
&lt;h2 id="faq">FAQ&lt;/h2>
&lt;h3 id="low-accuracy-1">Low accuracy &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h3>
&lt;h4 id="the-most-common-problem---you-do-not-follow-strictly-the-manual">The most common problem - you do NOT follow strictly the manual.&lt;/h4>
&lt;ul>
&lt;li>You must use
&lt;ul>
&lt;li>&lt;code>default anchors&lt;/code>&lt;/li>
&lt;li>&lt;code>learning_rate=0.001&lt;/code>&lt;/li>
&lt;li>&lt;code>batch=64&lt;/code>&lt;/li>
&lt;li>&lt;code>max_batches = max(6000, number_of_training_images, 2000*classes)&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>You can only change &lt;code>subdivisions&lt;/code>&lt;/li>
&lt;li>&lt;strong>Do not do anything that is not written in the manual.&lt;/strong> 🙅‍♂️&lt;/li>
&lt;/ul>
&lt;h4 id="your-datasets-are-wrong">Your datasets are wrong.&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>check the AP50 (average precision) for validation and training dataset by using &lt;code>./darknet detector map obj.data yolo.cfg yolo.weights&lt;/code>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>If you get high mAP for both Training and Validation datasets, but the network detects objects poorly in real life, then your training dataset is not representative &amp;ndash;&amp;gt; &lt;strong>add more images from real life to it&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If you get high mAP for Training dataset, but low for Validation dataset, then your Training dataset isn&amp;rsquo;t suitable for Validation dataset.&lt;/p>
&lt;p>For example&lt;/p>
&lt;ul>
&lt;li>Training dataset contains: cars (rear view) from distance 100m&lt;/li>
&lt;li>Test dataset contains: cars (side view) from distance 5m&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>if you get low mAP for both Training and Validation datasets, then labels in your Training dataset are wrong&lt;/p>
&lt;ul>
&lt;li>Run training with flag &lt;code>-show_imgs&lt;/code>, i.e. &lt;code>./darknet detector train ... -show_imgs&lt;/code> , do you see correct bounded boxes?&lt;/li>
&lt;li>Or check your dataset by using &lt;a href="https://github.com/AlexeyAB/Yolo_mark">Yolo_mark&lt;/a> tool&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="darknet-trainingdetection-crashes-with-an-error-2">Darknet training/detection crashes with an error &lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/h3>
&lt;ul>
&lt;li>If &lt;code>CUDA Out of memory&lt;/code> error occurs, then increase &lt;code>subdivisions=&lt;/code> 2 times in cfg-file, but not higher than &lt;code>batch=&lt;/code> (don&amp;rsquo;t change batch)!
&lt;ul>
&lt;li>If it doesn&amp;rsquo;t help - set &lt;code>random=0&lt;/code> and &lt;code>width=416 height=416&lt;/code> in cfg-file.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Check content of files &lt;code>bad.list&lt;/code> and &lt;code>bad_label.list&lt;/code> if they exist near with &lt;code>./darknet&lt;/code> executable file.&lt;/li>
&lt;li>Do not move some files from Darknet folder - you may forget the necessary files.&lt;/li>
&lt;li>Download libraries CUDA, cuDNN, OpenCV, &amp;hellip; only from official sources. Don&amp;rsquo;t download libs from other sites.&lt;/li>
&lt;li>Make sure that you do everything in accordance with the manual, and do not do anything that is not written in the manual.&lt;/li>
&lt;/ul>
&lt;h2 id="train-with-multiple-gpus-3">Train with multiple GPUs &lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>Train it first on 1 GPU for like 1000 iterations:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">./darknet detector train cfg/coco.data cfg/yolov4.cfg yolov4.conv.137
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>
&lt;p>Then stop and by using partially-trained model &lt;code>/backup/yolov4_1000.weights&lt;/code>. Run training with multigpu (up to 4 GPUs): &lt;code>./darknet detector train cfg/coco.data cfg/yolov4.cfg /backup/yolov4_1000.weights -gpus 0,1,2,3&lt;/code>&lt;/p>
&lt;blockquote>
&lt;p>If you get a Nan, then for some datasets better to decrease learning rate, for 4 GPUs set &lt;code>learning_rate = 0,00065&lt;/code> (i.e. learning_rate = 0.00261 / GPUs). In this case also increase 4x times &lt;code>burn_in =&lt;/code> in your cfg-file. I.e. use &lt;code>burn_in = 4000&lt;/code> instead of &lt;code>1000&lt;/code>.&lt;/p>
&lt;/blockquote>
&lt;/li>
&lt;/ol>
&lt;h2 id="train-custom-datasets">Train custom datasets&lt;/h2>
&lt;p>Configuration setup see: &lt;a href="https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/train-yolo-v4-custom-dataset/">Train YOLO v4 on Custom Dataset&lt;/a>&lt;/p>
&lt;p>Start training:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">./darknet detector train data/obj.data &amp;lt;custom-cfg&amp;gt; yolov4.conv.137
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>
&lt;p>File &lt;code>&amp;lt;custom-cfg&amp;gt;_last.weights&lt;/code> will be saved to &lt;code>backup/&lt;/code> for each 100 iterations&lt;/p>
&lt;/li>
&lt;li>
&lt;p>File &lt;code>&amp;lt;custom-cfg&amp;gt;_xxxx.weights&lt;/code> will be saved to &lt;code>backup/&lt;/code> for each 1000 iterations&lt;/p>
&lt;/li>
&lt;li>
&lt;p>if you train on server without monitor, disable Loss-window by using argument &lt;code>--dont_show&lt;/code>. I.e.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">./darknet detector train data/obj.data &amp;lt;custom-cfg&amp;gt; yolov4.conv.137 -dont_show
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>
&lt;p>To see the mAP &amp;amp; Loss-chart during training on remote server without GUI, use&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">./darknet detector train data/obj.data &amp;lt;custom-cfg&amp;gt; yolov4.conv.137 -dont_show -mjpeg_port &lt;span class="m">8090&lt;/span> -map
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Then open URL &lt;code>http://ip-address:8090&lt;/code> in browser&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For training with mAP calculation for each 4 Epochs, you need to&lt;/p>
&lt;ul>
&lt;li>
&lt;p>set &lt;code>valid=valid.txt&lt;/code> or &lt;code>train.txt&lt;/code> in &lt;code>obj.data&lt;/code> file&lt;/p>
&lt;/li>
&lt;li>
&lt;p>run training with &lt;code>-map&lt;/code> argument&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">./darknet detector train data/obj.data &amp;lt;custom-cfg&amp;gt; yolov4.conv.137 -map
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>After training is complete - get result &lt;code>yolo-obj_final.weights&lt;/code> from &lt;code>backup/&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>After each 100 iterations you can stop and later start training from this point. For example, after 2000 iterations you can stop training, and later just start training using:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">./darknet detector train data/obj.data &amp;lt;custom-cfg&amp;gt; backup/yolo-obj_2000.weights
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>
&lt;p>You can get result earlier than all 45000 iterations.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="notes-">Notes 📝&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>If during training you see &lt;code>nan&lt;/code> values for &lt;code>avg&lt;/code> (loss) field, then training goes wrong. 😭&lt;/p>
&lt;p>But if &lt;code>nan&lt;/code> is in some other lines, then training goes well. 🙏&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If you changed &lt;code>width=&lt;/code> or &lt;code>height=&lt;/code> in your cfg-file, then new width and height must be &lt;strong>divisible by 32&lt;/strong>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If error &lt;code>Out of memory&lt;/code> occurs then in &lt;code>.cfg&lt;/code>-file you should increase &lt;code>subdivisions=16&lt;/code>, 32 or 64&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="when-should-i-stop-training-4">When should I stop training &lt;sup id="fnref:4">&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref">4&lt;/a>&lt;/sup>&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Usually sufficient 2000 iterations for each class(object),&lt;/p>
&lt;ul>
&lt;li>but NOT less than number of training images and&lt;/li>
&lt;li>NOT less than 6000 iterations in total.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>During training, you will see varying indicators of error, and you should stop when no longer decreases &lt;strong>0.XXXXXXX avg&lt;/strong>&lt;/p>
&lt;blockquote>
&lt;p>For example&lt;/p>
&lt;p>&lt;strong>9002&lt;/strong>: 0.211667, &lt;strong>0.60730 avg&lt;/strong>, 0.001000 rate, 3.868000 seconds, 576128 images Loaded: 0.000000 seconds&lt;/p>
&lt;ul>
&lt;li>&lt;strong>9002&lt;/strong> - iteration number (number of batch)&lt;/li>
&lt;li>&lt;strong>0.60730 avg&lt;/strong> - average loss (error) - &lt;strong>the lower, the better&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/blockquote>
&lt;p>he final avgerage loss can be from &lt;code>0.05&lt;/code> (for a small model and easy dataset) to &lt;code>3.0&lt;/code> (for a big model and a difficult dataset).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>if you train with flag &lt;code>-map&lt;/code> then you will see mAP indicator like &lt;code>Last accuracy mAP@0.5 = 18.50%&lt;/code> in the console. This indicator is better than Loss, so keep training while mAP increases.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="choose-the-best-weights">Choose the best weights&lt;/h2>
&lt;p>Once training is stopped, you should take some of last &lt;code>.weights&lt;/code>-files from &lt;code>backup/&lt;/code> and choose the best of them.&lt;/p>
&lt;p>&lt;em>For example, you stopped training after 9000 iterations, but the best result can give one of previous weights (7000, 8000, 9000). It can happen due to overfitting.&lt;/em>&lt;/p>
&lt;p>In order to choose best weight, just train with &lt;code>-map&lt;/code> flag&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">./darknet detector train data/obj.data &amp;lt;custom-cfg&amp;gt; yolov4.conv.137 -dont_show -map
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>So you will see mAP-chart (red-line) in the Loss-chart Window looks like the following figure. mAP will be calculated for each 4 Epochs using &lt;code>valid=valid.txt&lt;/code> file that is specified in &lt;code>obj.data&lt;/code> file (&lt;code>1 Epoch = images_in_train_txt / batch&lt;/code> iterations)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/68747470733a2f2f6873746f2e6f72672f776562742f79642f766c2f61672f7964766c616775746f66327a636e6a6f64737467726f656e3861632e6a706567.jpeg" alt="loss_chart_map_chart">&lt;/p>
&lt;h2 id="how-to-improve-object-detection-5">How to improve object detection&lt;sup id="fnref:5">&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref">5&lt;/a>&lt;/sup>&lt;/h2>
&lt;p>Before training&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Set flag &lt;code>random=1&lt;/code> in your &lt;code>.cfg&lt;/code>-file - it will increase precision by training Yolo for different resolutions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>increase network resolution in your &lt;code>.cfg&lt;/code>-file (&lt;code>height=608&lt;/code>, &lt;code>width=608&lt;/code> or any value multiple of 32) - it will increase precision&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Check that each object that you want to detect is mandatory labeled in your dataset - no one object in your data set should not be without label.&lt;/p>
&lt;ul>
&lt;li>In the most training issues, there are wrong labels in your dataset. Always check your dataset by using: &lt;a href="https://github.com/AlexeyAB/Yolo_mark">https://github.com/AlexeyAB/Yolo_mark&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>My Loss is very high and mAP is very low, is training wrong?&lt;/p>
&lt;p>&amp;ndash;&amp;gt; Run training with &lt;code>-show_imgs&lt;/code> flag at the end of training command, do you see correct bounded boxes of objects? If no, your training dataset is wrong.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For each object which you want to detect - there must be &lt;strong>at least 1 similar object&lt;/strong> in the Training dataset with about the same: shape, side of object, relative size, angle of rotation, tilt, illumination.&lt;/p>
&lt;ul>
&lt;li>So desirable that your training dataset include images with objects at diffrent: scales, rotations, lightings, from different sides, on different backgrounds&lt;/li>
&lt;li>You should preferably have 2000 different images for each class or more, and you should train &lt;code>2000*classes&lt;/code> iterations or more&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Desirable that your training dataset include images with non-labeled objects that you do not want to detect, i.e. negative samples without bounded box (empty &lt;code>.txt&lt;/code> files). Use as many images of negative samples as there are images with objects.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>More see: &lt;a href="https://github.com/AlexeyAB/darknet#how-to-improve-object-detection">https://github.com/AlexeyAB/darknet#how-to-improve-object-detection&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>After training, for detection:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Increase network-resolution by set in your &lt;code>.cfg&lt;/code>-file (&lt;code>height=608&lt;/code> and &lt;code>width=608&lt;/code>) or (&lt;code>height=832&lt;/code> and &lt;code>width=832&lt;/code>) or (any value multiple of 32). This increases the precision and makes it possible to detect small objects.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>It is not necessary to train the network again, just use &lt;code>.weights&lt;/code>-file already trained for 416x416 resolution&lt;/p>
&lt;/li>
&lt;li>
&lt;p>To get even greater accuracy you should train with higher resolution 608x608 or 832x832.&lt;/p>
&lt;ul>
&lt;li>Note: if error &lt;code>Out of memory&lt;/code> occurs then in &lt;code>.cfg&lt;/code>-file you should increase &lt;code>subdivisions=16&lt;/code>, 32 or 64&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="other-questions">Other questions&lt;/h2>
&lt;h3 id="will-darknet-automaticly-resize-the-image-size">Will darknet automaticly resize the image size?&lt;/h3>
&lt;p>Yes (see: &lt;a href="https://github.com/AlexeyAB/darknet/issues/5842">https://github.com/AlexeyAB/darknet/issues/5842&lt;/a>)&lt;/p>
&lt;h3 id="does-the-network-have-to-be-perfectly-square">Does the network have to be perfectly square?&lt;/h3>
&lt;blockquote>
&lt;p>No.&lt;/p>
&lt;p>The default network sizes in the common template configuration files are defined as 416x416 or 608x608, but &lt;em>those are only examples!&lt;/em>&lt;/p>
&lt;p>Choose a size that works for you and your images. The only restrictions are:&lt;/p>
&lt;ul>
&lt;li>the width has to be evenly divisible by 32&lt;/li>
&lt;li>the height has to be evenly divisible by 32&lt;/li>
&lt;li>you must have enough video memory to train a network of that size&lt;/li>
&lt;/ul>
&lt;p>Whatever size you choose, Darknet will stretch (without preserving the aspect ratio!) your images to be exactly that size prior to processing the image. This includes both training and inference. So use a size that makes sense for you and the images you need to process, but remember that there are important speed and memory limitations. The larger the size, the slower it will be to train and run, and the more GPU memory will be required.&lt;/p>
&lt;/blockquote>
&lt;p>See:&lt;/p>
&lt;p>&lt;a href="https://www.ccoderun.ca/programming/2020-09-25_Darknet_FAQ/#square_network">https://www.ccoderun.ca/programming/2020-09-25_Darknet_FAQ/#square_network&lt;/a>&lt;/p>
&lt;h3 id="detection-with-aspect-ratio-change">Detection with aspect ratio change&lt;/h3>
&lt;ol>
&lt;li>First of all, a high network resolution is important (the higher, the better). E.g. 800 x 800 will be better than 736 x 416, even if your input image is 1600 x 900.&lt;/li>
&lt;li>The aspect ratio is only of secondary importance.&lt;/li>
&lt;/ol>
&lt;p>See: &lt;a href="https://github.com/AlexeyAB/darknet/issues/131">https://github.com/AlexeyAB/darknet/issues/131&lt;/a>&lt;/p>
&lt;h2 id="useful-resources">Useful resources&lt;/h2>
&lt;ul>
&lt;li>Tips from Roboflow: &lt;a href="https://blog.roboflow.com/yolov4-tactics/">YOLOv4 - Ten Tactics to Build a Better Model&lt;/a>&lt;/li>
&lt;li>Articles from Aleksey Bochkovskiy (author of YOLOv4)
&lt;ul>
&lt;li>&lt;strong>&lt;a href="https://alexeyab84.medium.com/yolov4-the-most-accurate-real-time-neural-network-on-ms-coco-dataset-73adfd3602fe">YOLOv4 — the most accurate real-time neural network on MS COCO dataset.&lt;/a>&lt;/strong>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://alexeyab84.medium.com/scaled-yolo-v4-is-the-best-neural-network-for-object-detection-on-ms-coco-dataset-39dfa22fa982">Scaled YOLO v4 is the best neural network for object detection on MS COCO dataset&lt;/a>&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="https://www.ccoderun.ca/programming/2020-09-25_Darknet_FAQ/#how_to_get_started">DARKNET FAQ&lt;/a>&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>&lt;a href="https://github.com/AlexeyAB/darknet/wiki/FAQ---frequently-asked-questions#1-i-get-low-accuracy">FAQ: I get low accuracy&lt;/a>&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>&lt;a href="https://github.com/AlexeyAB/darknet/wiki/FAQ---frequently-asked-questions#2-darknet-trainingdetection-crashes-with-an-error">FAQ: Darknet training/detection crashes with an error&lt;/a>&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3">
&lt;p>&lt;a href="https://github.com/AlexeyAB/darknet#how-to-train-with-multi-gpu">How to train with multi-GPU&lt;/a>&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:4">
&lt;p>&lt;a href="https://github.com/AlexeyAB/darknet#when-should-i-stop-training">When should I stop training&lt;/a>&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:5">
&lt;p>&lt;a href="https://github.com/AlexeyAB/darknet#how-to-improve-object-detection">How to improve object detection&lt;/a>&amp;#160;&lt;a href="#fnref:5" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>YOLOv5: Train Custom Dataset</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/yolo-v5/</link><pubDate>Fri, 25 Dec 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/yolo-v5/</guid><description>&lt;p>We will learn&lt;/p>
&lt;ul>
&lt;li>training YOLOv5 on our custom dataset&lt;/li>
&lt;li>visualizing training logs&lt;/li>
&lt;li>using trained YOLOv5 for inference&lt;/li>
&lt;li>exporting trained YOLOv5 from PyTorch to other formats.&lt;/li>
&lt;/ul>
&lt;br>
&lt;h2 id="clone-yolov5-and-install-dependencies">Clone YOLOv5 and install dependencies&lt;/h2>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">git clone https://github.com/ultralytics/yolov5
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">cd&lt;/span> yolov5
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">pip install -r requirements.txt
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="prepare-custom-datasets">Prepare custom datasets&lt;/h2>
&lt;h3 id="yolo-darknet-format">YOLO darknet format&lt;/h3>
&lt;p>Dataset in &lt;a href="https://github.com/AlexeyAB/Yolo_mark/issues/60#issuecomment-401854885">YOLO darknet format&lt;/a> has the following structure:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>There&amp;rsquo;s a file named &lt;code>_darknet.labels&lt;/code> containing the object names (one name per line).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For each image file, there is a corresponding &lt;code>.txt&lt;/code> file (same name, but with the &lt;code>.txt&lt;/code> extension) in the same directory. I.e.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">dataset
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">|- train
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- _darknet.labels
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- train_img_001.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- train_img_001.txt
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ...
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- train_img_xxx.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- train_img_xxx.txt
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">|- valid # similar structure as train
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">|- test # similar structure as train
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>
&lt;p>The &lt;code>*.txt&lt;/code> file specifications are:&lt;/p>
&lt;ul>
&lt;li>One row per object&lt;/li>
&lt;li>Each row is &lt;code>class x_center y_center width height&lt;/code> format.&lt;/li>
&lt;li>Box coordinates must be in &lt;strong>normalized xywh&lt;/strong> format (from 0 to 1). If your boxes are in pixels, divide &lt;code>x_center&lt;/code> and &lt;code>width&lt;/code> by the image width, and &lt;code>y_center&lt;/code> and &lt;code>height&lt;/code> by the image height (see the conversion sketch after this list).&lt;/li>
&lt;li>Class numbers are zero-indexed (start from 0).&lt;/li>
&lt;/ul>
&lt;p>For example &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/91506361-c7965000-e886-11ea-8291-c72b98c25eec.jpg" alt="Image Labels">&lt;/p>
&lt;p>The label file corresponding to the above image contains 2 persons (class &lt;code>0&lt;/code>) and a tie (class &lt;code>27&lt;/code>):&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/98809572-0bc4d580-241e-11eb-844e-eee756f878c2.png" alt="img" style="zoom: 67%;" />
&lt;/li>
&lt;/ul>
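&lt;p>As referenced in the list above, converting a pixel-coordinate box to the normalized xywh format can be sketched as follows (a minimal example, not tied to any particular labeling tool; the numbers are made up):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">def to_normalized_xywh(x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to normalized (x_center, y_center, width, height)."""
    x_center = (x_min + x_max) / 2.0 / img_w
    y_center = (y_min + y_max) / 2.0 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return x_center, y_center, width, height

# e.g. a box from (100, 50) to (300, 350) in a 640x480 image:
print(to_normalized_xywh(100, 50, 300, 350, 640, 480))
# roughly (0.3125, 0.4167, 0.3125, 0.625)
&lt;/code>&lt;/pre>&lt;/div>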
&lt;h3 id="yolov5-format">YOLOv5 format&lt;/h3>
&lt;p>&lt;a href="https://github.com/ultralytics/yolov5/wiki/Train-Custom-Data">YOLOv5 format&lt;/a>:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>If there are no objects in an image, no &lt;code>*.txt&lt;/code> file is required&lt;/p>
&lt;/li>
&lt;li>
&lt;p>YOLOv5 locates labels automatically for each image by replacing the last instance of &lt;strong>/images/&lt;/strong> in the image path with &lt;strong>/labels/&lt;/strong> (a small path-mapping sketch follows this list). Therefore, the folder structure of the dataset should look like below:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">dataset
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">|- images
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- train
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- train_img_001.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ...
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- train_img_xxx.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- valid
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- test
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">|- labels
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- train
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- train_img_001.txt
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ...
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- train_img_xxx.txt
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- valid
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- test
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;/ul>
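&lt;p>As noted above, the image-to-label mapping is a pure path substitution. A simplified sketch of what YOLOv5 does (the example path is made up; YOLOv5 itself replaces only the last occurrence of &lt;strong>/images/&lt;/strong>):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">img_path = "custom-dataset/images/train/train_img_001.jpg"  # hypothetical example
label_path = img_path.replace("/images/", "/labels/").rsplit(".", 1)[0] + ".txt"
print(label_path)  # custom-dataset/labels/train/train_img_001.txt
&lt;/code>&lt;/pre>&lt;/div>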
&lt;h3 id="yolo-darknet-format----yolov5-format">YOLO darknet format &amp;ndash;&amp;gt; YOLOv5 format&lt;/h3>
&lt;p>Assuming we have a dataset in YOLO darknet format, we want to convert it to YOLOv5 format.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pathlib&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">Path&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">shutil&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">rmtree&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">copy2&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">tqdm&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">tqdm&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">copy_files&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">src_dir&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">dest_dir&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ext&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;jpg&amp;#34;&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> Copy files with the same extension from source directory to destination directory
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> Parameters
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> ----------
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> src_dir : str
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> source directory
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> dest_dir : str
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> destination directory
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> ext : str, optional
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> extension of files to be moved, by default &amp;#34;jpg&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">file&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">tqdm&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">Path&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">src_dir&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">glob&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;*.&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">ext&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">desc&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;Copying .&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">ext&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> files from &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">src_dir&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> to &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">dest_dir&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">copy2&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">file&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">dest_dir&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">convert_dataset_darknet_to_yolov5&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">src_dir_darknet&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">dest_dir_yolov5&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">dataset_types&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;train&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;valid&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;test&amp;#34;&lt;/span>&lt;span class="p">]):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> Convert dataset from YOLO darknet format to scaled YOLOv4 format
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> Parameters
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> ----------
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> src_dir_darknet : str
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> source dataset in YOLO darknet format
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> dest_dir_scaled_yolov4 : str
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> destination dataset in scaled YOLOv4 format
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> dataset_types : list, optional
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> types of dataset, by default [&amp;#34;train&amp;#34;, &amp;#34;valid&amp;#34;]
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dest_dir_yolov5&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">Path&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">dest_dir_yolov5&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">dest_dir_yolov5&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">exists&lt;/span>&lt;span class="p">():&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">rmtree&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">dest_dir_yolov5&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dest_dir_yolov5&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">mkdir&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="nb">dir&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;images&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;labels&amp;#34;&lt;/span>&lt;span class="p">]:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">dataset_type&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">dataset_types&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dest_dir&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">dest_dir_yolov5&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">joinpath&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="nb">dir&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">dataset_type&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dest_dir&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">mkdir&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">parents&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">src_dir&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">Path&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">src_dir_darknet&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">joinpath&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">dataset_type&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">ext&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;jpg&amp;#34;&lt;/span> &lt;span class="k">if&lt;/span> &lt;span class="nb">dir&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s2">&amp;#34;images&amp;#34;&lt;/span> &lt;span class="k">else&lt;/span> &lt;span class="s2">&amp;#34;txt&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">copy_files&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">src_dir&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">dest_dir&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ext&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">ext&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;Copy &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="nb">dir&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> from &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">src_dir&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> to &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">dest_dir&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> done!&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="define-training-configuration">Define training configuration&lt;/h2>
&lt;p>For training we need to configure a &lt;code>.yaml&lt;/code> file which specifies&lt;/p>
&lt;ul>
&lt;li>
&lt;p>download commands/URL for auto-downloading (optional)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>the paths of the training and validation folders&lt;/p>
&lt;/li>
&lt;li>
&lt;p>number of classes&lt;/p>
&lt;/li>
&lt;li>
&lt;p>class names&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>and &lt;strong>put this &lt;code>.yaml&lt;/code> file in &lt;code>yolov5/data/&lt;/code>.&lt;/strong>&lt;/p>
&lt;p>For example, let&amp;rsquo;s say we have a &lt;code>custom-dataset&lt;/code> folder in YOLOv5 format next to &lt;code>yolov5&lt;/code>. This custom dataset contains 3 object classes: &amp;ldquo;cat&amp;rdquo;, &amp;ldquo;dog&amp;rdquo;, &amp;ldquo;monkey&amp;rdquo;.&lt;/p>
&lt;p>Then &lt;code>yolov5/data/custom-dataset.yaml&lt;/code> should look like:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="nt">train&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">../custom-dataset/images/train&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">valid&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">../custom-dataest/images/valid&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">nc&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">3&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">names&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;cat&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;dog&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;monkey&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="select-a-model">Select a model&lt;/h2>
&lt;p>Select a pretrained model to start training from &lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/97808084-edfcb100-1c64-11eb-83eb-ffed43a0859f.png" alt="YOLOv5 Models" style="zoom: 50%;" />
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Model&lt;/th>
&lt;th>APval&lt;/th>
&lt;th>APtest&lt;/th>
&lt;th>AP50&lt;/th>
&lt;th>SpeedGPU&lt;/th>
&lt;th>FPSGPU&lt;/th>
&lt;th>&lt;/th>
&lt;th>params&lt;/th>
&lt;th>GFLOPS&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;a href="https://github.com/ultralytics/yolov5/releases">YOLOv5s&lt;/a>&lt;/td>
&lt;td>37.0&lt;/td>
&lt;td>37.0&lt;/td>
&lt;td>56.2&lt;/td>
&lt;td>&lt;strong>2.4ms&lt;/strong>&lt;/td>
&lt;td>&lt;strong>416&lt;/strong>&lt;/td>
&lt;td>&lt;/td>
&lt;td>7.5M&lt;/td>
&lt;td>17.5&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;a href="https://github.com/ultralytics/yolov5/releases">YOLOv5m&lt;/a>&lt;/td>
&lt;td>44.3&lt;/td>
&lt;td>44.3&lt;/td>
&lt;td>63.2&lt;/td>
&lt;td>3.4ms&lt;/td>
&lt;td>294&lt;/td>
&lt;td>&lt;/td>
&lt;td>21.8M&lt;/td>
&lt;td>52.3&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;a href="https://github.com/ultralytics/yolov5/releases">YOLOv5l&lt;/a>&lt;/td>
&lt;td>47.7&lt;/td>
&lt;td>47.7&lt;/td>
&lt;td>66.5&lt;/td>
&lt;td>4.4ms&lt;/td>
&lt;td>227&lt;/td>
&lt;td>&lt;/td>
&lt;td>47.8M&lt;/td>
&lt;td>117.2&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;a href="https://github.com/ultralytics/yolov5/releases">YOLOv5x&lt;/a>&lt;/td>
&lt;td>&lt;strong>49.2&lt;/strong>&lt;/td>
&lt;td>&lt;strong>49.2&lt;/strong>&lt;/td>
&lt;td>&lt;strong>67.7&lt;/strong>&lt;/td>
&lt;td>6.9ms&lt;/td>
&lt;td>145&lt;/td>
&lt;td>&lt;/td>
&lt;td>89.0M&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>For example, we select YOLOv5s, the smallest and fastest model available. (YOLOv5m, YOLOv5l, YOLOv5x work similarly.)&lt;/p>
&lt;p>In order to use YOLOv5s for training on a custom dataset, we need to adjust &lt;code>models/yolov5s.yaml&lt;/code>: &lt;strong>change the number of classes &lt;code>nc&lt;/code> according to our custom dataset.&lt;/strong> Following the example above, the value of &lt;code>nc&lt;/code> is 3.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">models_dir&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;yolov5/models&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">yolov5s&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">models_dir&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;yolov5s.yaml&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">yolov5s_custom&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">models_dir&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;yolov5s_custom.yaml&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">num_class&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">3&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">with&lt;/span> &lt;span class="nb">open&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yolov5s&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;r&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">reader&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nb">open&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yolov5s_custom&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;w&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">writer&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">lines&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">reader&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">readlines&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># change number of classes according to custom dataset&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">lines&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;nc: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">num_class&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> # number of classes&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">writer&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">writelines&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">lines&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="train">Train&lt;/h2>
&lt;p>Now we&amp;rsquo;re ready for training YOLOv5 on our custom dataset.&lt;/p>
&lt;p>To kick off training, we execute &lt;code>train.py&lt;/code> with the following options:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>img:&lt;/strong> define input image size&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>batch:&lt;/strong> determine batch size&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>epochs:&lt;/strong> define the number of training epochs. (Note: 3000 or more epochs are common here!)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>data:&lt;/strong> set the path to our yaml file&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>cfg:&lt;/strong> specify our model configuration&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>weights:&lt;/strong> specify a custom path to weights.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Use pretrained weights (recommended): &lt;code>--weights yolov5s.pt&lt;/code>&lt;/p>
&lt;p>(Pretrained weights are auto-downloaded from the latest YOLOv5 release.)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use randomly initialized weights (NOT recommended!): &lt;code>--weights ''&lt;/code>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>name:&lt;/strong> result names&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>nosave:&lt;/strong> only save the final checkpoint&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>cache:&lt;/strong> cache images for faster training&lt;/p>
&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">python train.py --img &lt;span class="m">416&lt;/span> --batch &lt;span class="m">16&lt;/span> --epochs &lt;span class="m">1000&lt;/span> --data ./data/masks.yaml --cfg ./models/yolov5s_masks.yaml --weights yolov5s.pt --cache-images
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="training-logging">Training logging&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>All training results are saved to &lt;code>runs/train/&lt;/code> with incrementing run directories, i.e. &lt;code>runs/train/exp&lt;/code>, &lt;code>runs/train/exp1&lt;/code>, &lt;code>runs/train/exp2&lt;/code>, etc.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>We can view training losses and performance metrics using &lt;strong>Tensorboard&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>If training on Google Colab:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">%load_ext tensorboard
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">%tensorboard --logdir runs
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Training losses and performance metrics are also saved to a logfile.&lt;/p>
&lt;ul>
&lt;li>If given no name, it defaults to &lt;code>results.txt&lt;/code>. We can also specify the name with &lt;code>--name&lt;/code> flag when we train.&lt;/li>
&lt;li>&lt;code>results.png&lt;/code> contains plotting of different metrics&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="run-inference-with-trained-weights">Run inference with trained weights&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Trained weights are saved by default in &lt;code>runs/train/exp/weights&lt;/code> folder.&lt;/p>
&lt;ul>
&lt;li>The best weights &lt;code>best.pt&lt;/code> and the last weights &lt;code>last.pt&lt;/code> are saved&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>For inference we use &lt;code>detect.py&lt;/code>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">python detect.py --weights ./runs/train/exp/weights/best.pt --img &lt;span class="m">416&lt;/span> --conf-thres 0.5 --source &amp;lt;path-to-test-set&amp;gt;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;/ul>
&lt;h2 id="export-a-trained-yolov5-model">Export a trained YOLOv5 model&lt;/h2>
&lt;ul>
&lt;li>Install dependencies&lt;/li>
&lt;li>Use &lt;code>models/export.py&lt;/code> to export to ONNX, TorchScript and CoreML formats&lt;/li>
&lt;/ul>
&lt;h2 id="google-colab-notebook">Google Colab Notebook&lt;/h2>
&lt;p>Open in &lt;a href="https://colab.research.google.com/drive/1lu3sSPWUzuxJTMqcwdFTC-iXvagIAKk2">Colab&lt;/a>&lt;/p>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>YOLOv5 repo: &lt;a href="https://github.com/ultralytics">ultralytics&lt;/a>/&lt;strong>&lt;a href="https://github.com/ultralytics/yolov5">yolov5&lt;/a>&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Developed actively&lt;/li>
&lt;li>&lt;a href="https://github.com/ultralytics/yolov5/wiki">Tutorials&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Tutorials&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="https://github.com/ultralytics/yolov5/wiki">Official tutorials&lt;/a> from YOLOv5 repo&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/ultralytics/yolov5/wiki/Train-Custom-Data">Train Custom Data&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/ultralytics/yolov5/issues/251">ONNX and TorchScript Export&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Roboflow tutorials&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Blog post: &lt;a href="https://blog.roboflow.com/how-to-train-yolov5-on-a-custom-dataset/">How to Train YOLOv5 On a Custom Dataset&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://colab.research.google.com/drive/1gDZ2xcTOgR39tGGs-EZ6i3RTs16wmzZQ">Google Colab Notebook&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Video tutorial&lt;/p>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/MdF6x6ZmLAY?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Very detailed tutorial and explanation: &lt;a href="https://blog.csdn.net/g11d111/article/details/108872076">Yolov5 系列2&amp;mdash; 如何使用Yolov5训练你自己的数据集&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>YOLOv5 explanation: &lt;a href="https://www.xiaoheidiannao.com/211455.html">深入浅出Yolo系列之Yolov5核心基础知识完整讲解&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>&lt;a href="https://github.com/ultralytics/yolov5/wiki/Train-Custom-Data#2-create-labels">https://github.com/ultralytics/yolov5/wiki/Train-Custom-Data#2-create-labels&lt;/a>&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>&lt;a href="https://github.com/ultralytics/yolov5#pretrained-checkpoints">https://github.com/ultralytics/yolov5#pretrained-checkpoints&lt;/a>&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Scaled YOLOv4</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/scaled-yolo-v4/</link><pubDate>Tue, 05 Jan 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/scaled-yolo-v4/</guid><description>&lt;p>Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao (more commonly known by their GitHub monikers, &lt;a href="https://github.com/WongKinYiu">WongKinYiu&lt;/a> and &lt;a href="https://github.com/AlexeyAB">AlexyAB&lt;/a>) have propelled the YOLOv4 model forward by efficiently scaling the network&amp;rsquo;s design and scale, surpassing the previous state-of-the-art EfficientDet published earlier this year by the Google Research/Brain team.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image.png" alt="img" style="zoom:80%;" />
&lt;h2 id="train-scaled-yolov4-pytorch">Train scaled YOLOv4 (PyTorch)&lt;/h2>
&lt;p>The Scaled-YOLOv4 implementation is written in the YOLOv5 PyTorch framework. Training scaled YOLOv4 is similar to &lt;a href="https://haobin-tan.netlify.app/tags/yolov5/">training YOLOv5&lt;/a>.&lt;/p>
&lt;blockquote>
&lt;p>Here is &lt;a href="https://github.com/WongKinYiu/ScaledYOLOv4/blob/yolov4-large/models/yolov4-csp.yaml">the Scaled-YOLOv4 repo&lt;/a>, though you will notice that &lt;a href="https://github.com/WongKinYiu">WongKinYiu&lt;/a> has provided it there predominantly for research replication purposes and there are not many instructions for training on your own dataset. To train on your own data, our guide on &lt;a href="https://blog.roboflow.com/how-to-train-yolov5-on-a-custom-dataset/">training YOLOv5 in PyTorch on custom data&lt;/a> will be useful, as it is a very similar training procedure.&lt;/p>
&lt;/blockquote>
&lt;p>Tutorials from Roboflow:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Video tutorial:&lt;/p>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/rEbpKxZbvIo?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://blog.roboflow.com/how-to-train-scaled-yolov4/">Blog post&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://colab.research.google.com/drive/1LDmg0JRiC2N7_tx8wQoBzTB0jUZhywQr?usp=sharing">Google Colab Notebook&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>My Colab Notebook: &lt;a href="https://colab.research.google.com/drive/1GfOzuMCpIcg1luILv7rehfY3Hk4p4SWc">yolov4_scaled.ipynb&lt;/a>&lt;/p>
&lt;h2 id="train-scaled-yolov4-darknet">Train scaled YOLOv4 (Darknet)&lt;/h2>
&lt;p>YOLOv4-csp training is also supported by &lt;a href="https://github.com/AlexeyAB/darknet#pre-trained-models">Darknet&lt;/a>. Training yolov4-csp is similar to training yolov4 and yolov4-tiny, with slight differences:&lt;/p>
&lt;ul>
&lt;li>For config file, use &lt;a href="https://raw.githubusercontent.com/AlexeyAB/darknet/master/cfg/yolov4-csp.cfg">yolov4-csp.cfg&lt;/a>&lt;/li>
&lt;li>For pretrained weights, use &lt;a href="https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v4_pre/yolov4-csp.weights">yolov4-csp.weights&lt;/a>&lt;/li>
&lt;li>For pretrained convolutional layer weights, use &lt;a href="https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v4_pre/yolov4-csp.conv.142">yolov4-csp.conv.142&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Scaled YOLOv4 &lt;a href="https://arxiv.org/abs/2011.08036">paper&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Github repo: &lt;a href="https://github.com/WongKinYiu">WongKinYiu&lt;/a>/&lt;strong>&lt;a href="https://github.com/WongKinYiu/ScaledYOLOv4">ScaledYOLOv4&lt;/a>&lt;/strong> (Different size of model in different branch)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Blog post from AlexeyAB: &lt;a href="https://alexeyab84.medium.com/scaled-yolo-v4-is-the-best-neural-network-for-object-detection-on-ms-coco-dataset-39dfa22fa982">Scaled YOLO v4 is the best neural network for object detection on MS COCO dataset&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Tutorials blog posts:&lt;/p>
&lt;ul>
&lt;li>Roboflow: &lt;a href="https://blog.roboflow.com/scaled-yolov4-tops-efficientdet/">Scaled-YOLOv4 is Now the Best Model for Object Detection&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://bbs.cvmart.net/articles/3674">YOLOv4 团队最新力作！1774fps、COCO 最佳精度，分别适合高低端 GPU 的 YOLO&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://zhuanlan.zhihu.com/p/299385758">上达最高精度，下到最快速度，Scaled-YOLOv4：模型缩放显神威&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>YOLOv3: Train on Custom Dataset</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/train-yolo-v3/</link><pubDate>Tue, 05 Jan 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/train-yolo-v3/</guid><description>&lt;p>Training YOLOv3 as well as YOLOv3 tiny on custom dataset is similar to &lt;a href="https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/train-yolo-v4-custom-dataset/">training YOLOv4 and YOLOv4 tiny&lt;/a>. Only some steps need to be adjusted for YOLOv3 and YOLOv3 tiny:&lt;/p>
&lt;ul>
&lt;li>In step 1, we create our custom config file based on &lt;strong>cfg/yolov3.cfg&lt;/strong> (YOLOv3) and &lt;strong>cfg/yolov3-tiny.cfg&lt;/strong> (YOLOv3 tiny). Then adjust &lt;code>batch&lt;/code>, &lt;code>subdivisions&lt;/code>, &lt;code>steps&lt;/code>, &lt;code>width&lt;/code>, &lt;code>height&lt;/code>, &lt;code>classes&lt;/code>, and &lt;code>filters&lt;/code> just as for YOLOv4.&lt;/li>
&lt;li>In step 6, download different pretrained weights for the convolutional layers
&lt;ul>
&lt;li>for &lt;code>yolov3.cfg, yolov3-spp.cfg&lt;/code> (154 MB): &lt;a href="https://pjreddie.com/media/files/darknet53.conv.74">darknet53.conv.74&lt;/a>&lt;/li>
&lt;li>for &lt;code>yolov3-tiny-prn.cfg , yolov3-tiny.cfg&lt;/code> (6 MB): &lt;a href="https://drive.google.com/file/d/18v36esoXCh-PsOKwyP2GWrpYDptDY8Zf/view?usp=sharing">yolov3-tiny.conv.11&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Tutorial from darknet repo: &lt;a href="https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects">How to train (to detect your custom objects)&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://thebinarynotes.com/how-to-train-yolov3-custom-dataset/">How to train YOLOv3 on the custom dataset&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul></description></item><item><title>Histogram of Oriented Gradients (HOG)</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/hog/</link><pubDate>Sat, 20 Feb 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/hog/</guid><description>&lt;h2 id="what-is-a-feature-descriptor">What is a Feature Descriptor&lt;/h2>
&lt;p>A feature descriptor is a &lt;strong>representation of an image or an image patch that simplifies the image by extracting useful information and throwing away extraneous information&lt;/strong>.&lt;/p>
&lt;p>Typically, a feature descriptor converts an image of size $\text{width} \times \text{height} \times 3 \text{(channels)}$ to a feature vector / array of length $n$. In the case of the HOG feature descriptor, the input image is of size $64 \times 128 \times 3$ and the output feature vector is of length $3780$.&lt;/p>
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">HOG descriptor can be calculated for other sizes. Here we just stick to numbers presented in the original paper for the sake of simplicity.&lt;/span>
&lt;/div>
&lt;h2 id="how-to-calculate-histogram-of-oriented-gradients">How to calculate Histogram of Oriented Gradients?&lt;/h2>
&lt;p>In this section, we will go into the details of calculating the HOG feature descriptor. To illustrate each step, we will use a patch of an image.&lt;/p>
&lt;h3 id="1-preprocessing">1. Preprocessing&lt;/h3>
&lt;p>Typically patches at multiple scales are analyzed at many image locations. The only constraint is that the patches being analyzed have a fixed aspect ratio.&lt;/p>
&lt;p>In our case, the patches need to have an aspect ratio of 1:2. For example, they can be 100×200, 128×256, or 1000×2000 but not 101×205.&lt;/p>
&lt;p>For the example image of size 720x475 below, we select a patch of size 100x200 for calculating HOG feature descriptor. This patch is then cropped out of an image and resized to 64×128.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/hog-preprocessing.jpg" alt="HOG Preprocessing">&lt;/p>
&lt;h3 id="2-calculate-the-gradient-images">2. Calculate the Gradient Images&lt;/h3>
&lt;p>To calculate a HOG descriptor, we need to first calculate the horizontal and vertical gradients; after all, we want to calculate the histogram of gradients.&lt;/p>
&lt;p>Calculating the horizontal and vertical gradients is easily achieved by filtering the image with the following kernels (&lt;strong>Sobel&lt;/strong> operator).&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/gradient-kernels.jpg">&lt;figcaption>
&lt;h4>Kernels for gradient calculation (left: $x$-gradient, right: $y$-gradient).&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>Next, we can find the magnitude and direction of gradient using the following formula:
&lt;/p>
$$
\begin{array}{l}
g=\sqrt{g\_{x}^{2}+g\_{y}^{2}} \\\\
\theta=\arctan \frac{g\_{y}}{g\_{x}}
\end{array}
$$
&lt;p>
The figure below shows the gradients:&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/gradients.png">&lt;figcaption>
&lt;h4>Left : Absolute value of x-gradient. Center : Absolute value of y-gradient. Right : Magnitude of gradient.&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>At every pixel, the gradient has a magnitude and a direction. For color images, the gradients of the three channels are evaluated ( as shown in the figure above ). The magnitude of gradient at a pixel is the maximum of the magnitude of gradients of the three channels, and the angle is the angle corresponding to the maximum gradient.&lt;/p>
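&lt;p>A hedged sketch of this gradient step (using OpenCV on a grayscale image for simplicity; the file name is hypothetical):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">import cv2
import numpy as np

img = cv2.imread("patch.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32) / 255.0

# Horizontal and vertical gradients with the 1-D kernels [-1, 0, 1] (ksize=1 gives exactly these kernels)
gx = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=1)
gy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=1)

# Magnitude and direction per pixel (angles in degrees, 0-360; HOG later folds them into 0-180)
mag, angle = cv2.cartToPolar(gx, gy, angleInDegrees=True)
&lt;/code>&lt;/pre>&lt;/div>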
&lt;h3 id="3-calculate-histogram-of-gradients-in-88-cells">3. Calculate Histogram of Gradients in 8×8 cells&lt;/h3>
&lt;p>In this step, the image is divided into 8×8 cells, and a histogram of gradients is calculated for each 8×8 cell.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/hog-cells.png" alt="8x8 cells of HOG" style="zoom:80%;" />
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">&lt;ul>
&lt;li>
&lt;p>&lt;strong>Why divide into patches?&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Using a feature descriptor to describe a &lt;strong>patch&lt;/strong> of an image provides a compact representation.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Not only is the representation more compact, calculating a histogram over a patch also makes this representation more robust to noise. Individual gradients may be noisy, but a histogram over an 8×8 patch makes the representation much less sensitive to noise.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Why 8x8 patches?&lt;/strong>&lt;/p>
&lt;p>It is a design choice informed by the scale of features we are looking for. HOG was initially used for pedestrian detection. 8×8 cells in a photo of a pedestrian scaled to 64×128 are big enough to capture interesting features (e.g. the face, the top of the head, etc.).&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/span>
&lt;/div>
&lt;p>Let&amp;rsquo;s look at one 8×8 patch in the image and see how the gradients look.&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/hog-cell-gradients.png">&lt;figcaption>
&lt;h4>Center : The RGB patch and gradients represented using arrows. Right : The gradients in the same patch represented as numbers&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;ul>
&lt;li>
&lt;p>The image in the center shows the patch of the image overlaid with arrows showing the gradient — the arrow shows the direction of gradient and its length shows the magnitude. The direction of arrows points to the direction of change in intensity and the magnitude shows how big the difference is.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>On the right, gradient direction is represented by angles between 0 and 180 degrees instead of 0 to 360 degrees. These are called &lt;strong>“unsigned” gradients&lt;/strong> because a gradient and its negative are represented by the same numbers. Empirically it has been shown that unsigned gradients work better than signed gradients for pedestrian detection.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>The next step is to create a histogram of gradients in these 8×8 cells. The histogram contains 9 bins corresponding to angles 0, 20, 40 … 160 degrees (in this representation, 0 degrees corresponds to the $y$-axis).&lt;/p>
&lt;p>The following figure illustrates the process. We are looking at magnitude and direction of the gradient of the same 8×8 patch as in the previous figure. A bin is selected based on the direction, and the vote ( the value that goes into the bin ) is selected based on the magnitude.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/hog-histogram-1.png" alt="Histogram computation in HOG">&lt;/p>
&lt;ul>
&lt;li>For the pixel encircled in blue: It has an angle ( direction ) of 80 degrees and magnitude of 2. So it adds 2 to the 5th bin (bin for angle 80).&lt;/li>
&lt;li>For the pixel encircled in red: It has an angle of 10 degrees and magnitude of 4. Since 10 degrees is half way between 0 and 20, the vote by the pixel splits evenly into the two bins.&lt;/li>
&lt;/ul>
&lt;p>One more detail to be aware of: If the angle is greater than 160 degrees, it is between 160 and 180, and we know the angle wraps around making 0 and 180 equivalent. So in the example below, the pixel with angle 165 degrees contributes &lt;em>&lt;strong>proportionally&lt;/strong>&lt;/em> to the 0 degree bin and the 160 degree bin.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/hog-histogram-2.png" alt="Histogram computation in HOG">&lt;/p>
&lt;p>The contributions of all the pixels in the 8×8 cells are added up to create the 9-bin histogram. For the patch above, it looks like this&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/histogram-8x8-cell.png" alt="Histogram of 8x8 cell ">&lt;/p>
&lt;p>As mentioned above, the $y$-axis corresponds to 0 degrees. We can see the histogram has a lot of weight near 0 and 180 degrees, which is just another way of saying that in this patch the gradients point either up or down.&lt;/p>
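&lt;p>The binning with proportional vote splitting described above can be sketched as follows (a simplified NumPy version for a single 8×8 cell of unsigned gradients, not the exact reference implementation):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">import numpy as np

def cell_histogram(mag, angle, bin_width=20.0, n_bins=9):
    """9-bin gradient histogram of one 8x8 cell; each vote is split proportionally between the two nearest bins."""
    hist = np.zeros(n_bins)
    for m, a in zip(mag.ravel(), angle.ravel()):
        a = a % 180.0                        # unsigned gradients: fold 0-360 into 0-180
        lo = int(a // bin_width) % n_bins    # lower bin (angles 0, 20, ..., 160)
        hi = (lo + 1) % n_bins               # next bin, wrapping 180 back to the 0-degree bin
        frac = (a - lo * bin_width) / bin_width
        hist[lo] += m * (1.0 - frac)
        hist[hi] += m * frac
    return hist
&lt;/code>&lt;/pre>&lt;/div>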
&lt;h3 id="4-1616-block-normalization">4. 16×16 Block Normalization&lt;/h3>
&lt;p>In the last step, we created a histogram based on the gradient of the image. However, gradients of an image are sensitive to overall lighting. Ideally, we want our descriptor to be independent of lighting variations. In other words, we would like to “normalize” the histogram so they are not affected by lighting variations.&lt;/p>
&lt;p>Instead of normalizing just a single 8x8 cell, we&amp;rsquo;ll normalize over a bigger block of 16×16 (i.e. 2×2 cells). A 16×16 block contains 4 histograms, which can be concatenated to form a 36 x 1 element vector and then normalized. The window is then moved by 8 pixels (see animation), a normalized 36×1 vector is calculated over this window, and the process is repeated.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/hog-16x16-block-normalization.gif" alt="HOG 16x16 Block Normalization">&lt;/p>
&lt;h3 id="5-calculate-the-hog-feature-vector">5. Calculate the HOG feature vector&lt;/h3>
&lt;p>To calculate the final feature vector for the entire image patch, the 36×1 vectors are concatenated into one giant vector:&lt;/p>
&lt;ul>
&lt;li>Number of 16x16 blocks: $7 \times 15 = 105$&lt;/li>
&lt;li>Each 16x16 block is represented by a $36\times1$ vector&lt;/li>
&lt;/ul>
&lt;p>Therefore, the giant vector has the dimension $36 \times 105 = 3780$&lt;/p>
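&lt;p>As a quick sanity check of the 3780 figure, the block count and feature length can be derived directly from the 64×128 window (a small sketch; the numbers simply follow the steps above):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">cell = 8       # cell size in pixels
block = 2      # block size in cells (16x16 pixels)
stride = 1     # block stride in cells (8 pixels)
n_bins = 9

cells_x, cells_y = 64 // cell, 128 // cell      # 8 x 16 cells
blocks_x = (cells_x - block) // stride + 1      # 7
blocks_y = (cells_y - block) // stride + 1      # 15
feature_length = blocks_x * blocks_y * block * block * n_bins
print(blocks_x * blocks_y, feature_length)      # 105 3780
&lt;/code>&lt;/pre>&lt;/div>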
&lt;h2 id="hog-visualization">HOG visualization&lt;/h2>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">skimage&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">io&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">skimage.feature&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">hog&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">skimage&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">data&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">exposure&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">google.colab.patches&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">cv2_imshow&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">matplotlib.pyplot&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">plt&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">image&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">io&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imread&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;https://pic4.zhimg.com/80/v2-2ccc671e60031942dca8a129410a0383_720w.jpg&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fd&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">hog_image&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">hog&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">orientations&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">8&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">pixels_per_cell&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">16&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">16&lt;/span>&lt;span class="p">),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">cells_per_block&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">visualize&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">multichannel&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fig&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">ax1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ax2&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">subplots&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">figsize&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">10&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">sharex&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">sharey&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax1&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;off&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax1&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imshow&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">cmap&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">cm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">gray&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax1&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Input image&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Rescale histogram for better display&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">hog_image_rescaled&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exposure&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">rescale_intensity&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">hog_image&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">in_range&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">10&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax2&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;off&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax2&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imshow&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">hog_image_rescaled&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">cmap&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">cm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">gray&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax2&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;HOG&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">show&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Input_vs_HOG.png" alt="Input_vs_HOG" style="zoom: 25%;" />
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://learnopencv.com/histogram-of-oriented-gradients/#disqus_thread">Histogram of Oriented Gradients&lt;/a>: clear and detailed explanation 👍&lt;/li>
&lt;li>&lt;a href="https://shartoo.github.io/2019/03/04/HOG-feature/">HOG特征详解&lt;/a>: HOG visualization&lt;/li>
&lt;li>Video explanation:
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/0Zib1YEE4LU?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;/li>
&lt;/ul></description></item><item><title>Overview of Region-based Object Detectors</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/overview-region-based-detectors/</link><pubDate>Sat, 20 Feb 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/overview-region-based-detectors/</guid><description>&lt;h2 id="sliding-window-detectors">Sliding-window detectors&lt;/h2>
&lt;p>A brute-force approach to object detection is to &lt;strong>slide windows from left to right and from top to bottom&lt;/strong> and identify objects using classification. To detect different object types at various viewing distances, we use windows of varied sizes and aspect ratios.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*-GaZ8hGBKsbtGfRJqvOVHQ.jpeg" alt="Image for post">&lt;/p>
&lt;p>We cut out patches from the picture according to the sliding windows. The patches are warped, since many classifiers accept fixed-size images only.&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*A7DE4HKukbXpQqwvCaLOEQ.jpeg">&lt;figcaption>
&lt;h4>Warp an image to a fixed size image&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>The warped image patch is fed into a CNN to extract a 4096-dimensional feature vector. Then we apply an SVM classifier to identify the class and a linear regressor to refine the bounding box.&lt;/p>
&lt;p>System flow:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*BYSA3iip3Cdr0L_x5r468A.png" alt="Image for post">&lt;/p>
&lt;p>Pseudo-code:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">window&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">windows&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">patch&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">get_patch&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">window&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">results&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">detector&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">patch&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>We create many windows to detect different object shapes at different locations. To improve performance, one obvious solution is to &lt;strong>reduce the number of &lt;em>windows&lt;/em>&lt;/strong>.&lt;/p>
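&lt;p>A minimal sketch of how such windows could be enumerated (the window sizes, aspect ratios, and stride below are arbitrary choices for illustration; &lt;code>get_patch&lt;/code> and &lt;code>detector&lt;/code> stay as placeholders):&lt;/p>
&lt;pre>&lt;code class="language-python">def sliding_windows(img_w, img_h, sizes=((64, 64), (128, 64), (64, 128)), stride=32):
    # Yield (x, y, w, h) boxes for every window size at every valid position
    for (w, h) in sizes:
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                yield (x, y, w, h)

windows = list(sliding_windows(640, 480))
print(len(windows))  # already hundreds of windows for a single small image&lt;/code>&lt;/pre>
&lt;p>Even for one small image and a coarse stride, this already produces hundreds of windows, each of which would have to be classified.&lt;/p>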
&lt;h2 id="selective-search">Selective Search&lt;/h2>
&lt;p>Instead of a brute force approach, we use a region proposal method to create &lt;strong>regions of interest (ROIs)&lt;/strong> for object detection.&lt;/p>
&lt;p>In &lt;strong>selective search&lt;/strong> (&lt;strong>SS&lt;/strong>)&lt;/p>
&lt;ol>
&lt;li>We start with each individual pixel as its own group&lt;/li>
&lt;li>We calculate the texture for each group and combine the two that are closest (to avoid a single region gobbling up all the others, we prefer merging smaller groups first).&lt;/li>
&lt;li>We continue merging regions until everything is combined together.&lt;/li>
&lt;/ol>
&lt;p>The figure below illustrates this process:&lt;/p>
&lt;figure>&lt;img src="https://miro.medium.com/max/700/1*_8BNWWwyod1LWUdzcAUr8w.png">&lt;figcaption>
&lt;h4>In the first row, we show how we grow the regions, and the blue rectangles in the second row show all possible ROIs we made during the merging.&lt;/h4>
&lt;/figcaption>
&lt;/figure>
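&lt;p>OpenCV&amp;rsquo;s contrib module ships an implementation of selective search. A minimal sketch, assuming &lt;code>opencv-contrib-python&lt;/code> is installed and &lt;code>image.jpg&lt;/code> is a local test image:&lt;/p>
&lt;pre>&lt;code class="language-python">import cv2

# Requires the opencv-contrib-python package; image.jpg is an assumed local file
image = cv2.imread("image.jpg")

ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()   # trade some recall for speed

rects = ss.process()               # each rect is (x, y, w, h)
print(len(rects), "region proposals")&lt;/code>&lt;/pre>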
&lt;h2 id="r-cnn-1">R-CNN &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h2>
&lt;p>&lt;strong>Region-based Convolutional Neural Networks (R-CNN)&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>Uses a region proposal method to create about 2000 &lt;strong>ROI&lt;/strong>s (regions of interest).&lt;/li>
&lt;li>The regions are warped into fixed-size images and fed into a CNN individually.&lt;/li>
&lt;li>Uses fully connected layers to classify the object and to refine the bounding box.&lt;/li>
&lt;/ol>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*Wmw21tBUez37bj-1ws7XEw.jpeg">&lt;figcaption>
&lt;h4>R-CNN uses &lt;strong>region proposals&lt;/strong>, a &lt;strong>CNN&lt;/strong>, and &lt;strong>FC layers&lt;/strong> to locate objects.&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-21%2022.08.28.png" alt="截屏2021-02-21 22.08.28">&lt;/p>
&lt;p>&lt;strong>System flow&lt;/strong>:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*ciyhZpgEvxDm1YxZd1SJWg.png" alt="Image for post">&lt;/p>
&lt;p>&lt;strong>Pseudo-code&lt;/strong>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">ROIs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">region_proposal&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># RoI from a proposal method (~2k)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">ROI&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">ROIs&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">patch&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">get_patch&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ROI&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">results&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">detector&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">patch&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>With far fewer but higher-quality ROIs, R-CNN runs faster and is more accurate than sliding windows. However, R-CNN is still very slow, because it needs to do about 2000 independent forward passes for each image! 🤪&lt;/p>
&lt;h2 id="fast-r-cnn-2">Fast R-CNN &lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/h2>
&lt;p>How does Fast R-CNN work?&lt;/p>
&lt;ul>
&lt;li>Instead of extracting features for each image patch from scratch, we use a &lt;strong>feature extractor&lt;/strong> (a CNN) to extract features for the whole image first.&lt;/li>
&lt;li>We also use an &lt;strong>external region proposal method&lt;/strong>, like the selective search, to create ROIs which later combine with the corresponding feature maps to form patches for object detection.&lt;/li>
&lt;li>We warp the patches to a fixed size using &lt;strong>ROI pooling&lt;/strong> and feed them to fully connected layers for classification and &lt;strong>localization&lt;/strong> (detecting the location of the object).&lt;/li>
&lt;/ul>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*Dd3-sugNKInTIv12u8cWkw.jpeg">&lt;figcaption>
&lt;h4>Fast R-CNN applies region proposals &lt;strong>on feature maps&lt;/strong> and forms fixed-size patches using &lt;strong>ROI pooling&lt;/strong>.&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-21%2022.40.17.png">&lt;figcaption>
&lt;h4>Fast R-CNN vs. R-CNN &lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>&lt;strong>System flow&lt;/strong>:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*fLMNHfe_QFxW569s4eR7Dg.jpeg" alt="Image for post">&lt;/p>
&lt;p>&lt;strong>Pseudo-code&lt;/strong>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">feature_maps&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">process&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ROIs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">region_proposal&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">ROI&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">ROIs&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">patch&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">roi_pooling&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">feature_maps&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ROI&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">results&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">detector2&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">patch&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>The expensive feature extraction is moved out of the for-loop. This is a significant speed improvement, since it was previously executed for each of the 2000 ROIs. &amp;#x1f44f;&lt;/li>
&lt;/ul>
&lt;p>One major takeaway for Fast R-CNN is that the whole network (the feature extractor, the classifier, and the bounding box regressor) is trained end-to-end with &lt;strong>multi-task losses&lt;/strong> (classification loss and localization loss). This improves accuracy.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/20180502185247910.png" alt="img">&lt;/p>
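&lt;p>A minimal sketch of such a multi-task loss, assuming PyTorch-style tensors and the convention that class 0 is the background (the function and variable names are illustrative, not Fast R-CNN&amp;rsquo;s actual code):&lt;/p>
&lt;pre>&lt;code class="language-python">import torch
import torch.nn.functional as F

def multi_task_loss(class_scores, box_deltas, gt_classes, gt_deltas, lam=1.0):
    # Classification loss over all sampled ROIs
    cls_loss = F.cross_entropy(class_scores, gt_classes)
    # Localization loss only for foreground ROIs (class index 0 = background)
    fg = gt_classes > 0
    if fg.any():
        loc_loss = F.smooth_l1_loss(box_deltas[fg], gt_deltas[fg])
    else:
        loc_loss = box_deltas.new_zeros(())
    return cls_loss + lam * loc_loss

# Toy shapes: 8 ROIs, 21 classes (20 + background), 4 box offsets per ROI
scores, deltas = torch.randn(8, 21), torch.randn(8, 4)
gt_cls, gt_del = torch.randint(0, 21, (8,)), torch.randn(8, 4)
print(multi_task_loss(scores, deltas, gt_cls, gt_del))&lt;/code>&lt;/pre>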
&lt;h3 id="roi-pooling">ROI pooling&lt;/h3>
&lt;p>Because Fast R-CNN uses fully connected layers, we apply &lt;strong>ROI pooling&lt;/strong> to warp the variable-size ROIs into a predefined fixed-size shape.&lt;/p>
&lt;p>&lt;em>Let&amp;rsquo;s take a look at a simple example: transforming 8 × 8 feature maps into a predefined 2 × 2 shape.&lt;/em>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*LLP4tKGsYGgAx3uPfmGdsw.png" alt="Image for post" style="zoom:80%;" />
&lt;ul>
&lt;li>
&lt;p>Top left: feature maps&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Top right: we overlap the ROI (blue) with the feature maps.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Bottom left: we split the ROI according to the target dimensions. For example, with our 2×2 target, we split the ROI into 4 sections of similar or equal size.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Bottom right: find the &lt;strong>maximum&lt;/strong> for each section (i.e., max-pool within each section), and the result is our warped feature map.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Now we get a 2 × 2 feature patch that we can feed into the classifier and box regressor.&lt;/p>
&lt;p>&lt;em>Another gif example&lt;/em>:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*5V5mycIRNu-mK-rPywL57w.gif" alt="Image for post">&lt;/p>
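&lt;p>A small NumPy sketch of the same idea, assuming a single-channel feature map and an ROI given as (row_start, col_start, row_end, col_end) in feature-map coordinates:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    # roi = (row_start, col_start, row_end, col_end), end-exclusive (assumed convention)
    r0, c0, r1, c1 = roi
    region = feature_map[r0:r1, c0:c1]
    out_h, out_w = output_size
    # Split the ROI into roughly equal sections along each axis
    row_edges = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    col_edges = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    pooled = np.zeros(output_size)
    for i in range(out_h):
        for j in range(out_w):
            section = region[row_edges[i]:row_edges[i + 1], col_edges[j]:col_edges[j + 1]]
            pooled[i, j] = section.max()   # max-pool within each section
    return pooled

feature_map = np.arange(64).reshape(8, 8)       # toy 8x8 feature map
print(roi_max_pool(feature_map, (0, 3, 7, 8)))  # 7x5 ROI warped into a 2x2 patch&lt;/code>&lt;/pre>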
&lt;h3 id="problems-of-fast-r-cnn">Problems of Fast R-CNN&lt;/h3>
&lt;p>Fast R-CNN depends on an external region proposal method like selective search. &lt;strong>However, those algorithms run on the CPU and they are slow&lt;/strong>. At test time, Fast R-CNN takes 2.3 seconds to make a prediction, of which 2 seconds are spent generating the 2000 ROIs!!!&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">feature_maps&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">process&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ROIs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">region_proposal&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># Expensive!&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">ROI&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">ROIs&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">patch&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">roi_pooling&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">feature_maps&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ROI&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">results&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">detector2&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">patch&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="faster-r-cnn-3-make-cnn-do-proposals">&lt;strong>Faster R-CNN&lt;/strong> &lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>: Make CNN do proposals&lt;/h2>
&lt;p>Faster R-CNN adopts a similar design to Fast R-CNN, &lt;strong>except&lt;/strong> that&lt;/p>
&lt;ul>
&lt;li>&lt;strong>it replaces the region proposal method by an internal deep network called Region Proposal Network (RPN)&lt;/strong>&lt;/li>
&lt;li>&lt;strong>the ROIs are derived from the feature maps instead&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;p>System flow: (same as Fast R-CNN)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*F-WbcUMpWSE1tdKRgew2Ug.png" alt="Image for post">&lt;/p>
&lt;p>The network flow is similar, but the region proposal is now produced by an internal convolutional network, the Region Proposal Network (RPN).&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*0cxB2pAxQ0A7AhTl-YT2JQ.jpeg">&lt;figcaption>
&lt;h4>The external region proposal is replaced by an internal Region Proposal Network (RPN).&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*JQfhkHK6V8NRuh-97Pg4lQ.png" alt="Image for post" style="zoom:80%;" />
&lt;p>&lt;strong>Pseudo-code&lt;/strong>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">feature_maps&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">process&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ROIs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">region_proposal&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">feature_maps&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># use RPN&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">ROI&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">ROIs&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">patch&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">roi_pooling&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">feature_maps&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ROI&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">class_scores&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">box&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">detector&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">patch&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">class_probabilities&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">softmax&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">class_scores&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="region-proposal-network-rpn">Region proposal network (RPN)&lt;/h3>
&lt;p>The region proposal network (&lt;strong>RPN&lt;/strong>)&lt;/p>
&lt;ul>
&lt;li>
&lt;p>takes the output feature maps from the first convolutional network as input&lt;/p>
&lt;/li>
&lt;li>
&lt;p>slides 3 × 3 filters over the feature maps to make class-agnostic region proposals using a convolutional network such as the ZF network&lt;/p>
&lt;figure>&lt;img src="https://miro.medium.com/max/1000/1*z0OHn89t0bOIHwoIOwNDtg.jpeg">&lt;figcaption>
&lt;h4>ZF network&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>Other deep networks like VGG or ResNet can be used for more comprehensive feature extraction at the cost of speed.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The ZF network outputs 256 values, which are fed into 2 separate fully connected (FC) layers to predict a bounding box and 2 objectness scores.&lt;/p>
&lt;ul>
&lt;li>The &lt;strong>objectness&lt;/strong> measures whether the box contains an object. We can use a regressor to compute a single objectness score but for simplicity, Faster R-CNN uses a classifier with 2 possible classes: one for the “have an object” category and one without (i.e. the background class).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>For each location in the feature maps, RPN makes &lt;strong>$k$&lt;/strong> guesses&lt;/p>
&lt;p>$\Rightarrow$ RPN outputs $4 \times k$ coordinates (top-left and bottom-right $(x, y)$ coordinates) for bounding box and $2 \times k$ scores for objectness (with vs. without object) per location&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Example: $8 \times 8$ feature maps with a $3 \times 3$ filter, and it outputs a total of $8 \times 8 \times 3$ ROIs (for $k = 3$)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*smu6PiCx4LaPwGIo3HG0GQ.jpeg" alt="Image for post">&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Here we get 3 guesses, which we will refine later. Since we just need one to be correct, we will be better off if our initial guesses have different shapes and sizes.&lt;/p>
&lt;p>Therefore, Faster R-CNN does not make random bounding box proposals. Instead, it predicts offsets like $\delta_x, \delta_y$ relative to the top-left corner of some reference boxes called &lt;strong>anchors&lt;/strong>. We constrain the values of those offsets so that our guesses still resemble the anchors.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*yF_FrZAkXA3XKFA-sf7XZw.png" alt="Image for post">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>To make $k$ predictions per location, we need $k$ anchors centered at each location. Each prediction is associated with a specific anchor but different locations share the &lt;strong>same&lt;/strong> anchor shapes.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*RJoauxGwUTF17ZANQmL8jw.png" alt="Image for post">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Those anchors are carefully pre-selected so they are diverse and cover real-life objects at different scales and aspect ratios reasonably well.&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>This guides the initial training with better guesses and allows each prediction to specialize in a certain shape. This strategy makes early training more stable and easier. 👍&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Faster R-CNN uses far more anchors. It deploys 9 anchor boxes: &lt;strong>3 different scales at 3 different aspect ratios.&lt;/strong> Using 9 anchors per location, it generates 2 × 9 objectness scores and 4 × 9 coordinates per location (a sketch of generating such anchors follows the note below).&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*PszFnq3rqa_CAhBrI94Eeg.png" alt="Image for post">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">&lt;strong>Anchors&lt;/strong> are also called &lt;strong>priors&lt;/strong> or &lt;strong>default boundary boxes&lt;/strong> in different papers.&lt;/span>
&lt;/div>
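&lt;p>A minimal sketch of generating the $3 \times 3 = 9$ anchors for every feature-map location (the stride, scales, and aspect ratios are the defaults reported in the Faster R-CNN paper; the function itself is illustrative):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    # 9 anchors (3 scales x 3 aspect ratios) centered at every feature-map location
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride   # center in image coords
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)     # keep area ~ s**2, w/h = r
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)   # (feat_h * feat_w * 9, 4) boxes as (x1, y1, x2, y2)

print(generate_anchors(2, 2).shape)   # (36, 4): 4 locations x 9 anchors each&lt;/code>&lt;/pre>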
&lt;details>
&lt;summary>Nice example and explanation from Stanford cs231n slide&lt;/summary>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-22%2022.04.55.png" alt="截屏2021-02-22 22.04.55">&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-22%2022.09.09.png" alt="截屏2021-02-22 22.09.09">&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-22%2022.05.14.png" alt="截屏2021-02-22 22.05.14">&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-22%2022.05.22.png" alt="截屏2021-02-22 22.05.22">&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-22%2022.05.32.png" alt="截屏2021-02-22 22.05.32">&lt;/p>
&lt;/details>
&lt;h2 id="region-based-fully-convolutional-networks-r-fcn-4">Region-based Fully Convolutional Networks (R-FCN) &lt;sup id="fnref:4">&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref">4&lt;/a>&lt;/sup>&lt;/h2>
&lt;h3 id="-idea">💡 Idea&lt;/h3>
&lt;p>Let’s assume we only have a feature map detecting the right eye of a face. Can we use it to locate a face? We should be able to: since the right eye should be in the top-left corner of a facial picture, we can use that to locate the face.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*gqxSBKVla8dzwADKgADpWg-20210222160628867.jpeg" alt="Image for post">&lt;/p>
&lt;p>If we have other feature maps specialized in detecting the left eye, the nose or the mouth, we can combine the results together to locate the face better.&lt;/p>
&lt;h3 id="problem-of-faster-r-cnn">Problem of Faster R-CNN&lt;/h3>
&lt;p>In Faster R-CNN, the &lt;em>detector&lt;/em> applies multiple fully connected layers to make predictions. With 2,000 ROIs, it can be expensive.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">feature_maps&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">process&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ROIs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">region_proposal&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">feature_maps&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">ROI&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">ROIs&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">patch&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">roi_pooling&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">feature_maps&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ROI&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">class_scores&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">box&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">detector&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">patch&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># Expensive!&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">class_probabilities&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">softmax&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">class_scores&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="r-fcn-reduce-the-amount-of-work-needed-for-each-roi">R-FCN: reduce the amount of work needed for each ROI&lt;/h3>
&lt;p>R-FCN improves speed by &lt;strong>reducing the amount of work needed for each ROI.&lt;/strong> The region-based feature maps above are independent of ROIs and can be computed outside each ROI. The remaining work is then much simpler and therefore R-FCN is faster than Faster R-CNN.&lt;/p>
&lt;p>&lt;strong>Pseudo-code&lt;/strong>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">feature_maps&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">process&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ROIs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">region_proposal&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">feature_maps&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">score_maps&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">compute_score_map&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">feature_maps&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">ROI&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">ROIs&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">V&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">region_roi_pool&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">score_maps&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ROI&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">class_scores&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">box&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">average&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">V&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># Much simpler!&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">class_probabilities&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">softmax&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">class_scores&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="position-sensitive-score-mapping">&lt;strong>Position-sensitive score mapping&lt;/strong>&lt;/h3>
&lt;p>Let’s consider a 5 × 5 feature map &lt;strong>M&lt;/strong> with a blue square object inside. We divide the square object equally into 3 × 3 regions.&lt;/p>
&lt;p>Now, we create a new feature map from M to detect the top left (TL) corner of the square only. The new feature map looks like the one on the right below. &lt;strong>Only the yellow grid cell [2, 2] is activated.&lt;/strong>&lt;/p>
&lt;figure>&lt;img src="https://miro.medium.com/max/700/1*S0enLblW1t7VK19E1Fs4lw.png">&lt;figcaption>
&lt;h4>Create a new feature map from the left to detect the top left corner of an object.&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>Since we divide the square into 9 parts, we can create 9 feature maps each detecting the corresponding region of the object. These feature maps are called &lt;strong>position-sensitive score maps&lt;/strong> because each map detects (scores) a sub-region of the object.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*HaOHsDYAf8LU2YQ7D3ymOg.png" alt="Image for post">&lt;/p>
&lt;p>Let’s say the dotted red rectangle below is the proposed ROI. We divide it into 3 × 3 regions and ask &lt;strong>how likely it is that each region contains the corresponding part of the object&lt;/strong>.&lt;/p>
&lt;p>For example, how likely is it that the top-left ROI region contains the corresponding top-left part of the object (in the face analogy, the left eye)? We store the results in a 3 × 3 vote array, shown in the right diagram below. For example, &lt;code>vote_array[0][0]&lt;/code> contains the score on whether we find the top-left region of the square object.&lt;/p>
&lt;figure>&lt;img src="https://miro.medium.com/max/700/1*Ym6b1qS0pXpeRVMysvvukg.jpeg">&lt;figcaption>
&lt;h4>Apply ROI onto the feature maps to output a 3 x 3 array.&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>This process of mapping score maps and ROIs to the vote array is called &lt;strong>position-sensitive ROI pooling&lt;/strong>.&lt;/p>
&lt;figure>&lt;img src="https://miro.medium.com/max/700/1*K4brSqensF8wL5i6JV1Eig.png">&lt;figcaption>
&lt;h4>Overlay a portion of the ROI onto the corresponding score map to calculate &lt;code>V[i][j]&lt;/code>&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>After calculating all the values for the position-sensitive ROI pool, &lt;strong>the class score is the average of all its elements.&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*ZJiWcIl2DUyx1-ZqArw33A.png" alt="Image for post" style="zoom:80%;" />
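&lt;p>A toy NumPy sketch of position-sensitive ROI pooling for a single class, assuming the $k \times k$ score maps are stacked along the first axis and the ROI is given in feature-map coordinates:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def ps_roi_pool(score_maps, roi, k=3):
    # score_maps: (k*k, H, W) array, one score map per sub-region (assumed layout)
    # roi: (row_start, col_start, row_end, col_end), end-exclusive
    r0, c0, r1, c1 = roi
    row_edges = np.linspace(r0, r1, k + 1).astype(int)
    col_edges = np.linspace(c0, c1, k + 1).astype(int)
    votes = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            # Sub-region (i, j) of the ROI is read from its own score map
            m = score_maps[i * k + j]
            region = m[row_edges[i]:row_edges[i + 1], col_edges[j]:col_edges[j + 1]]
            votes[i, j] = region.mean()    # average-pool within the sub-region
    return votes

score_maps = np.random.rand(9, 5, 5)            # k = 3 gives 9 position-sensitive maps
votes = ps_roi_pool(score_maps, (1, 1, 5, 5))   # the proposed ROI
class_score = votes.mean()                      # class score = average of all votes&lt;/code>&lt;/pre>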
&lt;h3 id="data-flow">Data flow&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Let’s say we have &lt;strong>$C$&lt;/strong> classes to detect.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>We expand it to $C + 1$ classes so we include a new class for the background (non-object). Each class will have its own $3 \times 3$ score maps and therefore a total of $(C+1) \times 3 \times 3$ score maps.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Using its own set of score maps, we predict a class score for each class.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Then we apply a softmax on those scores to compute the probability for each class.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;figure>&lt;img src="https://miro.medium.com/max/1000/1*Gv45peeSM2wRQEdaLG_YoQ.png">&lt;figcaption>
&lt;h4>Data flow of R-FCN ($k=3$)&lt;/h4>
&lt;/figcaption>
&lt;/figure>
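&lt;p>Putting the numbers together for an assumed setting of $C = 20$ classes and $k = 3$ (so that there are $(C+1) \times k \times k$ score maps in total):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

C, k = 20, 3                          # 20 object classes + 1 background, 3x3 grid
num_score_maps = (C + 1) * k * k      # 189 position-sensitive score maps

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# One averaged vote per class (including background), e.g. from a
# position-sensitive ROI pool as sketched earlier
class_scores = np.random.rand(C + 1)
class_probabilities = softmax(class_scores)
print(num_score_maps, class_probabilities.sum())  # 189, 1.0&lt;/code>&lt;/pre>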
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://jonathan-hui.medium.com/what-do-we-learn-from-region-based-object-detectors-faster-r-cnn-r-fcn-fpn-7e354377a7c9">What do we learn from region based object detectors (Faster R-CNN, R-FCN, FPN)?&lt;/a> - A nice and clear comprehensive tutorial for region-based object detectors&lt;/li>
&lt;li>&lt;a href="http://cs231n.stanford.edu/slides/2020/lecture_12.pdf">Stanford CS231n slides&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://medium.com/cubo-ai/%E7%89%A9%E9%AB%94%E5%81%B5%E6%B8%AC-object-detection-740096ec4540">關於影像辨識，所有你應該知道的深度學習模型&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://blog.csdn.net/v_JULY_v/article/details/80170182">一文读懂目标检测：R-CNN、Fast R-CNN、Faster R-CNN、YOLO、SSD&lt;/a>&lt;/li>
&lt;li>RoI pooling: &lt;a href="https://towardsdatascience.com/understanding-region-of-interest-part-1-roi-pooling-e4f5dd65bb44">Understanding Region of Interest — (RoI Pooling)&lt;/a>&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>Girshick, R., Donahue, J., Darrell, T., &amp;amp; Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. &lt;em>Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition&lt;/em>, 580–587. &lt;a href="https://doi.org/10.1109/CVPR.2014.81">https://doi.org/10.1109/CVPR.2014.81&lt;/a>&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Girshick, R. (2015). Fast R-CNN. &lt;em>Proceedings of the IEEE International Conference on Computer Vision&lt;/em>, &lt;em>2015 International Conference on Computer Vision&lt;/em>, &lt;em>ICCV 2015&lt;/em>, 1440–1448. &lt;a href="https://doi.org/10.1109/ICCV.2015.169">https://doi.org/10.1109/ICCV.2015.169&lt;/a>&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3">
&lt;p>Ren, S., He, K., Girshick, R., &amp;amp; Sun, J. (2017). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. &lt;em>IEEE Transactions on Pattern Analysis and Machine Intelligence&lt;/em>, &lt;em>39&lt;/em>(6), 1137–1149. &lt;a href="https://doi.org/10.1109/TPAMI.2016.2577031">https://doi.org/10.1109/TPAMI.2016.2577031&lt;/a>&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:4">
&lt;p>Dai, J., Li, Y., He, K., &amp;amp; Sun, J. (2016). R-FCN: Object detection via region-based fully convolutional networks. &lt;em>Advances in Neural Information Processing Systems&lt;/em>, 379–387.&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item></channel></rss>