<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>YOLO | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/yolo/</link><atom:link href="https://haobin-tan.netlify.app/tags/yolo/index.xml" rel="self" type="application/rss+xml"/><description>YOLO</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Wed, 02 Dec 2020 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>YOLO</title><link>https://haobin-tan.netlify.app/tags/yolo/</link></image><item><title>You Only Look Once (YOLO)</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/yolo/</link><pubDate>Wed, 04 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/yolo/</guid><description>&lt;p>The problem with the sliding windows method is that it does not output the most accurate bounding boxes. A good way to get more accurate bounding boxes is the &lt;strong>YOLO (You Only Look Once)&lt;/strong> algorithm.&lt;/p>
&lt;h2 id="overview-how-does-yolo-work">Overview: How does YOLO work?&lt;/h2>
&lt;p>Let&amp;rsquo;s say we have an input image (e.g. at 100x100), and we&amp;rsquo;re going to place a grid on this image. For simplicity of illustration, we&amp;rsquo;re going to use a 3x3 grid as an example.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-05%2011.39.03.png" alt="Screenshot 2020-11-05 11.39.03" style="zoom:80%;" />
&lt;p>(In an actual implementation, we&amp;rsquo;ll use a finer grid, like 19x19)&lt;/p>
&lt;h3 id="labels-for-training">Labels for training&lt;/h3>
&lt;p>For &lt;strong>each&lt;/strong> grid cell, we specify a target label $\mathbf{y}$:
&lt;/p>
$$
\mathbf{y} = \left(
\begin{array}{c}
P\_c \\\\
b\_x \\\\
b\_y \\\\
b\_h \\\\
b\_w \\\\
c\_1 \\\\
c\_2 \\\\
\vdots \\\\
c\_n
\end{array}
\right)
\in \mathbb{R}^{5 + n}
$$
&lt;ul>
&lt;li>
&lt;p>$P\_c$: objectness&lt;/p>
&lt;ul>
&lt;li>depends on whether there&amp;rsquo;s an object in that grid cell.&lt;/li>
&lt;li>If yes, then $P\_c = 1$. else $P\_c=0$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Bounding box coordinates&lt;/p>
&lt;ul>
&lt;li>$b\_x, b\_y \in (0, 1)$: describe the center point of the object &lt;strong>relative&lt;/strong> to the grid cell
&lt;ul>
&lt;li>If $>1$, then the center point is outside of the current grid cell and it should be assigned to another grid cell&lt;/li>
&lt;li>Some parameterizations also use a sigmoid function to ensure $b\_x, b\_y \in (0, 1)$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>$b\_h, b\_w$: height and width of the bounding box
&lt;ul>
&lt;li>specified as fractions of the grid cell&amp;rsquo;s height and width, respectively (can be $\geq 1$)&lt;/li>
&lt;li>Some parameterizations also use an exponential function to ensure non-negativity&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>$c\_1, c\_2, \dots, c\_n$: probabilities of the $n$ object classes we want to detect&lt;/p>
&lt;ul>
&lt;li>
&lt;p>E.g. we want to detect 3 classes of object:&lt;/p>
&lt;ul>
&lt;li>pedestrian ($c\_1$),&lt;/li>
&lt;li>car ($c\_2$),&lt;/li>
&lt;li>motorcycle ($c\_3$),&lt;/li>
&lt;/ul>
&lt;p>so our target $\mathbf{y}$ will be:
&lt;/p>
$$
\mathbf{y} = \left(
\begin{array}{c}
P\_c \\\\
b\_x \\\\
b\_y \\\\
b\_h \\\\
b\_w \\\\
c\_1 \\\\
c\_2 \\\\
c\_3
\end{array}
\right)
\in \mathbb{R}^{8}
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
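&lt;p>As a small sketch of how such a target vector can be assembled for one grid cell (using the 3-class example above; the helper function and its argument names are made up for illustration):&lt;/p>

```python
# Hypothetical helper: build the target vector
# y = (P_c, b_x, b_y, b_h, b_w, c_1, ..., c_n) for a single grid cell.
def make_cell_label(has_object, box=None, class_id=None, num_classes=3):
    if not has_object:
        # "don't care" entries; filled with zeros here, since a network
        # cannot literally output a question mark
        return [0.0] * (5 + num_classes)
    b_x, b_y, b_h, b_w = box
    classes = [0.0] * num_classes
    classes[class_id] = 1.0  # one-hot class encoding
    return [1.0, b_x, b_y, b_h, b_w] + classes

# A car (class index 1) centered at (0.4, 0.3) within its cell,
# with box height 0.9 and width 1.8 in grid-cell units:
label = make_cell_label(True, box=(0.4, 0.3, 0.9, 1.8), class_id=1)
```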
&lt;h4 id="example">Example&lt;/h4>
&lt;p>If we consider the upper left grid cell (at position $(0, 0)$)&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-05%2011.58.40.png" alt="Screenshot 2020-11-05 11.58.40" style="zoom:80%;" />
&lt;p>There&amp;rsquo;s no object in this grid cell, so $P\_c = 0$, and we don&amp;rsquo;t care about the remaining elements of $\mathbf{y}$:
&lt;/p>
$$
\mathbf{y} = \left(
\begin{array}{c}
0 \\\\
? \\\\
? \\\\
? \\\\
? \\\\
? \\\\
? \\\\
?
\end{array}
\right)
\in \mathbb{R}^{8}
$$
&lt;blockquote>
&lt;p>Here we use the symbol &lt;code>?&lt;/code> to mark &amp;ldquo;don&amp;rsquo;t care&amp;rdquo;.&lt;/p>
&lt;p>However, a neural network can&amp;rsquo;t output a question mark or a &amp;ldquo;don&amp;rsquo;t care&amp;rdquo;, so it will output some numbers for these elements. These numbers are simply ignored: since the network indicates that there&amp;rsquo;s no object in the cell, it doesn&amp;rsquo;t matter what bounding box or class it outputs there. These values are just noise.&lt;/p>
&lt;/blockquote>
&lt;p>Now, how about the grid cells in the second row?&lt;/p>
&lt;p>To give a bit more detail, this image has two objects. And what the YOLO algorithm does is&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>it takes the midpoint of each of the two objects and then assigns the object to the grid cell containing the midpoint.&lt;/strong> So the left car is assigned to the grid cell marked with green, and the car on the right is assigned to the grid cell marked with yellow.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-05%2012.18.07.png" alt="Screenshot 2020-11-05 12.18.07" style="zoom:80%;" />
&lt;ul>
&lt;li>For the left grid cell marked with green, the target label $\mathbf{y}$ would be as follows:
$$
\mathbf{y} = \left(
\begin{array}{c}
1 \\\\
b\_x \\\\
b\_y \\\\
b\_h \\\\
b\_w \\\\
0 \\\\
1 \\\\
0
\end{array}
\right)
$$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Even though the central grid cell has some parts of both cars, we&amp;rsquo;ll pretend the central grid cell has &lt;strong>no&lt;/strong> interesting object. So the class label of the central grid cell is
&lt;/p>
$$
\mathbf{y} = \left(
\begin{array}{c}
0 \\\\
? \\\\
? \\\\
? \\\\
? \\\\
? \\\\
? \\\\
?
\end{array}
\right)
$$
&lt;/li>
&lt;/ul>
&lt;p>For each of these 9 grid cells, we end up with an 8-dimensional output vector. So the total target output volume is $(3 \times 3) \times 8$.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/yolo1.png" alt="yolo1">&lt;/p>
&lt;p>&lt;strong>Generally speaking, assuming that we have $n \times n$ grid cells, and we want to detect $C$ classes of objects, then the target output volume will be $(n \times n) \times (5 + C)$.&lt;/strong>&lt;/p>
&lt;h3 id="training">Training&lt;/h3>
&lt;p>To train our neural network, the input is $100 \times 100 \times 3$. We then use a usual ConvNet with conv layers, max pooling layers, and so on, which eventually maps to a $3 \times 3 \times 8$ output volume. So we have an input image $X$ and target labels $\mathbf{y}$ of shape $3 \times 3 \times 8$, and we use backpropagation to train the neural network to map any input $X$ to this type of output volume $\mathbf{y}$.&lt;/p>
&lt;h3 id="thumbsup-advantages">&amp;#x1f44d; Advantages&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>The neural network outputs precise bounding boxes &amp;#x1f44f;&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Efficient and fast thanks to convolution operations &amp;#x1f44f;&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="intersection-over-union-iou">Intersection over Union (IoU)&lt;/h2>
&lt;p>How can we tell whether our object detection algorithm is working well?&lt;/p>
&lt;p>The &lt;strong>Intersection-over-Union (IoU)&lt;/strong>, aka Jaccard Index or Jaccard Overlap, measures the degree to which two boxes overlap.&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/IoU.jpg">&lt;figcaption>
&lt;h4>Intersection over Union (IoU). Src: [a-PyTorch-Tutorial-to-Object-Detection](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection)&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>In object detection:
&lt;/p>
$$
\text{IoU} = \frac{\text{Overlapping region between ground truth and prediction bounding box}}{\text{Combined region of ground truth and prediction bounding box}}
$$
&lt;p>
If $\text{IoU} \geq \text{threshold}$, we would say the prediction is correct.&lt;/p>
&lt;p>By convention, $\text{threshold} = 0.5$. We can also choose another value greater than 0.5.&lt;/p>
&lt;p>Example:&lt;/p>
&lt;figure>&lt;img src="https://media5.datahacker.rs/2018/11/IoU.png">&lt;figcaption>
&lt;h4>IoU example. Src: [026 CNN Intersection over Union | Master Data Science](https://www.google.com/url?sa=i&amp;amp;url=http%3A%2F%2Fdatahacker.rs%2Fdeep-learning-intersection-over-union%2F&amp;amp;psig=AOvVaw2K4pvRAkwPw3FZYIelxngf&amp;amp;ust=1604671149058000&amp;amp;source=images&amp;amp;cd=vfe&amp;amp;ved=0CA0QjhxqFwoTCIjNgoLI6-wCFQAAAAAdAAAAABAg)&lt;/h4>
&lt;/figcaption>
&lt;/figure>
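&lt;p>The formula above translates directly into code. A minimal sketch, assuming (for illustration) that boxes are given as corner coordinates (x_min, y_min, x_max, y_max):&lt;/p>

```python
def iou(box_a, box_b):
    # Boxes given as (x_min, y_min, x_max, y_max).
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlapping region (zero if the boxes do not intersect)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    # Combined region = area A + area B - overlap
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))  # overlap 1, union 4 + 4 - 1 = 7
```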
&lt;h2 id="non-max-suppresion">Non-max suppresion&lt;/h2>
&lt;p>One problem with YOLO as described so far is that it can detect the same object multiple times.&lt;/p>
&lt;p>For example:&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/non-max-suppression.png">&lt;figcaption>
&lt;h4>Each car has two or more detections with different probabilities. The reason is that several grid cells think they contain the center point of the object. Src: [a-PyTorch-Tutorial-to-Object-Detection](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection)&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>&lt;strong>Non-max Suppression&lt;/strong> is a way to make sure that YOLO detects each object just once. It cleans up redundant detections, so we end up with just one detection per object rather than multiple detections per object.&lt;/p>
&lt;ol>
&lt;li>Take the detection with the largest $P\_c$ (the probability of a detection) &lt;em>(&amp;ldquo;That&amp;rsquo;s my most confident detection, so let&amp;rsquo;s highlight that and just say I found the car there.&amp;rdquo;)&lt;/em>&lt;/li>
&lt;li>Look at all of the remaining rectangles, and suppress/darken/discard all the ones with a high overlap (i.e. a high IoU) with it&lt;/li>
&lt;/ol>
&lt;p>Example:&lt;/p>
&lt;figure>&lt;img src="https://www.jeremyjordan.me/content/images/2018/07/Screen-Shot-2018-07-10-at-9.46.29-PM.png">&lt;figcaption>
&lt;h4>Non-max suppression example. Src: [An overview of object detection: one-stage methods.](https://www.jeremyjordan.me/object-detection-one-stage/)&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>For multi-class detection, non-max suppression should be carried out &lt;strong>on each class separately&lt;/strong>.&lt;/p>
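&lt;p>The two steps can be sketched as follows, applied to the detections of a single class (the 0.5 IoU threshold is a common default, not a fixed rule; detections are assumed here to be $(P\_c, \text{box})$ pairs with corner-coordinate boxes):&lt;/p>

```python
def nms(detections, iou_threshold=0.5):
    # detections: list of (P_c, (x_min, y_min, x_max, y_max)) for ONE class.
    def iou(a, b):
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / union if union > 0 else 0.0

    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # most confident remaining detection
        kept.append(best)
        survivors = []
        for d in remaining:
            if iou(d[1], best[1]) >= iou_threshold:
                continue  # high overlap with the kept box: suppress it
            survivors.append(d)
        remaining = survivors
    return kept

# Two duplicate detections of one car plus a detection of another car:
dets = [(0.9, (0, 0, 2, 2)), (0.8, (0, 0, 2, 2)), (0.7, (5, 5, 6, 6))]
kept = nms(dets)  # keeps the 0.9 and 0.7 detections
```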
&lt;h2 id="anchor-box">Anchor box&lt;/h2>
&lt;p>One of the problems with object detection as we have seen it so far is that &lt;strong>each of the grid cells can detect only one object&lt;/strong>. What if a grid cell wants to detect multiple objects?&lt;/p>
&lt;p>For example: we want to detect 3 classes (pedestrians, cars, motorcycles), and our input image looks like this:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-05%2015.24.05.png" alt="Screenshot 2020-11-05 15.24.05">&lt;/p>
&lt;p>The midpoint of the pedestrian and the midpoint of the car are in almost the same place, and both of them fall into the same grid cell. If we use the output vector
&lt;/p>
$$
\mathbf{y} = \left(
\begin{array}{c}
P\_c \\\\
b\_x \\\\
b\_y \\\\
b\_h \\\\
b\_w \\\\
c\_1 \\\\
c\_2 \\\\
c\_3
\end{array}
\right)
$$
&lt;p>
we have seen before, the grid cell won&amp;rsquo;t be able to output two detections &amp;#x1f622;.&lt;/p>
&lt;p>With the idea of &lt;strong>anchor boxes&lt;/strong>, we are going to&lt;/p>
&lt;ul>
&lt;li>pre-define a number of different shapes of anchor boxes (in this example, just 2)&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/anchor-box.png" alt="anchor-box">&lt;/p>
&lt;ul>
&lt;li>
&lt;p>and associate them with the class labels
&lt;/p>
$$
\mathbf{y} = \left(\underbrace{P\_c, b\_x, b\_y, b\_h, b\_w, c\_1, c\_2, c\_3}\_{\text{anchor box 1}} , \underbrace{P\_c, b\_x, b\_y, b\_h, b\_w, c\_1, c\_2, c\_3}\_{\text{anchor box 2}}\right)^T \in \mathbb{R}^{16}
$$
&lt;ul>
&lt;li>Because the shape of the pedestrian is more similar to the shape of anchor box 1 than to anchor box 2, we use the first eight numbers to encode the pedestrian.&lt;/li>
&lt;li>Because the box around the car is more similar to the shape of anchor box 2 than to anchor box 1, we use the second eight numbers to encode the car.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>To summarise, with a number of pre-defined anchor boxes, each object in the training image is assigned to&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>the grid cell that contains the object&amp;rsquo;s midpoint, and&lt;/strong>&lt;/li>
&lt;li>&lt;strong>the anchor box of that grid cell with the highest IoU with the ground-truth bounding box&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>In other words, now the object is assigned to a $(\text{grid cell}, \text{anchor box})$ pair.&lt;/strong>&lt;/p>
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">&lt;p>If&lt;/p>
&lt;ul>
&lt;li>we have pre-defined $B$ anchor boxes of different shapes&lt;/li>
&lt;li>the image is divided into an $n \times n$ grid&lt;/li>
&lt;li>we want to detect $C$ classes of objects&lt;/li>
&lt;/ul>
&lt;p>Then the output volume will be
&lt;/p>
$$
(n \times n) \times B(5 + C)
$$
&lt;/span>
&lt;/div>
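&lt;p>As a quick sanity check of this formula on the running example (a $3 \times 3$ grid, $B = 2$ anchor boxes, $C = 3$ classes), here is a small sketch (the function name is made up):&lt;/p>

```python
def output_volume(n, num_anchors, num_classes):
    # (n x n) grid cells, each predicting num_anchors boxes of
    # (P_c, b_x, b_y, b_h, b_w) plus num_classes class probabilities.
    return (n, n, num_anchors * (5 + num_classes))

shape = output_volume(3, 2, 3)  # (3, 3, 16) for the running example
```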
&lt;h3 id="how-to-choose-the-anchor-boxes">How to choose the anchor boxes?&lt;/h3>
&lt;ul>
&lt;li>People used to just choose them &lt;strong>by hand&lt;/strong>, or choose maybe 5 or 10 anchor box shapes that span a variety of shapes covering the types of objects to detect&lt;/li>
&lt;li>A better way is to use a &lt;strong>K-means&lt;/strong> algorithm to cluster the object shapes that tend to occur in the training data (introduced in a later YOLO research paper)&lt;/li>
&lt;/ul>
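&lt;p>A rough sketch of the clustering idea, using plain k-means on (width, height) pairs with Euclidean distance for brevity (the YOLO papers use an IoU-based distance instead; the function name is made up):&lt;/p>

```python
import random

def kmeans_anchors(box_sizes, k, iters=100, seed=0):
    # box_sizes: list of (width, height) of ground-truth boxes.
    # Returns k (width, height) anchor shapes.
    rng = random.Random(seed)
    centers = rng.sample(box_sizes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for w, h in box_sizes:
            # assign each box to the nearest center
            dists = [(w - cw) ** 2 + (h - ch) ** 2 for cw, ch in centers]
            clusters[dists.index(min(dists))].append((w, h))
        # move each center to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(w for w, _ in cl) / len(cl),
                              sum(h for _, h in cl) / len(cl))
    return centers

# Two obvious groups of box shapes (roughly 1x1 and 4x4):
boxes = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (4.0, 4.0), (4.2, 3.8), (3.8, 4.2)]
anchors = kmeans_anchors(boxes, k=2)
```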
&lt;h2 id="putting-them-all-together">Putting them all together&lt;/h2>
&lt;p>Suppose we&amp;rsquo;re trying to train a model to detect three classes of objects:&lt;/p>
&lt;ul>
&lt;li>pedestrians&lt;/li>
&lt;li>cars&lt;/li>
&lt;li>motorcycles&lt;/li>
&lt;/ul>
&lt;p>And the input image looks like this:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-05%2016.07.04.png" alt="Screenshot 2020-11-05 16.07.04" style="zoom:80%;" />
&lt;p>Suppose we have pre-defined two anchor boxes of different shapes&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/anchor-box.png" alt="anchor-box">&lt;/p>
&lt;p>Anchor box 2 has a higher IoU with the ground-truth bounding box of the car, so:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/YOLO.png" alt="YOLO" style="zoom:80%;" />
&lt;p>The final output volume is $3 \times 3 \times 2 \times 8$&lt;/p>
&lt;h3 id="making-predictions">Making predictions&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-05%2016.13.12.png" alt="Screenshot 2020-11-05 16.13.12">&lt;/p>
&lt;h3 id="outputing-the-non-max-supressed-outputs">Outputing the non-max supressed outputs&lt;/h3>
&lt;p>Let&amp;rsquo;s look at a new input image,&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-05%2016.17.32-20201105162152696.png" alt="Screenshot 2020-11-05 16.17.32" style="zoom:67%;" />
&lt;p>and suppose that we still use 2 pre-defined anchor boxes for detecting pedestrians, cars, and motorcycles.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>For each grid cell, get 2 predicted bounding boxes. Notice that some of the bounding boxes can go outside the height and width of the grid cell that they came from&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-05%2016.17.41.png" alt="Screenshot 2020-11-05 16.17.41" style="zoom:67%;" />
&lt;/li>
&lt;li>
&lt;p>Get rid of low probability predictions&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-05%2016.18.48.png" alt="Screenshot 2020-11-05 16.18.48" style="zoom: 67%;" />
&lt;/li>
&lt;li>
&lt;p>For each class, use non-max suppression to generate final predictions. And so the output of this is hopefully that we will have detected all the cars and all the pedestrians in this image.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-05%2016.24.29-20201105162519547.png" alt="Screenshot 2020-11-05 16.24.29" style="zoom:67%;" />
&lt;/li>
&lt;/ol>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="https://www.coursera.org/learn/convolutional-neural-networks">Convolutional Neural Network, &lt;em>Andrew Ng&lt;/em>&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://www.jeremyjordan.me/object-detection-one-stage/">An overview of object detection: one-stage methods.&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul></description></item><item><title>Annotation Conversion: COCO JSON to YOLO Txt</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/coco-json-to-yolo-txt/</link><pubDate>Wed, 02 Dec 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/coco-json-to-yolo-txt/</guid><description>&lt;h2 id="bounding-box-formats-comparison-and-conversion">Bounding box formats comparison and conversion&lt;/h2>
&lt;p>In COCO JSON, the format of a bounding box is:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-json" data-lang="json">&lt;span class="line">&lt;span class="cl">&lt;span class="s2">&amp;#34;bbox&amp;#34;&lt;/span>&lt;span class="err">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="err">&amp;lt;absolute_x_top_left&amp;gt;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="err">&amp;lt;absolute_y_top_left&amp;gt;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="err">&amp;lt;absolute_width&amp;gt;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="err">&amp;lt;absolute_height&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>However, the annotation is different in YOLO. For each &lt;code>.jpg&lt;/code> image, there&amp;rsquo;s a &lt;code>.txt&lt;/code> file (in the same directory and with the same name, but with &lt;code>.txt&lt;/code>-extension). This &lt;code>.txt&lt;/code> file holds the objects and their bounding boxes in this image (one line for each object), in the following format &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">&amp;lt;object-class&amp;gt; &amp;lt;relative_x_center&amp;gt; &amp;lt;relative_y_center&amp;gt; &amp;lt;relative_width&amp;gt; &amp;lt;relative_height&amp;gt;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>
&lt;p>&lt;code>&amp;lt;object-class&amp;gt;&lt;/code> : integer number of object from &lt;strong>&lt;code>0&lt;/code> to &lt;code>(classes-1)&lt;/code>&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>&amp;lt;relative_x_center&amp;gt; &amp;lt;relative_y_center&amp;gt; &amp;lt;relative_width&amp;gt; &amp;lt;relative_height&amp;gt;&lt;/code>&lt;/p>
&lt;p>float values relative to the width and height of the image, in the range (0.0, 1.0]&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>For example, for &lt;code>img1.jpg&lt;/code> there should be &lt;code>img1.txt&lt;/code> containing something like the following:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">1 0.716797 0.395833 0.216406 0.147222
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">0 0.687109 0.379167 0.255469 0.158333
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">2 0.420312 0.395833 0.140625 0.166667
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The following figure illustrates the difference of bounding box annotation between COCO and YOLO:&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/annotation-convertion-COCO-and-YOLO.png">&lt;figcaption>
&lt;h4>Bounding box format: COCO vs YOLO&lt;/h4>
&lt;/figcaption>
&lt;/figure>
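&lt;p>Such a YOLO label file can be parsed line by line; a minimal sketch (the function name is made up):&lt;/p>

```python
def parse_yolo_label_file(text):
    # Parse the contents of a YOLO .txt label file:
    # one "class x_center y_center width height" line per object.
    objects = []
    for line in text.strip().splitlines():
        parts = line.split()
        class_id = int(parts[0])
        x, y, w, h = (float(v) for v in parts[1:])
        objects.append((class_id, x, y, w, h))
    return objects

# Example using the first two lines of the img1.txt contents shown above:
labels = parse_yolo_label_file("1 0.716797 0.395833 0.216406 0.147222\n"
                               "0 0.687109 0.379167 0.255469 0.158333\n")
```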
&lt;p>Convert the bounding box annotation format from COCO to YOLO:
&lt;/p>
$$
\begin{array}{ll}
x\_{yolo} &amp;= (x\_{coco} + \frac{w\_{coco}}{2}) / w\_{img} \\\\
y\_{yolo} &amp;= (y\_{coco} + \frac{h\_{coco}}{2}) / h\_{img} \\\\
w\_{yolo} &amp;= w\_{coco} / w\_{img} \\\\
h\_{yolo} &amp;= h\_{coco} / h\_{img}
\end{array}
$$
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">convert_bbox_coco2yolo&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">img_width&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">img_height&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">bbox&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> Convert bounding box from COCO format to YOLO format
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> Parameters
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> ----------
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> img_width : int
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> width of image
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> img_height : int
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> height of image
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> bbox : list[int]
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> bounding box annotation in COCO format:
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> [top left x position, top left y position, width, height]
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> Returns
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> -------
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> list[float]
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> bounding box annotation in YOLO format:
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> [x_center_rel, y_center_rel, width_rel, height_rel]
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># YOLO bounding box format: [x_center, y_center, width, height]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># (float values relative to width and height of image)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x_tl&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y_tl&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">w&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">h&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">bbox&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dw&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mf">1.0&lt;/span> &lt;span class="o">/&lt;/span> &lt;span class="n">img_width&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dh&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mf">1.0&lt;/span> &lt;span class="o">/&lt;/span> &lt;span class="n">img_height&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x_center&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">x_tl&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">w&lt;/span> &lt;span class="o">/&lt;/span> &lt;span class="mf">2.0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">y_center&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">y_tl&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">h&lt;/span> &lt;span class="o">/&lt;/span> &lt;span class="mf">2.0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">x_center&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">dw&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">y&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">y_center&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">dh&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">w&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">w&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">dw&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">h&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">h&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">dh&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">w&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">h&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="convert-coco-json-to-yolo-txt">Convert COCO JSON to YOLO txt&lt;/h2>
&lt;p>The structure of training set in COCO format is:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">- train
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- _annotations.coco.json
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- img_001.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- img_002.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- img_003.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ...
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>_annotations.coco.json&lt;/code> contains all information about the dataset, images, and annotations. (More see: &lt;a href="https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/coco-dataset-format/">COCO JSON Format for Object Detection&lt;/a>)&lt;/p>
&lt;p>The structure of training set in YOLO format is:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">- train
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- _darknet.labels
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- img_001.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- img_001.txt
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- img_002.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- img_002.txt
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- img_003.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- img_003.txt
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ...
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>&lt;code>_darknet.labels&lt;/code> contains the object class names, one per line&lt;/li>
&lt;li>For each &lt;code>.jpg&lt;/code> image there&amp;rsquo;s a corresponding &lt;code>.txt&lt;/code> file with the same name&lt;/li>
&lt;/ul>
&lt;p>Now we create a &lt;code>.txt&lt;/code> file for each image based on &lt;code>_annotations.coco.json&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">os&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">json&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">tqdm&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">tqdm&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">shutil&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">make_folders&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;output&amp;#34;&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">exists&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">shutil&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">rmtree&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">makedirs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">path&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">convert_coco_json_to_yolo_txt&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">output_path&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">json_file&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">make_folders&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">output_path&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">with&lt;/span> &lt;span class="nb">open&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">json_file&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">json_data&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">json&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">load&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">f&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># write _darknet.labels, which holds names of all classes (one class per line)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">label_file&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">output_path&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;_darknet.labels&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">with&lt;/span> &lt;span class="nb">open&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">label_file&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;w&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">category&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">tqdm&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">json_data&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;categories&amp;#34;&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">desc&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;Categories&amp;#34;&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">category_name&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">category&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">f&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">write&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">category_name&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="k">for&lt;/span> &lt;span class="n">image&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">tqdm&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">json_data&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;images&amp;#34;&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">desc&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;Annotation txt for each image&amp;#34;&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">img_id&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">image&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">img_name&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">image&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;file_name&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">img_width&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">image&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;width&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">img_height&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">image&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;height&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">anno_in_image&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="n">anno&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">anno&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">json_data&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;annotations&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="k">if&lt;/span> &lt;span class="n">anno&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;image_id&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="n">img_id&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">anno_txt&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">output_path&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">img_name&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">split&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;.&amp;#34;&lt;/span>&lt;span class="p">)[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="s2">&amp;#34;.txt&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">with&lt;/span> &lt;span class="nb">open&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">anno_txt&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;w&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">anno&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">anno_in_image&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">category&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">anno&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;category_id&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">bbox_COCO&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">anno&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;bbox&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">w&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">h&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">convert_bbox_coco2yolo&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">img_width&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">img_height&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">bbox_COCO&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">f&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">write&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">category&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">.6f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">.6f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">w&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">.6f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">h&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">.6f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Converting COCO Json to YOLO txt finished!&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="example">Example&lt;/h3>
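&lt;p>The converter above delegates the coordinate math to &lt;code>convert_bbox_coco2yolo&lt;/code>, which is not shown in this excerpt. A minimal sketch of such a helper, assuming COCO&amp;rsquo;s &lt;code>[x_min, y_min, width, height]&lt;/code> absolute-pixel boxes and YOLO&amp;rsquo;s normalized &lt;code>[x_center, y_center, width, height]&lt;/code> format:&lt;/p>

```python
def convert_bbox_coco2yolo(img_width, img_height, bbox):
    """Convert a COCO bbox [x_min, y_min, width, height] (absolute pixels)
    into a YOLO bbox [x_center, y_center, width, height] (normalized to [0, 1])."""
    x_min, y_min, w, h = bbox
    x_center = (x_min + w / 2) / img_width   # box center x, relative to image width
    y_center = (y_min + h / 2) / img_height  # box center y, relative to image height
    return x_center, y_center, w / img_width, h / img_height
```

&lt;p>For the first annotation in the example below (bbox &lt;code>[45, 2, 85, 85]&lt;/code> in a 490x275 image) this yields &lt;code>0.178571 0.161818 0.173469 0.309091&lt;/code>, matching the first line of the generated &lt;code>0001.txt&lt;/code>.&lt;/p>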
&lt;p>Assume we have a COCO JSON file &lt;code>_annotations.coco.json&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-json" data-lang="json">&lt;span class="line">&lt;span class="cl">&lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;info&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;year&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;2020&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;version&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;1&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;description&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Exported from roboflow.ai&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;contributor&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Roboflow&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;url&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;https://app.roboflow.ai/datasets/hard-hat-sample/1&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;date_created&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;2000-01-01T00:00:00+00:00&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;licenses&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;url&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;https://creativecommons.org/publicdomain/zero/1.0/&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Public Domain&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;categories&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Workers&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;supercategory&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;none&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;head&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;supercategory&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Workers&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;helmet&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;supercategory&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Workers&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">3&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;person&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;supercategory&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Workers&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;images&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;license&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;file_name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;0001.jpg&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;height&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">275&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;width&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">490&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;date_captured&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;2020-07-20T19:39:26+00:00&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;annotations&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;image_id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;category_id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;bbox&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">45&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">85&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">85&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;area&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">7225&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;segmentation&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;iscrowd&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;image_id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;category_id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;bbox&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">324&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">29&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">72&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">81&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;area&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">5832&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;segmentation&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;iscrowd&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">convert_coco_json_to_yolo_txt&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;output&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;_annotations.coco.json&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">Categories: 100%|██████████| 4/4 [00:00&amp;lt;00:00, 2471.24it/s]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Annotation txt for each image: 100%|██████████| 1/1 [00:00&amp;lt;00:00, 1800.13it/s]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Converting COCO Json to YOLO txt finished!
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>A folder named &lt;code>output&lt;/code> is created with the following structure:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">- output
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- 0001.txt
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- _darknet.labels
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Content of &lt;code>_darknet.labels&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">Workers
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">head
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">helmet
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">person
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Content of &lt;code>0001.txt&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">2 0.178571 0.161818 0.173469 0.309091
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">2 0.734694 0.252727 0.146939 0.294545
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Instruction from YOLO v4 repo: &lt;a href="https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects">https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://github.com/AlexeyAB/Yolo_mark/issues/60#issuecomment-401854885">Specific format of annotation&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://www.cnblogs.com/hejunlin1992/p/9925293.html">darknet训练yolov3时的一些注意事项&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://manivannan-ai.medium.com/how-to-train-yolov2-to-detect-custom-objects-9010df784f36">How to train YOLOv2 to detect custom objects&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://roboflow.com/formats">Computer Vision Annotation Formats&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>Reference: &lt;a href="https://github.com/AlexeyAB/Yolo_mark/issues/60">https://github.com/AlexeyAB/Yolo_mark/issues/60&lt;/a>&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item></channel></rss>