Overview of Region-based Object Detectors

Sliding-window detectors

A brute-force approach for object detection is to slide windows from left to right and from top to bottom, and to run a classifier on each window to identify objects. To detect different object types at various viewing distances, we use windows of varied sizes and aspect ratios.


We cut out patches from the picture according to the sliding windows. The patches are warped, since many classifiers take fixed-size images only.

Warping an image patch to a fixed-size image

The warped image patch is fed into a CNN to extract a 4096-dimensional feature vector. Then we apply an SVM classifier to identify the class and a separate linear regressor for the bounding box.
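To make this two-stage idea concrete, here is a minimal sketch assuming the 4096-dimensional features have already been extracted; the arrays below are random placeholders, not real CNN outputs:

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import Ridge

# Placeholders standing in for real CNN features and annotations.
features = np.random.randn(1000, 4096)        # one 4096-d CNN feature per patch
labels = np.random.randint(0, 3, size=1000)   # object class per patch
boxes = np.random.randn(1000, 4)              # bounding-box targets per patch

svm = LinearSVC().fit(features, labels)       # identify the class of each patch
box_reg = Ridge().fit(features, boxes)        # refine the bounding box

pred_class = svm.predict(features[:1])
pred_box = box_reg.predict(features[:1])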

System flow:


Pseudo-code:

for window in windows:
    patch = get_patch(image, window)
    results = detector(patch)
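For concreteness, here is a minimal sketch of the window generation; the scales, aspect ratios, and stride are illustrative choices, not values from any particular detector:

import itertools

def make_windows(img_w, img_h, scales=(64, 128, 256),
                 ratios=(0.5, 1.0, 2.0), stride=32):
    """Yield (x, y, w, h) windows of varied sizes and aspect ratios."""
    for scale, ratio in itertools.product(scales, ratios):
        w = int(scale * ratio ** 0.5)   # wider for larger ratios
        h = int(scale / ratio ** 0.5)   # taller for smaller ratios
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                yield x, y, w, h

windows = list(make_windows(640, 480))
print(len(windows))   # well over a thousand windows, even at a coarse stride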

We create many windows to detect different object shapes at different locations. To improve performance, one obvious solution is to reduce the number of windows.

Instead of this brute-force approach, we use a region proposal method to create regions of interest (ROIs) for object detection.

In selective search (SS):

  1. We start with each individual pixel as its own group.
  2. We calculate the texture for each group and merge the two groups that are closest (to avoid a single region gobbling up the others, we prefer merging smaller groups first).
  3. We continue merging regions until everything is combined together (a toy sketch of this loop follows the figure below).

The figure below illustrates this process:

In the first row, we show how we grow the regions; the blue rectangles in the second row show all possible ROIs made during the merging.
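A toy sketch of this greedy merging loop, with a made-up similarity function standing in for the real color/texture/size measures:

import itertools

def bounding_box(pixels):
    xs = [x for x, y in pixels]
    ys = [y for x, y in pixels]
    return min(xs), min(ys), max(xs), max(ys)

def selective_search(groups, similarity):
    """groups: list of pixel sets; similarity(a, b): higher = more alike."""
    rois = [bounding_box(g) for g in groups]
    while len(groups) > 1:
        # Merge the most similar pair at each step.
        a, b = max(itertools.combinations(groups, 2),
                   key=lambda pair: similarity(*pair))
        groups.remove(a)
        groups.remove(b)
        groups.append(a | b)              # union of the two pixel sets
        rois.append(bounding_box(a | b))  # every merge yields a candidate ROI
    return rois

# Toy usage: 4 single-pixel groups; closeness plus a size penalty, so
# smaller groups merge first.
def similarity(a, b):
    dist = min(abs(ax - bx) + abs(ay - by)
               for ax, ay in a for bx, by in b)
    return -dist - 0.1 * (len(a) + len(b))

groups = [{(0, 0)}, {(0, 1)}, {(5, 5)}, {(5, 6)}]
print(selective_search(groups, similarity))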

R-CNN [1]

Region-based Convolutional Neural Networks (R-CNN):

  1. Uses a region proposal method to create about 2,000 ROIs (regions of interest).
  2. Warps the regions into fixed-size images and feeds each one into a CNN individually.
  3. Uses fully connected layers to classify the object and to refine the bounding box.
R-CNN uses **region proposals**, a **CNN**, and **FC layers** to locate objects.


System flow:


Pseudo-code:

ROIs = region_proposal(image) # RoI from a proposal method (~2k)
for ROI in ROIs:
    patch = get_patch(image, ROI)
    results = detector(patch)

With far fewer but higher-quality ROIs, R-CNN runs faster and is more accurate than sliding windows. However, R-CNN is still very slow, because it needs to do about 2,000 independent forward passes for each image! 🤪

Fast R-CNN [2]

How does Fast R-CNN work?

  • Instead of extracting features for each image patch from scratch, we use a feature extractor (a CNN) to extract features for the whole image first.
  • We also use an external region proposal method, like selective search, to create ROIs, which are later combined with the corresponding feature maps to form patches for object detection.
  • We warp the patches to a fixed size using ROI pooling and feed them to fully connected layers for classification and localization (detecting the location of the object).
Fast R-CNN applies region proposals **on the feature maps** and forms fixed-size patches using **ROI pooling**.
Fast R-CNN vs. R-CNN

System flow:


Pseudo-code:

feature_maps = process(image)
ROIs = region_proposal(image)
for ROI in ROIs:
    patch = roi_pooling(feature_maps, ROI)
    results = detector2(patch)

The expensive feature extraction moves out of the for-loop. This is a significant speed improvement, since it was previously executed for all 2,000 ROIs. 👏

One major takeaway of Fast R-CNN is that the whole network (the feature extractor, the classifier, and the bounding-box regressor) is trained end-to-end with a multi-task loss (classification loss plus localization loss). This improves accuracy.
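A minimal NumPy sketch of such a multi-task loss for a single ROI; the smooth-L1 form follows the Fast R-CNN paper, while lam and the labels are illustrative (class 0 is assumed to be the background):

import numpy as np

def smooth_l1(x):
    """Smooth L1: quadratic near 0, linear elsewhere (robust to outliers)."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def multi_task_loss(class_probs, true_class, pred_box, true_box, lam=1.0):
    cls_loss = -np.log(class_probs[true_class])       # classification loss
    loc_loss = smooth_l1(pred_box - true_box).sum()   # localization loss
    is_object = true_class > 0                        # no box loss for background
    return cls_loss + lam * is_object * loc_loss

probs = np.array([0.1, 0.7, 0.2])   # softmax output over 3 classes
loss = multi_task_loss(probs, 1, np.zeros(4), np.array([0.1, 0.0, 0.0, 0.2]))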


ROI pooling

Because Fast R-CNN uses fully connected layers, we apply ROI pooling to warp the variable-size ROIs into a predefined fixed-size shape.

Let’s take a look at a simple example: transforming 8 × 8 feature maps into a predefined 2 × 2 shape.

  • We start with the feature maps and overlap the proposed ROI (blue) with them.

  • We split the ROI into the target dimension. For example, with our 2 × 2 target, we split the ROI into 4 sections of similar or equal size.

  • We find the maximum for each section (i.e., max-pool within each section), and the result is our warped feature map.

Now we get a 2 × 2 feature patch that we can feed into the classifier and box regressor.
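A minimal NumPy sketch of this max-pool-per-section operation, assuming the ROI is already expressed in feature-map coordinates (real implementations handle quantization and uneven splits more carefully):

import numpy as np

def roi_pool(feature_map, roi, out_size=2):
    """Max-pool an ROI (x0, y0, x1, y1) of a 2-D map into out_size x out_size."""
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    # Split the ROI into a grid of roughly equal sections.
    ys = np.linspace(0, h, out_size + 1, dtype=int)
    xs = np.linspace(0, w, out_size + 1, dtype=int)
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

fmap = np.arange(64).reshape(8, 8)   # the 8 x 8 feature map from the example
print(roi_pool(fmap, (0, 3, 7, 8)))  # a 5 x 7 ROI -> 2 x 2 warped patch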


Problems of Fast R-CNN

Fast R-CNN depends on an external region proposal method like selective search. However, those algorithms run on the CPU and are slow. In testing, Fast R-CNN takes 2.3 seconds to make a prediction, of which 2 seconds are spent generating the 2,000 ROIs!

feature_maps = process(image)
ROIs = region_proposal(image) # Expensive!
for ROI in ROIs:
    patch = roi_pooling(feature_maps, ROI)
    results = detector2(patch)

Faster R-CNN [3]: Make the CNN do proposals

Faster R-CNN adopts a similar design to Fast R-CNN, except that

  • it replaces the external region proposal method with an internal deep network called the Region Proposal Network (RPN), and
  • the ROIs are derived from the feature maps instead.

System flow: (same as Fast R-CNN)


The network flow is similar, but the region proposal step is now handled by an internal convolutional network, the Region Proposal Network (RPN).

The external region proposal is replaced by an internal Region Proposal Network (RPN).

Pseudo-code:

feature_maps = process(image)
ROIs = region_proposal(feature_maps) # use RPN
for ROI in ROIs:
    patch = roi_pooling(feature_maps, ROI)
    class_scores, box = detector(patch)
    class_probabilities = softmax(class_scores)

Region proposal network (RPN)

The region proposal network (RPN)

  • takes the output feature maps from the first convolutional network as input

  • slides 3 × 3 filters over the feature maps to make class-agnostic region proposals, using a convolutional network such as the ZF network

    ZF network

    Other deep networks like VGG or ResNet can be used for more comprehensive feature extraction at the cost of speed.

  • The ZF network outputs 256 values, which are fed into 2 separate fully connected (FC) layers to predict a bounding box and 2 objectness scores.

    • The objectness score measures whether the box contains an object. We could use a regressor to compute a single objectness score, but for simplicity Faster R-CNN uses a classifier with 2 possible classes: one for the “have an object” category and one without (i.e. the background class).
  • For each location in the feature maps, RPN makes $k$ guesses

    $\Rightarrow$ the RPN outputs $4 \times k$ coordinates (top-left and bottom-right $(x, y)$ coordinates) for the bounding boxes and $2 \times k$ objectness scores (with vs. without an object) per location

    • Example: with $8 \times 8$ feature maps, a $3 \times 3$ filter, and $k = 3$, the RPN outputs a total of $8 \times 8 \times 3$ ROIs


      • Here we get 3 guesses per location, and we will refine them later. Since we just need one to be correct, we are better off if our initial guesses have different shapes and sizes.

        Therefore, Faster R-CNN does not make random bounding-box proposals. Instead, it predicts offsets like $\delta_x, \delta_y$ relative to the top-left corner of some reference boxes called anchors. We constrain the values of those offsets so our guesses still resemble the anchors.


      • To make $k$ predictions per location, we need $k$ anchors centered at each location. Each prediction is associated with a specific anchor but different locations share the same anchor shapes.


      • Those anchors are carefully pre-selected so they are diverse and cover real-life objects at different scales and aspect ratios reasonably well.

        • This guides the initial training with better guesses and allows each prediction to specialize in a certain shape. This strategy makes early training more stable and easier. 👍
  • Faster R-CNN uses far more anchors: it deploys 9 anchor boxes, 3 different scales at 3 different aspect ratios. Using 9 anchors per location, it generates 2 × 9 objectness scores and 4 × 9 coordinates per location (a sketch follows below).


Anchors are also called priors or default bounding boxes in different papers.
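A minimal sketch of generating the 9 anchor shapes (3 scales × 3 aspect ratios, as in the Faster R-CNN paper; treat the exact values as illustrative):

import numpy as np

def make_anchor_shapes(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return the 9 (w, h) anchor shapes shared by every feature-map location."""
    shapes = []
    for s in scales:
        for r in ratios:
            shapes.append((s * np.sqrt(r), s / np.sqrt(r)))
    return np.array(shapes)

anchors = make_anchor_shapes()   # shape (9, 2)
# Per feature-map location, the RPN then predicts:
#   2 * 9 = 18 objectness scores and 4 * 9 = 36 box coordinates.
print(anchors.shape, 2 * len(anchors), 4 * len(anchors))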
There is a nice worked example and explanation in the Stanford CS231n slides.


Region-based Fully Convolutional Networks (R-FCN) [4]

💡 Idea

Let’s assume we only have a feature map that detects the right eye of a face. Can we use it to locate the face? We should be able to: since the right eye should be in the top-left corner of a facial picture, we can use that to locate the face.


If we have other feature maps specialized in detecting the left eye, the nose, or the mouth, we can combine the results to locate the face better.

Problem of Faster R-CNN

In Faster R-CNN, the detector applies multiple fully connected layers to make predictions. With 2,000 ROIs, it can be expensive.

feature_maps = process(image)
ROIs = region_proposal(feature_maps)
for ROI in ROIs:
    patch = roi_pooling(feature_maps, ROI)
    class_scores, box = detector(patch) # Expensive!
    class_probabilities = softmax(class_scores)

R-FCN: reduce the amount of work needed for each ROI

R-FCN improves speed by reducing the amount of work needed for each ROI. The position-sensitive score maps introduced below are independent of any particular ROI and can be computed once for the whole image. The remaining per-ROI work is then much simpler, and therefore R-FCN is faster than Faster R-CNN.

Pseudo-code:

feature_maps = process(image)
ROIs = region_proposal(feature_maps)         
score_maps = compute_score_map(feature_maps)
for ROI in ROIs:
    V = region_roi_pool(score_maps, ROI)     
    class_scores, box = average(V)                   # Much simpler!
    class_probabilities = softmax(class_scores)

Position-sensitive score mapping

Let’s consider a 5 × 5 feature map M with a blue square object inside. We divide the square object equally into 3 × 3 regions.

Now, we create a new feature map from M to detect only the top-left (TL) corner of the square. In this new map, only the yellow grid cell at [2, 2] is activated.

Create a new feature map from the original one to detect the top-left corner of an object.

Since we divide the square into 9 parts, we can create 9 feature maps each detecting the corresponding region of the object. These feature maps are called position-sensitive score maps because each map detects (scores) a sub-region of the object.


Let’s say the dotted red rectangle below is the ROI proposed. We divide it into 3 × 3 regions and ask how likely each region contains the corresponding part of the object.

For example, how likely is it that the top-left region of the ROI contains the left eye? We store the results in a 3 × 3 vote array. For example, vote_array[0][0] contains the score for whether we found the top-left region of the square object.

Apply the ROI onto the score maps to output a 3 × 3 vote array.

This process of mapping score maps and ROIs to the vote array is called position-sensitive ROI pooling.

Overlay a portion of the ROI onto the corresponding score map to calculate `V[i][j]`.

After calculating all the values for the position-sensitive ROI pool, the class score is the average of all its elements.
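A minimal NumPy sketch of position-sensitive ROI pooling for a single class with k = 3; each vote V[i][j] is pooled only from score map (i, j), and the class score is the average of the 9 votes:

import numpy as np

def ps_roi_pool(score_maps, roi, k=3):
    """score_maps: (k*k, H, W); roi: (x0, y0, x1, y1). Returns a class score."""
    x0, y0, x1, y1 = roi
    ys = np.linspace(y0, y1, k + 1, dtype=int)
    xs = np.linspace(x0, x1, k + 1, dtype=int)
    V = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            # Vote V[i][j] is pooled only from the (i, j)-th score map.
            m = score_maps[i * k + j]
            V[i, j] = m[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean()
    return V.mean()   # class score = average of all the votes

maps = np.random.rand(9, 5, 5)        # 3 x 3 score maps over a 5 x 5 grid
print(ps_roi_pool(maps, (0, 0, 5, 5)))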


Data flow

  • Let’s say we have $C$ classes to detect.

  • We expand it to $C + 1$ classes so we include a new class for the background (non-object). Each class will have its own $3 \times 3$ score maps and therefore a total of $(C+1) \times 3 \times 3$ score maps.

  • Using its own set of score maps, we predict a class score for each class.

  • Then we apply a softmax on those scores to compute the probability for each class (a quick shape check follows below).
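A quick shape check under these definitions (C = 20 object classes is an arbitrary example):

import numpy as np

C, k = 20, 3
num_score_maps = (C + 1) * k * k        # 21 classes x 9 position maps = 189
votes = np.random.rand(C + 1, k, k)     # one k x k vote array per class
class_scores = votes.mean(axis=(1, 2))  # average votes -> (C + 1,) scores
probs = np.exp(class_scores) / np.exp(class_scores).sum()   # softmax
print(num_score_maps, probs.sum())      # 189 score maps; probabilities sum to 1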

Data flow of R-FCN ($k=3$)

References


  1. Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 580–587. https://doi.org/10.1109/CVPR.2014.81

  2. Girshick, R. (2015). Fast R-CNN. Proceedings of the IEEE International Conference on Computer Vision (ICCV 2015), 1440–1448. https://doi.org/10.1109/ICCV.2015.169

  3. Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031

  4. Dai, J., Li, Y., He, K., & Sun, J. (2016). R-FCN: Object detection via region-based fully convolutional networks. Advances in Neural Information Processing Systems, 379–387.
