<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Region-Based Detector | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/region-based-detector/</link><atom:link href="https://haobin-tan.netlify.app/tags/region-based-detector/index.xml" rel="self" type="application/rss+xml"/><description>Region-Based Detector</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sat, 20 Feb 2021 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>Region-Based Detector</title><link>https://haobin-tan.netlify.app/tags/region-based-detector/</link></image><item><title>Overview of Region-based Object Detectors</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/overview-region-based-detectors/</link><pubDate>Sat, 20 Feb 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/overview-region-based-detectors/</guid><description>&lt;h2 id="sliding-window-detectors">Sliding-window detectors&lt;/h2>
&lt;p>A brute-force approach for object detection is to &lt;strong>slide windows from left to right and from top to bottom&lt;/strong>, identifying objects with a classifier. To detect different object types at various viewing distances, we use windows of varied sizes and aspect ratios.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*-GaZ8hGBKsbtGfRJqvOVHQ.jpeg" alt="Image for post">&lt;/p>
&lt;p>We cut out patches from the picture according to the sliding windows. The patches are warped because many classifiers only take fixed-size images.&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*A7DE4HKukbXpQqwvCaLOEQ.jpeg">&lt;figcaption>
&lt;h4>Warp an image to a fixed size image&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>The warped image patch is fed into a CNN to extract 4096 features. Then we apply an SVM classifier to identify the class and a separate linear regressor to refine the bounding box.&lt;/p>
&lt;p>System flow:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*BYSA3iip3Cdr0L_x5r468A.png" alt="Image for post">&lt;/p>
&lt;p>Pseudo-code:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">window&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">windows&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">patch&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">get_patch&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">window&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">results&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">detector&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">patch&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>We create many windows to detect different object shapes at different locations. To improve performance, one obvious solution is to &lt;strong>reduce the number of &lt;em>windows&lt;/em>&lt;/strong>.&lt;/p>
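&lt;p>&lt;em>As a toy illustration of the pseudo-code above, here is a minimal pure-Python sketch that enumerates the sliding windows (the window sizes, stride, and the &lt;code>sliding_windows&lt;/code> helper are illustrative, not from any detector library):&lt;/em>&lt;/p>

```python
def sliding_windows(img_w, img_h, sizes, stride):
    """Return (x, y, w, h) boxes covering the image at several window shapes."""
    windows = []
    for w, h in sizes:
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                windows.append((x, y, w, h))
    return windows

# Scan a 64x64 image with two window shapes and a stride of 16 pixels.
wins = sliding_windows(64, 64, sizes=[(32, 32), (32, 16)], stride=16)
print(len(wins))  # 21 patches that the classifier must evaluate
```

&lt;p>Even this tiny example shows why the approach is brute force: the window count grows with every extra scale, aspect ratio, and stride reduction.&lt;/p>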
&lt;h2 id="selective-search">Selective Search&lt;/h2>
&lt;p>Instead of a brute force approach, we use a region proposal method to create &lt;strong>regions of interest (ROIs)&lt;/strong> for object detection.&lt;/p>
&lt;p>In &lt;strong>selective search&lt;/strong> (&lt;strong>SS&lt;/strong>):&lt;/p>
&lt;ol>
&lt;li>We start with each individual pixel as its own group&lt;/li>
&lt;li>We calculate the texture for each group and combine the two that are closest (to keep a single region from gobbling up all the others, we prefer merging smaller groups first).&lt;/li>
&lt;li>We continue merging regions until everything is combined together.&lt;/li>
&lt;/ol>
&lt;p>The figure below illustrates this process:&lt;/p>
&lt;figure>&lt;img src="https://miro.medium.com/max/700/1*_8BNWWwyod1LWUdzcAUr8w.png">&lt;figcaption>
&lt;h4>The first row shows how the regions grow, and the blue rectangles in the second row show all the candidate ROIs produced during the merging.&lt;/h4>
&lt;/figcaption>
&lt;/figure>
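&lt;p>&lt;em>A heavily simplified sketch of the grouping idea (not the real selective search algorithm, which compares color, texture, size, and shape): regions carry a single scalar &amp;ldquo;texture&amp;rdquo;, the closest pair is merged repeatedly, and every intermediate bounding box becomes a candidate ROI.&lt;/em>&lt;/p>

```python
def greedy_merge(regions):
    """regions: list of (texture, size, box) with box = (x0, y0, x1, y1).
    Repeatedly merge the pair with the most similar texture, preferring
    smaller combined size; every merged box becomes a candidate ROI."""
    regions = list(regions)
    rois = []
    while len(regions) > 1:
        pairs = [(i, j) for i in range(len(regions))
                 for j in range(i + 1, len(regions))]
        i, j = min(pairs, key=lambda p: (
            abs(regions[p[0]][0] - regions[p[1]][0]),   # texture distance
            regions[p[0]][1] + regions[p[1]][1]))       # prefer small groups
        (t1, s1, b1), (t2, s2, b2) = regions[i], regions[j]
        merged = ((t1 * s1 + t2 * s2) / (s1 + s2),      # size-weighted texture
                  s1 + s2,
                  (min(b1[0], b2[0]), min(b1[1], b2[1]),
                   max(b1[2], b2[2]), max(b1[3], b2[3])))
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        rois.append(merged[2])
    return rois

# Three toy regions: (texture, size, (x0, y0, x1, y1))
rois = greedy_merge([(0.1, 4, (0, 0, 2, 2)),
                     (0.2, 4, (2, 0, 4, 2)),
                     (0.9, 8, (0, 2, 4, 4))])
print(rois)  # [(0, 0, 4, 2), (0, 0, 4, 4)]
```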
&lt;h2 id="r-cnn-1">R-CNN &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h2>
&lt;p>&lt;strong>Region-based Convolutional Neural Networks (R-CNN)&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>Uses a region proposal method to create about 2000 &lt;strong>ROI&lt;/strong>s (regions of interest).&lt;/li>
&lt;li>The regions are warped into fixed-size images and fed into a CNN individually.&lt;/li>
&lt;li>Uses fully connected layers to classify the object and to refine the bounding box.&lt;/li>
&lt;/ol>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*Wmw21tBUez37bj-1ws7XEw.jpeg">&lt;figcaption>
&lt;h4>R-CNN uses &lt;strong>region proposals&lt;/strong>, a &lt;strong>CNN&lt;/strong>, and &lt;strong>FC layers&lt;/strong> to locate objects.&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-21%2022.08.28.png" alt="截屏2021-02-21 22.08.28">&lt;/p>
&lt;p>&lt;strong>System flow&lt;/strong>:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*ciyhZpgEvxDm1YxZd1SJWg.png" alt="Image for post">&lt;/p>
&lt;p>&lt;strong>Pseudo-code&lt;/strong>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">ROIs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">region_proposal&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># RoI from a proposal method (~2k)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">ROI&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">ROIs&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">patch&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">get_patch&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ROI&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">results&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">detector&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">patch&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>With far fewer but higher-quality ROIs, R-CNN runs faster and is more accurate than sliding windows. However, R-CNN is still very slow, because it needs to run about 2000 independent forward passes for each image! 🤪&lt;/p>
&lt;h2 id="fast-r-cnn-2">Fast R-CNN &lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/h2>
&lt;p>How does Fast R-CNN work?&lt;/p>
&lt;ul>
&lt;li>Instead of extracting features for each image patch from scratch, we use a &lt;strong>feature extractor&lt;/strong> (a CNN) to extract features for the whole image first.&lt;/li>
&lt;li>We also use an &lt;strong>external region proposal method&lt;/strong>, like the selective search, to create ROIs which later combine with the corresponding feature maps to form patches for object detection.&lt;/li>
&lt;li>We warp the patches to a fixed size using &lt;strong>ROI pooling&lt;/strong> and feed them to fully connected layers for classification and &lt;strong>localization&lt;/strong> (detecting the location of the object).&lt;/li>
&lt;/ul>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*Dd3-sugNKInTIv12u8cWkw.jpeg">&lt;figcaption>
&lt;h4>Fast R-CNN applies region proposals &lt;strong>on the feature maps&lt;/strong> and forms fixed-size patches using &lt;strong>ROI pooling&lt;/strong>.&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-21%2022.40.17.png">&lt;figcaption>
&lt;h4>Fast R-CNN vs. R-CNN &lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>&lt;strong>System flow&lt;/strong>:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*fLMNHfe_QFxW569s4eR7Dg.jpeg" alt="Image for post">&lt;/p>
&lt;p>&lt;strong>Pseudo-code&lt;/strong>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">feature_maps&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">process&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ROIs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">region_proposal&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">ROI&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">ROIs&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">patch&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">roi_pooling&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">feature_maps&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ROI&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">results&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">detector2&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">patch&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>The expensive feature extraction is moved out of the for-loop. This is a significant speed improvement, since it was previously executed for all 2000 ROIs. &amp;#x1f44f;&lt;/li>
&lt;/ul>
&lt;p>One major takeaway for Fast R-CNN is that the whole network (the feature extractor, the classifier, and the bounding-box regressor) is trained end-to-end with &lt;strong>multi-task losses&lt;/strong> (classification loss and localization loss). This improves accuracy.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/20180502185247910.png" alt="img">&lt;/p>
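&lt;p>&lt;em>A minimal sketch of such a multi-task loss, assuming the common choices of cross-entropy for classification and smooth L1 for localization (the function names and toy numbers below are illustrative, not Fast R-CNN&amp;rsquo;s actual code):&lt;/em>&lt;/p>

```python
import math

def smooth_l1(x):
    """Smooth L1 (Huber-like) loss, commonly used for box regression."""
    if abs(x) >= 1.0:
        return abs(x) - 0.5
    return 0.5 * x * x

def multi_task_loss(class_probs, true_class, pred_box, true_box, lam=1.0):
    """L = L_cls + lambda * L_loc; the localization term is only applied
    to foreground ROIs (class 0 is treated as the background class)."""
    l_cls = -math.log(class_probs[true_class])
    l_loc = 0.0
    if true_class != 0:  # background ROIs have no box target
        l_loc = sum(smooth_l1(p - t) for p, t in zip(pred_box, true_box))
    return l_cls + lam * l_loc

# A foreground ROI: confident correct class, box off by 0.5 in one coordinate.
loss = multi_task_loss([0.1, 0.9], 1, (0.5, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0))
```

&lt;p>Because both terms feed one combined loss, the same backward pass tunes the features for classification and localization at once.&lt;/p>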
&lt;h3 id="roi-pooling">ROI pooling&lt;/h3>
&lt;p>Because Fast R-CNN uses fully connected layers, we apply &lt;strong>ROI pooling&lt;/strong> to warp the variable-size ROIs into a predefined fixed-size shape.&lt;/p>
&lt;p>&lt;em>Let&amp;rsquo;s take a look at a simple example: transforming 8 × 8 feature maps into a predefined 2 × 2 shape.&lt;/em>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*LLP4tKGsYGgAx3uPfmGdsw.png" alt="Image for post" style="zoom:80%;" />
&lt;ul>
&lt;li>
&lt;p>Top left: feature maps&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Top right: we overlap the ROI (blue) with the feature maps.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Bottom left: we split the ROI into the target dimensions. For example, with our 2×2 target, we split the ROI into 4 sections of similar or equal size.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Bottom right: find the &lt;strong>maximum&lt;/strong> for each section (i.e., max-pool within each section); the result is our warped feature maps.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Now we get a 2 × 2 feature patch that we can feed into the classifier and box regressor.&lt;/p>
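&lt;p>&lt;em>The steps above can be sketched in a few lines of pure Python (a simplified version operating on a single-channel feature map with integer ROI boundaries; real implementations handle many channels and fractional bins):&lt;/em>&lt;/p>

```python
def roi_pool(feature_map, roi, out_size=2):
    """Max-pool one ROI of a 2-D feature map into an out_size x out_size grid.
    roi = (x0, y0, x1, y1), end-exclusive, in feature-map coordinates."""
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_size):
        row = []
        ys = y0 + i * h // out_size          # section row bounds
        ye = y0 + (i + 1) * h // out_size
        for j in range(out_size):
            xs = x0 + j * w // out_size      # section column bounds
            xe = x0 + (j + 1) * w // out_size
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled

# 4x4 toy feature map; the ROI covers it all and is pooled down to 2x2.
fm = [[1, 2, 3, 4],
      [5, 6, 7, 8],
      [9, 10, 11, 12],
      [13, 14, 15, 16]]
print(roi_pool(fm, (0, 0, 4, 4)))  # [[6, 8], [14, 16]]
```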
&lt;p>&lt;em>Another gif example&lt;/em>:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*5V5mycIRNu-mK-rPywL57w.gif" alt="Image for post">&lt;/p>
&lt;h3 id="problems-of-fast-r-cnn">Problems of Fast R-CNN&lt;/h3>
&lt;p>Fast R-CNN depends on an external region proposal method like selective search. &lt;strong>However, those algorithms run on the CPU and are slow.&lt;/strong> At test time, Fast R-CNN takes 2.3 seconds to make a prediction, of which 2 seconds are spent generating 2000 ROIs!&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">feature_maps&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">process&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ROIs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">region_proposal&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># Expensive!&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">ROI&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">ROIs&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">patch&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">roi_pooling&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">feature_maps&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ROI&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">results&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">detector2&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">patch&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="faster-r-cnn-3-make-cnn-do-proposals">&lt;strong>Faster R-CNN&lt;/strong> &lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>: Make CNN do proposals&lt;/h2>
&lt;p>Faster R-CNN adopts a similar design to Fast R-CNN, &lt;strong>except that&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>it replaces the external region proposal method with an internal deep network called the Region Proposal Network (RPN), and&lt;/strong>&lt;/li>
&lt;li>&lt;strong>the ROIs are derived from the feature maps instead&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;p>System flow: (same as Fast R-CNN)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*F-WbcUMpWSE1tdKRgew2Ug.png" alt="Image for post">&lt;/p>
&lt;p>The network flow is similar, but the region proposal is now replaced by an internal convolutional network, the Region Proposal Network (RPN).&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*0cxB2pAxQ0A7AhTl-YT2JQ.jpeg">&lt;figcaption>
&lt;h4>The external region proposal is replaced by an internal Region Proposal Network (RPN).&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*JQfhkHK6V8NRuh-97Pg4lQ.png" alt="Image for post" style="zoom:80%;" />
&lt;p>&lt;strong>Pseudo-code&lt;/strong>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">feature_maps&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">process&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ROIs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">region_proposal&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">feature_maps&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># use RPN&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">ROI&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">ROIs&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">patch&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">roi_pooling&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">feature_maps&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ROI&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">class_scores&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">box&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">detector&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">patch&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">class_probabilities&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">softmax&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">class_scores&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="region-proposal-network-rpn">Region proposal network (RPN)&lt;/h3>
&lt;p>The region proposal network (&lt;strong>RPN&lt;/strong>)&lt;/p>
&lt;ul>
&lt;li>
&lt;p>takes the output feature maps from the first convolutional network as input&lt;/p>
&lt;/li>
&lt;li>
&lt;p>slides 3 × 3 filters over the feature maps to make class-agnostic region proposals, using a convolutional network like the ZF network&lt;/p>
&lt;figure>&lt;img src="https://miro.medium.com/max/1000/1*z0OHn89t0bOIHwoIOwNDtg.jpeg">&lt;figcaption>
&lt;h4>ZF network&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>Other deep networks like VGG or ResNet can be used for more comprehensive feature extraction, at the cost of speed.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The ZF network outputs 256 values, which are fed into 2 separate fully connected (FC) layers to predict a bounding box and 2 objectness scores.&lt;/p>
&lt;ul>
&lt;li>The &lt;strong>objectness&lt;/strong> score measures whether the box contains an object. We could use a regressor to compute a single objectness score, but for simplicity, Faster R-CNN uses a classifier with 2 possible classes: one for the “have an object” category and one for the background class.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>For each location in the feature maps, RPN makes &lt;strong>$k$&lt;/strong> guesses&lt;/p>
&lt;p>$\Rightarrow$ RPN outputs $4 \times k$ coordinates (top-left and bottom-right $(x, y)$ coordinates) for the bounding boxes and $2 \times k$ objectness scores (with vs. without object) per location&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Example: $8 \times 8$ feature maps with a $3 \times 3$ filter, and it outputs a total of $8 \times 8 \times 3$ ROIs (for $k = 3$)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*smu6PiCx4LaPwGIo3HG0GQ.jpeg" alt="Image for post">&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Here we get 3 guesses, which we will refine later. Since we just need one to be correct, we are better off if our initial guesses have different shapes and sizes.&lt;/p>
&lt;p>Therefore, Faster R-CNN does not make random bounding-box proposals. Instead, it predicts offsets like $\delta\_x, \delta\_y$ relative to the top-left corner of some reference boxes called &lt;strong>anchors&lt;/strong>. We constrain the values of those offsets so our guesses still resemble the anchors.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*yF_FrZAkXA3XKFA-sf7XZw.png" alt="Image for post">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>To make $k$ predictions per location, we need $k$ anchors centered at each location. Each prediction is associated with a specific anchor but different locations share the &lt;strong>same&lt;/strong> anchor shapes.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*RJoauxGwUTF17ZANQmL8jw.png" alt="Image for post">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Those anchors are carefully pre-selected so they are diverse and cover real-life objects at different scales and aspect ratios reasonably well.&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>This guides the initial training with better guesses and allows each prediction to specialize in a certain shape. This strategy makes early training more stable and easier. 👍&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Faster R-CNN uses far more anchors. It deploys 9 anchor boxes: &lt;strong>3 different scales at 3 different aspect ratios.&lt;/strong> Using 9 anchors per location, it generates 2 × 9 objectness scores and 4 × 9 coordinates per location.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*PszFnq3rqa_CAhBrI94Eeg.png" alt="Image for post">&lt;/p>
&lt;/li>
&lt;/ul>
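&lt;p>&lt;em>The output bookkeeping above is easy to check with a short sketch (shape arithmetic only; a real RPN head is a 3 × 3 convolution followed by two small 1 × 1 convolutional heads):&lt;/em>&lt;/p>

```python
def rpn_output_shapes(feat_h, feat_w, k):
    """For k anchors per location, the RPN head emits 2*k objectness
    scores and 4*k box coordinates at every feature-map location."""
    num_proposals = feat_h * feat_w * k
    scores_shape = (feat_h, feat_w, 2 * k)
    coords_shape = (feat_h, feat_w, 4 * k)
    return num_proposals, scores_shape, coords_shape

# The 8x8 feature map with k = 3 from the example above.
n, s, c = rpn_output_shapes(8, 8, 3)
print(n, s, c)  # 192 (8, 8, 6) (8, 8, 12)
```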
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">&lt;strong>Anchors&lt;/strong> are also called &lt;strong>priors&lt;/strong> or &lt;strong>default boundary boxes&lt;/strong> in different papers.&lt;/span>
&lt;/div>
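&lt;p>&lt;em>A sketch of how 9 anchor shapes can be generated from 3 scales and 3 aspect ratios (the base size and exact scale values here are illustrative; implementations differ):&lt;/em>&lt;/p>

```python
def make_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Generate k = len(scales) * len(ratios) anchor boxes centered at
    (0, 0) as (x0, y0, x1, y1); every location reuses the same shapes."""
    anchors = []
    for scale in scales:
        area = float(base_size * scale) ** 2
        for ratio in ratios:            # ratio = height / width
            w = round((area / ratio) ** 0.5)
            h = round(w * ratio)
            anchors.append((-w // 2, -h // 2, w // 2, h // 2))
    return anchors

anchors = make_anchors()
print(len(anchors))  # 9 anchors: 3 scales x 3 aspect ratios
```

&lt;p>At run time each anchor is translated to every feature-map location, and the RPN predicts one offset-and-score pair per translated anchor.&lt;/p>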
&lt;details>
&lt;summary>Nice example and explanation from Stanford cs231n slide&lt;/summary>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-22%2022.04.55.png" alt="截屏2021-02-22 22.04.55">&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-22%2022.09.09.png" alt="截屏2021-02-22 22.09.09">&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-22%2022.05.14.png" alt="截屏2021-02-22 22.05.14">&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-22%2022.05.22.png" alt="截屏2021-02-22 22.05.22">&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-22%2022.05.32.png" alt="截屏2021-02-22 22.05.32">&lt;/p>
&lt;/details>
&lt;h2 id="region-based-fully-convolutional-networks-r-fcn-4">Region-based Fully Convolutional Networks (R-FCN) &lt;sup id="fnref:4">&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref">4&lt;/a>&lt;/sup>&lt;/h2>
&lt;h3 id="-idea">💡 Idea&lt;/h3>
&lt;p>Let’s assume we only have a feature map that detects the right eye of a face. Can we use it to locate the face? We should be able to: since the right eye should be in the top-left corner of a facial picture, we can use that to locate the face.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*gqxSBKVla8dzwADKgADpWg-20210222160628867.jpeg" alt="Image for post">&lt;/p>
&lt;p>If we have other feature maps specialized in detecting the left eye, the nose or the mouth, we can combine the results together to locate the face better.&lt;/p>
&lt;h3 id="problem-of-faster-r-cnn">Problem of Faster R-CNN&lt;/h3>
&lt;p>In Faster R-CNN, the &lt;em>detector&lt;/em> applies multiple fully connected layers to make predictions. With 2,000 ROIs, it can be expensive.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">feature_maps&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">process&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ROIs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">region_proposal&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">feature_maps&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">ROI&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">ROIs&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">patch&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">roi_pooling&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">feature_maps&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ROI&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">class_scores&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">box&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">detector&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">patch&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># Expensive!&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">class_probabilities&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">softmax&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">class_scores&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="r-fcn-reduce-the-amount-of-work-needed-for-each-roi">R-FCN: reduce the amount of work needed for each ROI&lt;/h3>
&lt;p>R-FCN improves speed by &lt;strong>reducing the amount of work needed for each ROI.&lt;/strong> The region-based score maps are independent of the ROIs and can be computed once, outside the per-ROI loop. The remaining per-ROI work is then much simpler, so R-FCN is faster than Faster R-CNN.&lt;/p>
&lt;p>&lt;strong>Pseudo-code&lt;/strong>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">feature_maps&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">process&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ROIs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">region_proposal&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">feature_maps&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">score_maps&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">compute_score_map&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">feature_maps&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">ROI&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">ROIs&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">V&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">region_roi_pool&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">score_maps&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ROI&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">class_scores&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">box&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">average&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">V&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># Much simpler!&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">class_probabilities&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">softmax&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">class_scores&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="position-sensitive-score-mapping">&lt;strong>Position-sensitive score mapping&lt;/strong>&lt;/h3>
&lt;p>Let’s consider a 5 × 5 feature map &lt;strong>M&lt;/strong> with a blue square object inside. We divide the square object equally into 3 × 3 regions.&lt;/p>
&lt;p>Now, we create a new feature map from M to detect the top left (TL) corner of the square only. The new feature map looks like the one on the right below. &lt;strong>Only the yellow grid cell [2, 2] is activated.&lt;/strong>&lt;/p>
&lt;figure>&lt;img src="https://miro.medium.com/max/700/1*S0enLblW1t7VK19E1Fs4lw.png">&lt;figcaption>
&lt;h4>Create a new feature map from the left to detect the top left corner of an object.&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>Since we divide the square into 9 parts, we can create 9 feature maps each detecting the corresponding region of the object. These feature maps are called &lt;strong>position-sensitive score maps&lt;/strong> because each map detects (scores) a sub-region of the object.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*HaOHsDYAf8LU2YQ7D3ymOg.png" alt="Image for post">&lt;/p>
&lt;p>Let’s say the dotted red rectangle below is the proposed ROI. We divide it into 3 × 3 regions and ask &lt;strong>how likely it is that each region contains the corresponding part of the object&lt;/strong>.&lt;/p>
&lt;p>For example, how likely is it that the top-left ROI region contains the top-left part of the object? We store the results in a 3 × 3 vote array, shown in the right diagram below. For example, &lt;code>vote_array[0][0]&lt;/code> contains the score for whether we find the top-left region of the square object.&lt;/p>
&lt;figure>&lt;img src="https://miro.medium.com/max/700/1*Ym6b1qS0pXpeRVMysvvukg.jpeg">&lt;figcaption>
&lt;h4>Apply ROI onto the feature maps to output a 3 x 3 array.&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>This process to map score maps and ROIs to the vote array is called &lt;strong>position-sensitive&lt;/strong> &lt;strong>ROI-pool&lt;/strong>.&lt;/p>
&lt;figure>&lt;img src="https://miro.medium.com/max/700/1*K4brSqensF8wL5i6JV1Eig.png">&lt;figcaption>
&lt;h4>Overlay a portion of the ROI onto the corresponding score map to calculate &lt;code>V[i][j]&lt;/code>&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>After calculating all the values for the position-sensitive ROI pool, &lt;strong>the class score is the average of all its elements.&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*ZJiWcIl2DUyx1-ZqArw33A.png" alt="Image for post" style="zoom:80%;" />
&lt;h3 id="data-flow">Data flow&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Let’s say we have &lt;strong>$C$&lt;/strong> classes to detect.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>We expand it to $C + 1$ classes so we include a new class for the background (non-object). Each class will have its own $3 \times 3$ score maps and therefore a total of $(C+1) \times 3 \times 3$ score maps.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Using its own set of score maps, we predict a class score for each class.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Then we apply a softmax on those scores to compute the probability for each class.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;figure>&lt;img src="https://miro.medium.com/max/1000/1*Gv45peeSM2wRQEdaLG_YoQ.png">&lt;figcaption>
&lt;h4>Data flow of R-FCN ($k=3$)&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://jonathan-hui.medium.com/what-do-we-learn-from-region-based-object-detectors-faster-r-cnn-r-fcn-fpn-7e354377a7c9">What do we learn from region based object detectors (Faster R-CNN, R-FCN, FPN)?&lt;/a> - A nice and clear comprehensive tutorial for region-based object detectors&lt;/li>
&lt;li>&lt;a href="http://cs231n.stanford.edu/slides/2020/lecture_12.pdf">Stanford CS231n slides&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://medium.com/cubo-ai/%E7%89%A9%E9%AB%94%E5%81%B5%E6%B8%AC-object-detection-740096ec4540">關於影像辨識，所有你應該知道的深度學習模型&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://blog.csdn.net/v_JULY_v/article/details/80170182">一文读懂目标检测：R-CNN、Fast R-CNN、Faster R-CNN、YOLO、SSD&lt;/a>&lt;/li>
&lt;li>RoI pooling: &lt;a href="https://towardsdatascience.com/understanding-region-of-interest-part-1-roi-pooling-e4f5dd65bb44">Understanding Region of Interest — (RoI Pooling)&lt;/a>&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>Girshick, R., Donahue, J., Darrell, T., &amp;amp; Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. &lt;em>Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition&lt;/em>, 580–587. &lt;a href="https://doi.org/10.1109/CVPR.2014.81">https://doi.org/10.1109/CVPR.2014.81&lt;/a>&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Girshick, R. (2015). Fast R-CNN. &lt;em>Proceedings of the IEEE International Conference on Computer Vision&lt;/em>, &lt;em>2015 International Conference on Computer Vision&lt;/em>, &lt;em>ICCV 2015&lt;/em>, 1440–1448. &lt;a href="https://doi.org/10.1109/ICCV.2015.169">https://doi.org/10.1109/ICCV.2015.169&lt;/a>&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3">
&lt;p>Ren, S., He, K., Girshick, R., &amp;amp; Sun, J. (2017). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. &lt;em>IEEE Transactions on Pattern Analysis and Machine Intelligence&lt;/em>, &lt;em>39&lt;/em>(6), 1137–1149. &lt;a href="https://doi.org/10.1109/TPAMI.2016.2577031">https://doi.org/10.1109/TPAMI.2016.2577031&lt;/a>&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:4">
&lt;p>Dai, J., Li, Y., He, K., &amp;amp; Sun, J. (2016). R-FCN: Object detection via region-based fully convolutional networks. &lt;em>Advances in Neural Information Processing Systems&lt;/em>, 379–387.&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item></channel></rss>