<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Object Detection | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/object-detection/</link><atom:link href="https://haobin-tan.netlify.app/tags/object-detection/index.xml" rel="self" type="application/rss+xml"/><description>Object Detection</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sat, 20 Feb 2021 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>Object Detection</title><link>https://haobin-tan.netlify.app/tags/object-detection/</link></image><item><title>Object Detection</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/</link><pubDate>Thu, 12 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/</guid><description/></item><item><title>Evaluation Metrics for Object Detection</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/evaluation-metrics-object-detection/</link><pubDate>Thu, 12 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/evaluation-metrics-object-detection/</guid><description>&lt;h2 id="precision--recall">Precision &amp;amp; Recall&lt;/h2>
&lt;p>Confusion matrix:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*CPnO_bcdbE8FXTejQiV2dg.png" alt="Image result for true positive false positive">&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Precision&lt;/strong>: measures how accurate your predictions are, i.e. the percentage of your predictions that are correct.
&lt;/p>
$$
\text{precision} = \frac{TP}{TP + FP}
$$
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Recall&lt;/strong>: measures how well you find all the positives.
&lt;/p>
$$
\text{recall} = \frac{TP}{TP + FN}
$$
&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>For more details, see: &lt;a href="https://haobin-tan.netlify.app/docs/ai/machine-learning/ml-fundamentals/evaluation/">Evaluation&lt;/a>&lt;/p>
&lt;/blockquote>
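&lt;p>As a minimal sketch with made-up counts, precision and recall can be computed directly from the confusion-matrix entries:&lt;/p>
&lt;pre>&lt;code class="language-python"># Made-up counts for illustration only
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)  # fraction of predictions that are correct
recall = tp / (tp + fn)     # fraction of ground-truth positives that are found

print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.80, recall=0.67
&lt;/code>&lt;/pre>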
&lt;h2 id="iou-intersection-over-union">IoU (Intersection over union)&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>IoU measures the overlap between 2 boundaries.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/0*VnvOCo9NkWG705F3.png" alt="Image for post" style="zoom:67%;" />
&lt;/li>
&lt;li>
&lt;p>We use that to measure how much our predicted boundary overlaps with the ground truth (the real object boundary).&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*FrmKLxCtkokDC3Yr1wc70w.png" alt="Image for post" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
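&lt;p>A minimal sketch of the IoU computation for two axis-aligned boxes; the [x1, y1, x2, y2] corner format used here is only an assumption for illustration:&lt;/p>
&lt;pre>&lt;code class="language-python">def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    # Corners of the intersection rectangle
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])

    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175 = 0.142...
&lt;/code>&lt;/pre>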
&lt;h2 id="ap-average-precision">AP (Average Precision)&lt;/h2>
&lt;p>Let’s create an over-simplified example to demonstrate the calculation of average precision. In this example, the whole dataset contains 5 apples only.&lt;/p>
&lt;p>We collect all the predictions made for apples in all the images and rank them in descending order according to the predicted confidence level. The second column indicates whether the prediction is correct or not. In this example, the prediction is correct if IoU ≥ 0.5.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*9ordwhXD68cKCGzuJaH2Rg.png" alt="Image for post">&lt;/p>
&lt;p>Let&amp;rsquo;s look at the 3rd row:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Precision&lt;/strong>: proportion of TP (= 2/3 = 0.67)&lt;/li>
&lt;li>&lt;strong>Recall&lt;/strong>: proportion of TP out of the possible positives (= 2/5 = 0.4)&lt;/li>
&lt;/ul>
&lt;p>Recall values increase as we go down the prediction ranking. However, precision has a &lt;em>zigzag&lt;/em> pattern — it goes down with false positives and goes up again with true positives.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*ODZ6eZMrie3XVTOMDnXTNQ.jpeg" alt="Image for post">&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*VenTq4IgxjmIpOXWdFb-jg.png" alt="Image for post" />
&lt;p>The general definition for the &lt;strong>Average Precision (AP)&lt;/strong> is finding the &lt;strong>area under the precision-recall curve&lt;/strong> above.
&lt;/p>
$$
\mathrm{AP}=\int\_{0}^{1} p(r) d r
$$
&lt;h3 id="smoothing-the-precision-recall-curve">Smoothing the Precision-Recall-Curve&lt;/h3>
&lt;p>Before calculating AP for the object detection, we often &lt;strong>smooth&lt;/strong> out the zigzag pattern first: at each recall level, we replace each precision value with the maximum precision value to the right of that recall level.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*pmSxeb4EfdGnzT6Xa68GEQ.jpeg" alt="Image for post">&lt;/p>
&lt;p>The orange line is transformed into the green lines and the curve will decrease monotonically instead of the zigzag pattern.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Before smoothing&lt;/th>
&lt;th>After smoothing&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*VenTq4IgxjmIpOXWdFb-jg.png" alt="Image for post" style="zoom: 67%;" />&lt;/td>
&lt;td>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*zqTL1KW1gwzion9jY8SjHA-20201112121009694.png" alt="Image for post" style="zoom:67%;" />&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Mathematically, we replace the precision value at recall $r$ with the maximum precision for any recall $\tilde{r} \geq r$.
&lt;/p>
$$
p\_{\text {interp}}(r)=\max\_{\tilde{r} \geq r} p(\tilde{r})
$$
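&lt;p>A minimal sketch of this smoothing step, assuming the precision values are ordered by increasing recall:&lt;/p>
&lt;pre>&lt;code class="language-python">def smooth_precision(precisions):
    """Replace each precision with the maximum precision at any higher recall."""
    smoothed = list(precisions)
    for i in range(len(smoothed) - 2, -1, -1):  # sweep from right to left
        smoothed[i] = max(smoothed[i], smoothed[i + 1])
    return smoothed

# Made-up zigzag precisions ordered by increasing recall
print(smooth_precision([1.0, 0.5, 0.67, 0.5, 0.4, 0.5]))
# [1.0, 0.67, 0.67, 0.5, 0.5, 0.5]  -- monotonically decreasing
&lt;/code>&lt;/pre>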
&lt;h3 id="interpolated-ap">&lt;strong>Interpolated AP&lt;/strong>&lt;/h3>
&lt;p>PASCAL VOC is a popular dataset for object detection. In PASCAL VOC2008, the average of the 11-point interpolated precision is calculated.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*naz02wO-XMywlwAdFzF-GA.jpeg" alt="Image for post">&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Divide the recall value from 0 to 1.0 into 11 points — 0, 0.1, 0.2, …, 0.9 and 1.0.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Compute the average of the maximum (interpolated) precision values at these 11 recall points (a code sketch follows this list).
&lt;/p>
$$
\begin{aligned}
A P &amp;=\frac{1}{11} \sum\_{r \in\\{0.0, \ldots, 1.0\\}} A P\_{r} \\\\
&amp;=\frac{1}{11} \sum\_{r \in\\{0.0, \ldots, 1.0\\}} p\_{\text {interp}}(r)
\end{aligned}
$$
&lt;ul>
&lt;li>
&lt;p>In our example:&lt;/p>
&lt;p>$AP = \frac{1}{11} \times (5 \times 1.0 + 4 \times 0.57 + 2 \times 0.5)$&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
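&lt;p>A minimal sketch of the 11-point interpolation (not the official VOC evaluation code), given paired recall and precision lists:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def ap_11_point(recalls, precisions):
    """Average of the interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    recalls, precisions = np.asarray(recalls), np.asarray(precisions)
    ap = 0.0
    for t in np.linspace(0.0, 1.0, 11):
        mask = recalls >= t
        p_interp = precisions[mask].max() if mask.any() else 0.0
        ap += p_interp / 11.0
    return ap
&lt;/code>&lt;/pre>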
&lt;p>However, this interpolated method is an approximation that suffers from two issues.&lt;/p>
&lt;ul>
&lt;li>It is less precise.&lt;/li>
&lt;li>It loses the ability to measure the difference between methods with low AP.&lt;/li>
&lt;/ul>
&lt;p>Therefore, a different AP calculation is adopted after 2008 for PASCAL VOC.&lt;/p>
&lt;h3 id="ap-area-under-curve-auc">AP (Area under curve AUC)&lt;/h3>
&lt;p>For later Pascal VOC competitions, VOC2010–2012 samples the curve at all unique recall values (&lt;em>r₁, r₂, …&lt;/em>), whenever the maximum precision value drops. With this change, we are &lt;strong>measuring the exact area under the precision-recall curve&lt;/strong> after the zigzags are removed.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*TAuQ3UOA8xh_5wI5hwLHcg.jpeg" alt="Image for post">&lt;/p>
&lt;p>No approximation or interpolation is needed &amp;#x1f44f;. Instead of sampling 11 points, &lt;strong>we sample $p(r\_i)$ whenever it drops and compute AP as the sum of the rectangular blocks&lt;/strong>.
&lt;/p>
$$
\begin{array}{l}
p\_{\text {interp}}\left(r\_{n+1}\right)=\displaystyle{\max\_{\tilde{r} \geq r\_{n+1}}} p(\tilde{r}) \\\\
\mathrm{AP}=\sum\left(r\_{n+1}-r\_{n}\right) p\_{\text {interp}}\left(r\_{n+1}\right)
\end{array}
$$
&lt;p>
This definition is called the &lt;strong>Area Under Curve (AUC)&lt;/strong>.&lt;/p>
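&lt;p>A minimal sketch of this area-under-curve computation (mirroring the usual VOC-style procedure, not the official code), assuming recalls are sorted in increasing order with matching precisions:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def ap_area_under_curve(recalls, precisions):
    """AP as the exact area under the smoothed precision-recall curve."""
    r = np.concatenate(([0.0], recalls, [1.0]))     # sentinel recall values
    p = np.concatenate(([0.0], precisions, [0.0]))  # sentinel precision values

    # Smooth: precision at each point = max precision at any higher recall
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])

    # Sum rectangular blocks wherever recall increases
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
&lt;/code>&lt;/pre>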
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://jonathan-hui.medium.com/map-mean-average-precision-for-object-detection-45c121a31173">&lt;strong>mAP (mean Average Precision) for Object Detection&lt;/strong>&lt;/a> &amp;#x1f44d;&lt;/li>
&lt;li>&lt;a href="https://medium.com/@yanfengliux/the-confusing-metrics-of-ap-and-map-for-object-detection-3113ba0386ef">The Confusing Metrics of AP and mAP for Object Detection / Instance Segmentation&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://zhuanlan.zhihu.com/p/56961620">详解object detection中的mAP&lt;/a>​ &amp;#x1f44d;&lt;/li>
&lt;/ul></description></item><item><title>COCO JSON Format for Object Detection</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/coco-dataset-format/</link><pubDate>Wed, 02 Dec 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/coco-dataset-format/</guid><description>&lt;p>The COCO dataset is formatted in &lt;a href="https://www.w3schools.com/js/js_json_syntax.asp">JSON&lt;/a> and is a collection of “info”, “licenses”, “images”, “annotations”, “categories” (in most cases), and “segment info” (in one case).&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-json" data-lang="json">&lt;span class="line">&lt;span class="cl">&lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;info&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">{&lt;/span>&lt;span class="err">...&lt;/span>&lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;licenses&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="err">...&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;images&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="err">...&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;annotations&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="err">...&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;categories&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="err">...&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="err">&amp;lt;--&lt;/span> &lt;span class="err">Not&lt;/span> &lt;span class="err">in&lt;/span> &lt;span class="err">Captions&lt;/span> &lt;span class="err">annotations&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;segment_info&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="err">...&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="err">&amp;lt;--&lt;/span> &lt;span class="err">Only&lt;/span> &lt;span class="err">in&lt;/span> &lt;span class="err">Panoptic&lt;/span> &lt;span class="err">annotations&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Note:&lt;/p>
&lt;ul>
&lt;li>&lt;code>categories&lt;/code> field is NOT in Captions annotations&lt;/li>
&lt;li>&lt;code>segment_info&lt;/code> field is ONLY in Panoptic annotations&lt;/li>
&lt;/ul>
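&lt;p>As a minimal sketch (the file path below is only a placeholder), an annotation file in this format can be loaded with the standard &lt;code>json&lt;/code> module and inspected:&lt;/p>
&lt;pre>&lt;code class="language-python">import json

# "instances_val2017.json" is a placeholder path to a COCO-style annotation file
with open("instances_val2017.json") as f:
    coco = json.load(f)

print(list(coco.keys()))  # ['info', 'licenses', 'images', 'annotations', 'categories']
print(len(coco["images"]), "images,", len(coco["annotations"]), "annotations")
&lt;/code>&lt;/pre>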
&lt;h2 id="info">Info&lt;/h2>
&lt;p>The “info” section contains &lt;strong>high level information&lt;/strong> about the dataset. If you are creating your own dataset, you can fill in whatever is appropriate.&lt;/p>
&lt;p>Example:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-json" data-lang="json">&lt;span class="line">&lt;span class="cl">&lt;span class="s2">&amp;#34;info&amp;#34;&lt;/span>&lt;span class="err">:&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;description&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;COCO 2017 Dataset&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;url&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;http://cocodataset.org&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;version&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;1.0&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;year&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">2017&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;contributor&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;COCO Consortium&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;date_created&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;2017/09/01&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="lincenses">Lincenses&lt;/h2>
&lt;p>The “licenses” section contains a &lt;strong>list&lt;/strong> of image licenses that apply to images in the dataset.&lt;/p>
&lt;p>Example:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-json" data-lang="json">&lt;span class="line">&lt;span class="cl">&lt;span class="s2">&amp;#34;licenses&amp;#34;&lt;/span>&lt;span class="err">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;url&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;http://creativecommons.org/licenses/by-nc-sa/2.0/&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Attribution-NonCommercial-ShareAlike License&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;url&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;http://creativecommons.org/licenses/by-nc/2.0/&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Attribution-NonCommercial License&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="err">...&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="images">Images&lt;/h2>
&lt;ul>
&lt;li>Contains the &lt;strong>complete list&lt;/strong> of images in your dataset&lt;/li>
&lt;li>No labels, bounding boxes, or segmentations are specified in this part; it&amp;rsquo;s simply a list of images and information about each one.&lt;/li>
&lt;li>&lt;code>coco_url&lt;/code>, &lt;code>flickr_url&lt;/code>, and &lt;code>date_captured&lt;/code> are just for reference. Your deep learning application probably will only need the &lt;strong>&lt;code>file_name&lt;/code>&lt;/strong>.&lt;/li>
&lt;li>Image ids need to be &lt;strong>unique&lt;/strong> (among other images)&lt;/li>
&lt;li>They do not necessarily need to match the file name (unless the deep learning code you are using makes an assumption that they’ll be the same)&lt;/li>
&lt;/ul>
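&lt;p>As a minimal sketch (assuming &lt;code>coco&lt;/code> is the annotation dict loaded earlier), the images list can be indexed by id to look up the file name that training code needs:&lt;/p>
&lt;pre>&lt;code class="language-python"># Build a lookup table from image id to image record
images_by_id = {img["id"]: img for img in coco["images"]}

img = images_by_id[397133]
print(img["file_name"], img["width"], img["height"])
&lt;/code>&lt;/pre>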
&lt;p>Example:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-json" data-lang="json">&lt;span class="line">&lt;span class="cl">&lt;span class="s2">&amp;#34;images&amp;#34;&lt;/span>&lt;span class="err">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;license&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">4&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;file_name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;000000397133.jpg&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;coco_url&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;http://images.cocodataset.org/val2017/000000397133.jpg&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;height&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">427&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;width&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">640&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;date_captured&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;2013-11-14 17:02:52&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;flickr_url&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;http://farm7.staticflickr.com/6116/6255196340_da26cf2c9e_z.jpg&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">397133&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;license&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;file_name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;000000037777.jpg&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;coco_url&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;http://images.cocodataset.org/val2017/000000037777.jpg&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;height&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">230&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;width&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">352&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;date_captured&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;2013-11-14 20:55:31&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;flickr_url&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;http://farm9.staticflickr.com/8429/7839199426_f6d48aa585_z.jpg&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">37777&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="err">...&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="annotations">Annotations&lt;/h2>
&lt;p>COCO has five annotation types: for &lt;a href="http://cocodataset.org/#detection-2018">object detection&lt;/a>, &lt;a href="http://cocodataset.org/#keypoints-2018">keypoint detection&lt;/a>, &lt;a href="http://cocodataset.org/#stuff-2018">stuff segmentation&lt;/a>, &lt;a href="http://cocodataset.org/#panoptic-2018">panoptic segmentation&lt;/a>, and &lt;a href="http://cocodataset.org/#captions-2015">image captioning&lt;/a>. The annotations are stored using &lt;a href="http://json.org/">JSON&lt;/a>.&lt;/p>
&lt;h3 id="object-detection">Object detection&lt;/h3>
&lt;p>An object detection annotation draws shapes around objects in an image. It has a list of &lt;strong>categories&lt;/strong> and &lt;strong>annotations&lt;/strong>.&lt;/p>
&lt;h4 id="categories">Categories&lt;/h4>
&lt;ul>
&lt;li>Contains a list of &lt;strong>categories&lt;/strong> (e.g. dog, boat)
&lt;ul>
&lt;li>each of those belongs to a &lt;strong>supercategory&lt;/strong> (e.g. animal, vehicle).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>The original COCO dataset contains 80 object categories (with category ids ranging from 1 to 90).&lt;/li>
&lt;li>You can use the existing COCO categories or create an entirely new list of your own.&lt;/li>
&lt;li>&lt;strong>Each category id must be unique (among the rest of the categories).&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>Example:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-json" data-lang="json">&lt;span class="line">&lt;span class="cl">&lt;span class="s2">&amp;#34;categories&amp;#34;&lt;/span>&lt;span class="err">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;supercategory&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;person&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;person&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;supercategory&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;vehicle&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;bicycle&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;supercategory&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;vehicle&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">3&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;car&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="err">...&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="annotations-1">Annotations&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>&lt;code>segmentation&lt;/code> : list of points (represented as $(x, y)$ coordinates) that define the shape of the object&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>area&lt;/code> : measured in pixels (e.g. a 10px by 20px box would have an area of 200)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>iscrowd&lt;/code> : specifies whether the segmentation is for a single object (&lt;code>iscrowd=0&lt;/code>) or for a group/cluster of objects (&lt;code>iscrowd=1&lt;/code>)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>image_id&lt;/code>: corresponds to a specific image in the dataset&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>bbox&lt;/code> : bounding box, format is &lt;code>[top left x position, top left y position, width, height]&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>category_id&lt;/code>: corresponds to a single category specified in the categories section&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>id&lt;/code>: Each annotation also has an id (unique to all other annotations in the dataset)&lt;/p>
&lt;/li>
&lt;/ul>
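&lt;p>A minimal sketch (again assuming &lt;code>coco&lt;/code> is the loaded annotation dict) that groups annotations by &lt;code>image_id&lt;/code> and converts the COCO &lt;code>[x, y, width, height]&lt;/code> bbox into corner format:&lt;/p>
&lt;pre>&lt;code class="language-python">from collections import defaultdict

def xywh_to_xyxy(bbox):
    """COCO bbox [top-left x, top-left y, width, height] to [x1, y1, x2, y2]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]

anns_by_image = defaultdict(list)
for ann in coco["annotations"]:
    anns_by_image[ann["image_id"]].append(ann)

for ann in anns_by_image[289343]:
    print(ann["category_id"], xywh_to_xyxy(ann["bbox"]))
&lt;/code>&lt;/pre>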
&lt;p>Example:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-json" data-lang="json">&lt;span class="line">&lt;span class="cl">&lt;span class="s2">&amp;#34;annotations&amp;#34;&lt;/span>&lt;span class="err">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;segmentation&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[[&lt;/span>&lt;span class="mf">510.66&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mf">423.01&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mf">511.72&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mf">420.03&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="err">...&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mf">510.45&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mf">423.01&lt;/span>&lt;span class="p">]],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;area&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mf">702.1057499999998&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;iscrowd&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;image_id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">289343&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;bbox&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="mf">473.07&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mf">395.93&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mf">38.65&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mf">28.67&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;category_id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">18&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1768&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="err">...&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>Has a segmentation list of vertices (x, y pixel positions)&lt;/li>
&lt;li>Has an area of 702 pixels (pretty small) and a bounding box of [473.07,395.93,38.65,28.67]&lt;/li>
&lt;li>Is not a crowd (meaning it’s a single object)&lt;/li>
&lt;li>Has category id 18 (which is a dog)&lt;/li>
&lt;li>Corresponds with an image with id 289343 (which is a person on a strange bicycle and a tiny dog)&lt;/li>
&lt;/ul>
&lt;h2 id="example">Example&lt;/h2>
&lt;p>Source: &lt;a href="https://roboflow.com/formats/coco-json">https://roboflow.com/formats/coco-json&lt;/a>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-json" data-lang="json">&lt;span class="line">&lt;span class="cl">&lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;info&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;year&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;2020&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;version&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;1&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;description&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Exported from roboflow.ai&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;contributor&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Roboflow&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;url&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;https://app.roboflow.ai/datasets/hard-hat-sample/1&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;date_created&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;2000-01-01T00:00:00+00:00&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;licenses&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;url&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;https://creativecommons.org/publicdomain/zero/1.0/&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Public Domain&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;categories&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Workers&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;supercategory&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;none&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;head&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;supercategory&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Workers&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;helmet&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;supercategory&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Workers&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">3&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;person&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;supercategory&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Workers&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;images&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;license&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;file_name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;0001.jpg&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;height&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">275&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;width&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">490&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;date_captured&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;2020-07-20T19:39:26+00:00&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;annotations&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;image_id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;category_id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;bbox&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">45&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">85&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">85&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;area&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">7225&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;segmentation&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;iscrowd&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;image_id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;category_id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;bbox&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">324&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">29&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">72&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">81&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;area&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">5832&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;segmentation&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;iscrowd&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="https://cocodataset.org/#format-data">COCO Data format&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://www.immersivelimit.com/tutorials/create-coco-annotations-from-scratch/#coco-dataset-format">Create COCO Annotations From Scratch&lt;/a>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Video tutorial&lt;/p>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/h6s61a_pqfM?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>You Only Look Once (YOLO)</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/yolo/</link><pubDate>Wed, 04 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/yolo/</guid><description>&lt;p>The problem with the sliding-windows method is that it does not output the most accurate bounding boxes. A good way to get more accurate bounding boxes is the &lt;strong>YOLO (You Only Look Once)&lt;/strong> algorithm.&lt;/p>
&lt;h2 id="overview-how-does-yolo-work">Overview: How does YOLO work?&lt;/h2>
&lt;p>Let&amp;rsquo;s say we have an input image (e.g. 100x100). We&amp;rsquo;re going to place a grid on this image. For simplicity of illustration, we&amp;rsquo;re going to use a 3x3 grid as an example.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-05%2011.39.03.png" alt="截屏2020-11-05 11.39.03" style="zoom:80%;" />
&lt;p>(In an actual implementation, we&amp;rsquo;ll use a finer grid, e.g. 19x19.)&lt;/p>
&lt;h3 id="labels-for-training">Labels for training&lt;/h3>
&lt;p>For &lt;strong>each&lt;/strong> grid cell, we specify a target label $\mathbf{y}$:
&lt;/p>
$$
\mathbf{y} = \left(
\begin{array}{c}
P\_c \\\\
b\_x \\\\
b\_y \\\\
b\_h \\\\
b\_w \\\\
c\_1 \\\\
c\_2 \\\\
\vdots \\\\
c\_n
\end{array}
\right)
\in \mathbb{R}^{5 + n}
$$
&lt;ul>
&lt;li>
&lt;p>$P\_c$: objectness&lt;/p>
&lt;ul>
&lt;li>depends on whether there&amp;rsquo;s an object in that grid cell.&lt;/li>
&lt;li>If yes, then $P\_c = 1$; otherwise $P\_c = 0$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Bounding box coordinates&lt;/p>
&lt;ul>
&lt;li>$b\_x, b\_y \in (0, 1)$: describe the center point of the object &lt;strong>relative&lt;/strong> to the grid cell
&lt;ul>
&lt;li>If $>1$, then the center point is outside of the current grid cell and it should be assigned to another grid cell&lt;/li>
&lt;li>Some parameterizations also use Sigmoid function to ensure $b\_x, b\_y \in (0, 1)$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>$b\_h, b\_w$: height and width of the bounding box,
&lt;ul>
&lt;li>specified as fractions of the grid cell&amp;rsquo;s height and width, respectively (can be $\geq 1$)&lt;/li>
&lt;li>Some parameterizations also use exponential function to ensure non-negativity&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>$c\_1, c\_2, \dots, c\_n$: class probabilities for the $n$ object classes we want to detect&lt;/p>
&lt;ul>
&lt;li>
&lt;p>E.g. we want to detect 3 classes of object:&lt;/p>
&lt;ul>
&lt;li>pedestrian ($c\_1$),&lt;/li>
&lt;li>car ($c\_2$),&lt;/li>
&lt;li>motorcycle ($c\_3$),&lt;/li>
&lt;/ul>
&lt;p>so our target $\mathbf{y}$ will be:
&lt;/p>
$$
\mathbf{y} = \left(
\begin{array}{c}
P\_c \\\\
b\_x \\\\
b\_y \\\\
b\_h \\\\
b\_w \\\\
c\_1 \\\\
c\_2 \\\\
c\_3
\end{array}
\right)
\in \mathbb{R}^{8}
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="example">Example&lt;/h4>
&lt;p>If we consider the upper left grid cell (at position $(0, 0)$)&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-05%2011.58.40.png" alt="截屏2020-11-05 11.58.40" style="zoom:80%;" />
&lt;p>There&amp;rsquo;s no object in this grid cell, so $P\_c = 0$, and we don&amp;rsquo;t care about the rest of the elements of $\mathbf{y}$:
&lt;/p>
$$
\mathbf{y} = \left(
\begin{array}{c}
0 \\\\
? \\\\
? \\\\
? \\\\
? \\\\
? \\\\
? \\\\
?
\end{array}
\right)
\in \mathbb{R}^{8}
$$
&lt;blockquote>
&lt;p>Here we use the symbol &lt;code>?&lt;/code>​ to mark &amp;ldquo;don&amp;rsquo;t care&amp;rdquo;.&lt;/p>
&lt;p>However, a neural network can&amp;rsquo;t output a question mark or a &amp;ldquo;don&amp;rsquo;t care&amp;rdquo;, so we&amp;rsquo;ll put some numbers for the rest. These numbers will basically be ignored, because the neural network is telling you that there&amp;rsquo;s no object there, so it doesn&amp;rsquo;t really matter whether the output describes a bounding box or a car. They are basically just some set of numbers, more or less noise.&lt;/p>
&lt;/blockquote>
&lt;p>Now, how about the grid cells in the second row?&lt;/p>
&lt;p>To give a bit more detail, this image has two objects. And what the YOLO algorithm does is&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>it takes the midpoint of each of the two objects and then assigns each object to the grid cell containing its midpoint.&lt;/strong> So the left car is assigned to the grid cell marked with green, and the car on the right is assigned to the grid cell marked with yellow.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-05%2012.18.07.png" alt="截屏2020-11-05 12.18.07" style="zoom:80%;" />
&lt;ul>
&lt;li>For the left grid cell marked with green, the target label $\mathbf{y}$ would be as follows:
$$
\mathbf{y} = \left(
\begin{array}{c}
1 \\\\
b\_x \\\\
b\_y \\\\
b\_h \\\\
b\_w \\\\
0 \\\\
1 \\\\
0
\end{array}
\right)
$$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Even though the central grid cell has some parts of both cars, we&amp;rsquo;ll pretend the central grid cell has &lt;strong>no&lt;/strong> interesting object. So the class label of the central grid cell is
&lt;/p>
$$
\mathbf{y} = \left(
\begin{array}{c}
0 \\\\
? \\\\
? \\\\
? \\\\
? \\\\
? \\\\
? \\\\
?
\end{array}
\right)
$$
&lt;/li>
&lt;/ul>
&lt;p>For each of these 9 grid cells, we end up with an 8-dimensional output vector. So the total target output volume is $(3 \times 3) \times 8$.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/yolo1.png" alt="yolo1">&lt;/p>
&lt;p>&lt;strong>Generally speaking, assuming that we have $n \times n$ grid cells, and we want to detect $C$ classes of objects, then the target output volume will be $(n \times n) \times (5 + C)$.&lt;/strong>&lt;/p>
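&lt;p>As a minimal sketch (not the official YOLO implementation), a $3 \times 3 \times 8$ target volume for a toy image with two made-up car boxes could be built like this:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

GRID, NUM_CLASSES = 3, 3
# Made-up ground truth: (center_x, center_y, width, height, class_index),
# all coordinates normalized to [0, 1] relative to the image; class 1 = car
objects = [
    (0.25, 0.60, 0.30, 0.25, 1),  # car on the left
    (0.80, 0.55, 0.25, 0.20, 1),  # car on the right
]

target = np.zeros((GRID, GRID, 5 + NUM_CLASSES))
for cx, cy, w, h, cls in objects:
    col, row = int(cx * GRID), int(cy * GRID)     # grid cell containing the midpoint
    bx, by = cx * GRID - col, cy * GRID - row     # midpoint relative to that cell
    bh, bw = h * GRID, w * GRID                   # size as a fraction of the cell
    target[row, col, :5] = [1.0, bx, by, bh, bw]  # P_c = 1 plus the box
    target[row, col, 5 + cls] = 1.0               # one-hot class label

print(target.shape)  # (3, 3, 8)
&lt;/code>&lt;/pre>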
&lt;h3 id="training">Training&lt;/h3>
&lt;p>To train our neural network, the input is $100 \times 100 \times 3$. We then use a usual ConvNet with conv layers, max-pool layers, and so on, so that it eventually maps to a $3 \times 3 \times 8$ output volume. Given an input image $X$ and its target labels $\mathbf{y}$ of shape $3 \times 3 \times 8$, we use backpropagation to train the network to map any input $X$ to this type of output volume $\mathbf{y}$.&lt;/p>
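&lt;p>A minimal sketch of such a network in PyTorch; the layer sizes are arbitrary and this is not the original YOLO architecture:&lt;/p>
&lt;pre>&lt;code class="language-python">import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(3),          # collapse spatial dims to the 3x3 grid
    nn.Conv2d(32, 8, kernel_size=1),  # 8 channels: P_c, b_x, b_y, b_h, b_w, c_1..c_3
)

x = torch.randn(1, 3, 100, 100)     # one 100x100 RGB image
y = model(x)                        # shape: (1, 8, 3, 3)
print(y.permute(0, 2, 3, 1).shape)  # (1, 3, 3, 8), comparable with the target labels
&lt;/code>&lt;/pre>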
&lt;h3 id="thumbsup-advantages">&amp;#x1f44d; Advantages&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>The neural network outputs precise bounding boxes &amp;#x1f44f;&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Efficient and fast thanks to convolution operations &amp;#x1f44f;&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="intersection-over-union-iou">Intersection over Union (IoU)&lt;/h2>
&lt;p>How can we tell whether our object detection algorithm is working well?&lt;/p>
&lt;p>The &lt;strong>Intersection-over-Union (IoU)&lt;/strong>, aka Jaccard Index or Jaccard Overlap, measures the degree or extent to which two boxes overlap.&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/IoU.jpg">&lt;figcaption>
&lt;h4>Intersection over Union (IoU). Src: [a-PyTorch-Tutorial-to-Object-Detection](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection)&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>In object detection:
&lt;/p>
$$
\text{IoU} = \frac{\text{Overlapping region between ground truth and prediction bounding box}}{\text{Combined region of ground truth and prediction bounding box}}
$$
&lt;p>
If $\text{IoU} \geq \text{threshold}$, we would say the prediction is correct.&lt;/p>
&lt;p>By convention, $\text{threshold} = 0.5$. We can also choose other values greater than 0.5.&lt;/p>
&lt;p>Example:&lt;/p>
&lt;figure>&lt;img src="https://media5.datahacker.rs/2018/11/IoU.png">&lt;figcaption>
&lt;h4>IoU example. Src: [026 CNN Intersection over Union | Master Data Science](https://www.google.com/url?sa=i&amp;amp;url=http%3A%2F%2Fdatahacker.rs%2Fdeep-learning-intersection-over-union%2F&amp;amp;psig=AOvVaw2K4pvRAkwPw3FZYIelxngf&amp;amp;ust=1604671149058000&amp;amp;source=images&amp;amp;cd=vfe&amp;amp;ved=0CA0QjhxqFwoTCIjNgoLI6-wCFQAAAAAdAAAAABAg)&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;h2 id="non-max-suppresion">Non-max suppresion&lt;/h2>
&lt;p>One problem we haven&amp;rsquo;t addressed so far is that YOLO can detect the same object multiple times.&lt;/p>
&lt;p>For example:&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/non-max-suppression.png">&lt;figcaption>
&lt;h4>Each car has two or more detections with different probabilities. The reason is that several grid cells think they contain the center point of the object. Src: [a-PyTorch-Tutorial-to-Object-Detection](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection)&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>&lt;strong>Non-max Suppression&lt;/strong> is a way to make sure that YOLO detects the object just once. It cleans up redundant detections, so we end up with just one detection per object rather than multiple detections per object.&lt;/p>
&lt;ol>
&lt;li>Takes the detection with the largest $P\_c$ (the probability of a detection) &lt;em>(&amp;ldquo;That&amp;rsquo;s my most confident detection, so let&amp;rsquo;s highlight that and just say I found the car there.&amp;rdquo;)&lt;/em>&lt;/li>
&lt;li>Looks at all of the remaining boxes; any with a high overlap (i.e. a high IoU) with the selected detection are suppressed/discarded (a minimal sketch follows this list)&lt;/li>
&lt;/ol>
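&lt;p>A minimal sketch of per-class non-max suppression; the corner box format and the 0.5 threshold are illustrative assumptions:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def box_iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the most confident detection, drop overlapping ones, repeat."""
    order = np.argsort(scores)[::-1]   # most confident detection first
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        overlaps = np.array([box_iou(boxes[best], boxes[i]) for i in rest])
        order = rest[overlaps &lt;= iou_threshold]  # discard high-overlap boxes
    return keep
&lt;/code>&lt;/pre>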
&lt;p>Example:&lt;/p>
&lt;figure>&lt;img src="https://www.jeremyjordan.me/content/images/2018/07/Screen-Shot-2018-07-10-at-9.46.29-PM.png">&lt;figcaption>
&lt;h4>Non-max suppression example. Src: [An overview of object detection: one-stage methods.](https://www.jeremyjordan.me/object-detection-one-stage/)&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>For multi-class detection, non-max suppression should be carried out &lt;strong>on each class separately&lt;/strong>.&lt;/p>
&lt;h2 id="anchor-box">Anchor box&lt;/h2>
&lt;p>One of the problems with object detection as we have seen it so far is that &lt;strong>each of the grid cells can detect only one object&lt;/strong>. What if a grid cell wants to detect multiple objects?&lt;/p>
&lt;p>For example: we want to detect 3 classes (pedestrians, cars, motorcycles), and our input image looks like this:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-05%2015.24.05.png" alt="截屏2020-11-05 15.24.05">&lt;/p>
&lt;p>The midpoint of the pedestrian and the midpoint of the car are in almost the same place, and both of them fall into the same grid cell. With the output vector
&lt;/p>
$$
\mathbf{y} = \left(
\begin{array}{c}
P\_c \\\\
b\_x \\\\
b\_y \\\\
b\_h \\\\
b\_w \\\\
c\_1 \\\\
c\_2 \\\\
c\_3
\end{array}
\right)
$$
&lt;p>
we have seen before, the cell won&amp;rsquo;t be able to output two detections &amp;#x1f622;.&lt;/p>
&lt;p>With the idea of &lt;strong>anchor boxes&lt;/strong>, we are going to&lt;/p>
&lt;ul>
&lt;li>pre-define a number of different shapes of anchor boxes (in this example, just 2)&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/anchor-box.png" alt="anchor-box">&lt;/p>
&lt;ul>
&lt;li>
&lt;p>and associate them with the class labels
&lt;/p>
$$
\mathbf{y} = \left(\underbrace{P\_c, b\_x, b\_y, b\_h, b\_w, c\_1, c\_2, c\_3}\_{\text{anchor box 1}} , \underbrace{P\_c, b\_x, b\_y, b\_h, b\_w, c\_1, c\_2, c\_3}\_{\text{anchor box 2}}\right)^T \in \mathbb{R}^{16}
$$
&lt;ul>
&lt;li>Because the shape of the pedestrian is more similar to the shape of anchor box 1 than anchor box 2, we can use the first eight numbers to encode pedestrian.&lt;/li>
&lt;li>Because the box around the car is more similar to the shape of anchor box 2 than anchor box 1, we can then use the second 8 numbers to encode that the second object here is the car&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>To summarise, with a number of pre-defined anchor boxes: Each object in training image is assigned to&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>the grid cell that contains the object&amp;rsquo;s midpoint, and&lt;/strong>&lt;/li>
&lt;li>&lt;strong>the anchor box of that grid cell with the highest IoU with the ground-truth bounding box&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>In other words, now the object is assigned to a $(\text{grid cell}, \text{anchor box})$ pair.&lt;/strong>&lt;/p>
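&lt;p>A minimal sketch of this assignment rule, assuming relative $(x, y, w, h)$ ground-truth boxes, a square grid, and a shape-only IoU (all of which are illustration choices, not part of the original description):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">def shape_iou(wh1, wh2):
    """IoU of two boxes that share the same center, i.e. compared by shape only."""
    w1, h1 = wh1
    w2, h2 = wh2
    inter = min(w1, w2) * min(h1, h2)
    return inter / (w1 * h1 + w2 * h2 - inter)


def assign_object(gt_box, grid_size, anchors):
    """
    gt_box:  (x_center, y_center, width, height), all relative to the image (0..1)
    anchors: list of (width, height) anchor shapes, also relative to the image
    Returns the (grid_row, grid_col, anchor_index) pair the object is assigned to.
    """
    x, y, w, h = gt_box
    # 1. the grid cell that contains the object's midpoint
    row, col = int(y * grid_size), int(x * grid_size)
    # 2. the anchor box whose shape has the highest IoU with the ground-truth box
    best_anchor = max(range(len(anchors)), key=lambda i: shape_iou((w, h), anchors[i]))
    return row, col, best_anchor
&lt;/code>&lt;/pre>&lt;/div>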
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">&lt;p>If&lt;/p>
&lt;ul>
&lt;li>we have pre-defined $B$ anchor boxes of different shapes&lt;/li>
&lt;li>the image is divided into an $n \times n$ grid of cells&lt;/li>
&lt;li>we want to detect $C$ classes of objects&lt;/li>
&lt;/ul>
&lt;p>Then the output volume will be
&lt;/p>
$$
(n \times n) \times B(5 + C)
$$
&lt;/span>
&lt;/div>
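&lt;p>For example, with a $3 \times 3$ grid ($n = 3$), $B = 2$ anchor boxes, and $C = 3$ classes, the output volume is $3 \times 3 \times 2 \times (5 + 3) = 3 \times 3 \times 16$.&lt;/p>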
&lt;h3 id="how-to-choose-the-anchor-boxes">How to choose the anchor boxes?&lt;/h3>
&lt;ul>
&lt;li>People used to choose them &lt;strong>by hand&lt;/strong>, picking maybe 5 or 10 anchor box shapes that span a variety of shapes and seem to cover the types of objects to detect&lt;/li>
&lt;li>A better way is to use a &lt;strong>K-means&lt;/strong> algorithm to cluster the object shapes that occur in the training data and use the cluster centers as anchor boxes (this is done in the later YOLO research papers); see the sketch after this list&lt;/li>
&lt;/ul>
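&lt;p>A minimal sketch of this clustering idea, assuming the ground-truth boxes are given as relative $(w, h)$ pairs and using $1 - \text{IoU}$ as the distance (as in the YOLO papers) instead of the Euclidean distance; the function names are placeholders:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">import numpy as np


def shape_iou(boxes_wh, anchor_wh):
    """IoU between each (w, h) box and one anchor shape, with all boxes centered at the origin."""
    inter = np.minimum(boxes_wh[:, 0], anchor_wh[0]) * np.minimum(boxes_wh[:, 1], anchor_wh[1])
    union = boxes_wh[:, 0] * boxes_wh[:, 1] + anchor_wh[0] * anchor_wh[1] - inter
    return inter / union


def kmeans_anchors(boxes_wh, k=5, iterations=100):
    """Cluster ground-truth box shapes into k anchor boxes using 1 - IoU as the distance."""
    boxes_wh = np.asarray(boxes_wh, dtype=float)
    anchors = boxes_wh[np.random.choice(len(boxes_wh), k, replace=False)]
    for _ in range(iterations):
        # distance of every box shape to every anchor shape
        dists = np.stack([1 - shape_iou(boxes_wh, a) for a in anchors], axis=1)
        assignment = dists.argmin(axis=1)
        # move each anchor to the median shape of its cluster
        for i in range(k):
            if np.any(assignment == i):
                anchors[i] = np.median(boxes_wh[assignment == i], axis=0)
    return anchors
&lt;/code>&lt;/pre>&lt;/div>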
&lt;h2 id="putting-them-all-together">Putting them all together&lt;/h2>
&lt;p>Suppose we&amp;rsquo;re trying to train a model to detect three classes of objects:&lt;/p>
&lt;ul>
&lt;li>pedestrians&lt;/li>
&lt;li>cars&lt;/li>
&lt;li>motorcycles&lt;/li>
&lt;/ul>
&lt;p>And the input image looks like this:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-05%2016.07.04.png" alt="截屏2020-11-05 16.07.04" style="zoom:80%;" />
&lt;p>Suppose we have pre-defined two anchor boxes of different shapes:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/anchor-box.png" alt="anchor-box">&lt;/p>
&lt;p>Anchor box 2 has a higher IoU with the ground-truth bounding box of the car, so the car is encoded in the second half of the target vector:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/YOLO.png" alt="YOLO" style="zoom:80%;" />
&lt;p>The final output volume is $3 \times 3 \times 2 \times 8$&lt;/p>
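&lt;p>To make the shape concrete, here is a small sketch (with hypothetical variable names and random numbers standing in for the real network output) of how a flat $3 \times 3 \times 16$ prediction can be viewed as the $3 \times 3 \times 2 \times 8$ volume and read back per grid cell and anchor box:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">import numpy as np

grid, num_anchors, num_values = 3, 2, 8   # 8 = P_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3

# stand-in for the network output (in practice this comes from the conv net)
raw_output = np.random.rand(grid, grid, num_anchors * num_values)

# view it as one 8-vector per (grid cell, anchor box) pair
volume = raw_output.reshape(grid, grid, num_anchors, num_values)

row, col, anchor = 1, 2, 0
p_c = volume[row, col, anchor, 0]        # objectness / confidence
box = volume[row, col, anchor, 1:5]      # b_x, b_y, b_h, b_w
probs = volume[row, col, anchor, 5:]     # class scores c_1, c_2, c_3
&lt;/code>&lt;/pre>&lt;/div>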
&lt;h3 id="making-predictions">Making predictions&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-11-05%2016.13.12.png" alt="截屏2020-11-05 16.13.12">&lt;/p>
&lt;h3 id="outputing-the-non-max-supressed-outputs">Outputing the non-max supressed outputs&lt;/h3>
&lt;p>Let&amp;rsquo;s look at a new input image,&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-05%2016.17.32-20201105162152696.png" alt="截屏2020-11-05 16.17.32" style="zoom:67%;" />
&lt;p>and suppose that we still use 2 pre-defined anchor boxes for detecting pedestrians, cars, and motorcycles.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>For each grid cell, get 2 predicted bounding boxes. Notice that some of the bounding boxes can go outside the height and width of the grid cell that they came from&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-05%2016.17.41.png" alt="截屏2020-11-05 16.17.41" style="zoom:67%;" />
&lt;/li>
&lt;li>
&lt;p>Get rid of low probability predictions&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-05%2016.18.48.png" alt="截屏2020-11-05 16.18.48" style="zoom: 67%;" />
&lt;/li>
&lt;li>
&lt;p>For each class, run non-max suppression to generate the final predictions (a combined sketch of steps 2 and 3 follows this list). The output should then hopefully contain all the cars and all the pedestrians in this image, each detected exactly once.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-11-05%2016.24.29-20201105162519547.png" alt="截屏2020-11-05 16.24.29" style="zoom:67%;" />
&lt;/li>
&lt;/ol>
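&lt;p>A minimal sketch of this filtering + per-class suppression step, assuming a &lt;code>non_max_suppression&lt;/code> helper such as the one sketched earlier and a hypothetical score threshold of 0.6:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">import numpy as np


def filter_and_suppress(boxes, scores, class_ids, score_threshold=0.6, iou_threshold=0.5):
    """Drop low-confidence boxes, then run NMS independently for each class."""
    confident = scores &amp;gt;= score_threshold          # step 2: get rid of low-probability predictions
    boxes, scores, class_ids = boxes[confident], scores[confident], class_ids[confident]

    final = []
    for c in np.unique(class_ids):                 # step 3: per-class non-max suppression
        idx = np.where(class_ids == c)[0]
        keep = non_max_suppression(boxes[idx], scores[idx], iou_threshold)  # helper from the NMS sketch above
        final.extend(int(idx[k]) for k in keep)
    return boxes[final], scores[final], class_ids[final]
&lt;/code>&lt;/pre>&lt;/div>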
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="https://www.coursera.org/learn/convolutional-neural-networks">Convolutional Neural Network, &lt;em>Andrew Ng&lt;/em>&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://www.jeremyjordan.me/object-detection-one-stage/">An overview of object detection: one-stage methods.&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul></description></item><item><title>YOLOv4: Run Pretrained YOLOv4 on COCO Dataset</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/train-yolo-v4/</link><pubDate>Wed, 04 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/train-yolo-v4/</guid><description>&lt;p>Here we will learn how to get YOLOv4 Object Detection running in the Cloud with Google Colab step by step.&lt;/p>
&lt;p>Check out the &lt;a href="https://colab.research.google.com/drive/1o-xfVm7A-kgtFZRrehJvnibuBwzNPs1-?authuser=1#scrollTo=P5WqSvgwqmLT">Google Colab Notebook&lt;/a>&lt;/p>
&lt;h2 id="clone-and-build-darknet">Clone and build DarkNet&lt;/h2>
&lt;p>Clone darknet from AlexeyAB&amp;rsquo;s &lt;a href="https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects">repository&lt;/a>,&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">!git clone https://github.com/AlexeyAB/darknet
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Adjust the Makefile to enable OPENCV and GPU for darknet&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># change makefile to have GPU and OPENCV enabled&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">%cd darknet
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">!sed -i &lt;span class="s1">&amp;#39;s/OPENCV=0/OPENCV=1/&amp;#39;&lt;/span> Makefile
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">!sed -i &lt;span class="s1">&amp;#39;s/GPU=0/GPU=1/&amp;#39;&lt;/span> Makefile
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">!sed -i &lt;span class="s1">&amp;#39;s/CUDNN=0/CUDNN=1/&amp;#39;&lt;/span> Makefile
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">!sed -i &lt;span class="s1">&amp;#39;s/CUDNN_HALF=0/CUDNN_HALF=1/&amp;#39;&lt;/span> Makefile
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Verify CUDA&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># verify CUDA&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">!/usr/local/cuda/bin/nvcc --version
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Build darknet&lt;/p>
&lt;blockquote>
&lt;p>Note: Do not worry about any warnings when running the &lt;code>!make&lt;/code> cell!&lt;/p>
&lt;/blockquote>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># make darknet &lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># (builds darknet so that you can then use the darknet executable file &lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># to run or train object detectors)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">!make
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="download-pretrained-yolo-v4-weights">Download pretrained YOLO v4 weights&lt;/h2>
&lt;p>YOLOv4 has already been trained on the COCO dataset, which has 80 classes that it can predict. We will grab these pretrained weights so that we can run YOLOv4 on these classes and get detections.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">!wget https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v3_optimal/yolov4.weights
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="define-helper-functions">Define helper functions&lt;/h2>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">cv2&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">matplotlib.pyplot&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">plt&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="o">%&lt;/span>&lt;span class="n">matplotlib&lt;/span> &lt;span class="n">inline&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">imShow&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> Show image
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">image&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">cv2&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imread&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">height&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">width&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">image&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">[:&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">resized_image&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">cv2&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">resize&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">3&lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="n">width&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">3&lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="n">height&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">interpolation&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">cv2&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">INTER_CUBIC&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">fig&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">gcf&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">fig&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_size_inches&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">18&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">10&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;off&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imshow&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">cv2&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">cvtColor&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">resized_image&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">cv2&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">COLOR_BGR2RGB&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">show&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">upload&lt;/span>&lt;span class="p">():&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> upload files to Google Colab
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="kn">from&lt;/span> &lt;span class="nn">google.colab&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">files&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">uploaded&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">files&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">upload&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">name&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">data&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">uploaded&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">items&lt;/span>&lt;span class="p">():&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">with&lt;/span> &lt;span class="nb">open&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;wb&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">f&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">write&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">data&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;saved file &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">name&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">download&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> Download from Google Colab
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="kn">from&lt;/span> &lt;span class="nn">google.colab&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">files&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">files&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">download&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="run-detections-with-darknet-and-yolov4">Run detections with Darknet and YOLOv4&lt;/h2>
&lt;p>The object detector can be run using the following command&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">!./darknet detector &lt;span class="nb">test&lt;/span> &amp;lt;path to .data file&amp;gt; &amp;lt;path to config&amp;gt; &amp;lt;path to weights&amp;gt; &amp;lt;path to image&amp;gt;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This will output the image with the detections shown. The most recent detections are always saved to &amp;lsquo;&lt;strong>predictions.jpg&lt;/strong>&amp;rsquo;&lt;/p>
&lt;p>&lt;strong>Note:&lt;/strong> After running detections OpenCV can&amp;rsquo;t open the image instantly in the cloud so we must run:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">imShow&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;predictions.jpg&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Darknet comes with a few images already installed in the &lt;code>darknet/data/&lt;/code> folder. Let&amp;rsquo;s test one of the images inside:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># run darknet detection on test images&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">!./darknet detector &lt;span class="nb">test&lt;/span> cfg/coco.data cfg/yolov4.cfg yolov4.weights data/person.jpg
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">imShow&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;predictions.jpg&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/predictions.png" alt="predictions">&lt;/p>
&lt;h3 id="run-detections-using-uploaded-image">Run detections using uploaded image&lt;/h3>
&lt;p>We can also mount Google Drive into the cloud VM:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">google.colab&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">drive&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">drive&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">mount&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;/content/gdrive&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># this creates a symbolic link &lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># so that now the path /content/gdrive/My\ Drive/ is equal to /mydrive&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">!ln -s /content/gdrive/My&lt;span class="se">\ &lt;/span>Drive/ /mydrive
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">!ls /mydrive
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>and then run YOLOv4 with images from Google Drive using the following command:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">!./darknet detector &lt;span class="nb">test&lt;/span> cfg/coco.data cfg/yolov4.cfg yolov4.weights /mydrive/&amp;lt;path to image&amp;gt;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>For example, I uploaded an image called &amp;ldquo;pedestrian.jpg&amp;rdquo; in &lt;code>images/&lt;/code> folder:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/pedestrian.jpg" alt="pedestrian">&lt;/p>
&lt;p>and run detection on it:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="err">!&lt;/span>&lt;span class="o">./&lt;/span>&lt;span class="n">darknet&lt;/span> &lt;span class="n">detector&lt;/span> &lt;span class="n">test&lt;/span> &lt;span class="n">cfg&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">coco&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">data&lt;/span> &lt;span class="n">cfg&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">yolov4&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">cfg&lt;/span> &lt;span class="n">yolov4&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">weights&lt;/span> &lt;span class="o">/&lt;/span>&lt;span class="n">mydrive&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">images&lt;/span>&lt;span class="o">/&lt;/span>&lt;span class="n">pedestrian&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">jpg&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">imShow&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;predictions.jpg&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/pedestrian_predictions.png" alt="pedestrian_predictions">&lt;/p>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>YOLOv4 in the CLOUD: Install and Run Object Detector (FREE GPU)&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="https://colab.research.google.com/drive/1_GdoqCJWXsChrOiY8sZMr_zbr_fH-0Fg?usp=sharing#scrollTo=iZULaGX7_H1u">Google Colab Notebook&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://github.com/theAIGuysCode/YOLOv4-Cloud-Tutorial">https://github.com/theAIGuysCode/YOLOv4-Cloud-Tutorial&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Video Tutorial&lt;/p>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/mKAEGSxwOAY?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>YOLOv4: Train on Custom Dataset</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/train-yolo-v4-custom-dataset/</link><pubDate>Wed, 04 Nov 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/train-yolo-v4-custom-dataset/</guid><description>&lt;h2 id="clone-and-build-darknet">Clone and build Darknet&lt;/h2>
&lt;p>Clone darknet repo&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">git clone https://github.com/AlexeyAB/darknet
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Change makefile to have GPU and OPENCV enabled&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="nb">cd&lt;/span> darknet
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">sed -i &lt;span class="s1">&amp;#39;s/OPENCV=0/OPENCV=1/&amp;#39;&lt;/span> Makefile
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">sed -i &lt;span class="s1">&amp;#39;s/GPU=0/GPU=1/&amp;#39;&lt;/span> Makefile
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">sed -i &lt;span class="s1">&amp;#39;s/CUDNN=0/CUDNN=1/&amp;#39;&lt;/span> Makefile
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">sed -i &lt;span class="s1">&amp;#39;s/CUDNN_HALF=0/CUDNN_HALF=1/&amp;#39;&lt;/span> Makefile
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Verify CUDA&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">/usr/local/cuda/bin/nvcc --version
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="compile-on-linux-using-make">Compile on Linux using &lt;code>make&lt;/code>&lt;/h2>
&lt;p>Make darknet&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">make
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>&lt;code>GPU=1&lt;/code> : build with CUDA to accelerate by using GPU&lt;/li>
&lt;li>&lt;code>CUDNN=1&lt;/code> : build with cuDNN v5-v7 to accelerate training by using GPU&lt;/li>
&lt;li>&lt;code>CUDNN_HALF=1&lt;/code> to build for Tensor Cores (on Titan V / Tesla V100 / DGX-2 and later) speedup Detection 3x, Training 2x&lt;/li>
&lt;li>&lt;code>OPENCV=1&lt;/code> to build with OpenCV 4.x/3.x/2.4.x - allows detection on video files and video streams from network cameras or web-cams&lt;/li>
&lt;li>&lt;code>DEBUG=1&lt;/code> to build a debug version of YOLO&lt;/li>
&lt;li>&lt;code>OPENMP=1&lt;/code> to build with OpenMP support to accelerate Yolo by using multi-core CPU&lt;/li>
&lt;/ul>
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">Do not worry about any warnings when running &lt;code>make&lt;/code> command.&lt;/span>
&lt;/div>
&lt;h2 id="prepare-custom-dataset">Prepare custom dataset&lt;/h2>
&lt;p>The custom dataset should be in &lt;strong>YOLOv4&lt;/strong> or &lt;strong>darknet&lt;/strong> format:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>For each &lt;code>.jpg&lt;/code> image file, there should be a corresponding &lt;code>.txt&lt;/code> file&lt;/p>
&lt;ul>
&lt;li>
&lt;p>In the same directory, with the same name, but with &lt;code>.txt&lt;/code>-extension&lt;/p>
&lt;p>For example, if there&amp;rsquo;s a &lt;code>.jpg&lt;/code> image named &lt;code>BloodImage_00001.jpg&lt;/code>, there should also be a corresponding &lt;code>.txt&lt;/code> file named &lt;code>BloodImage_00001.txt&lt;/code>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>This &lt;code>.txt&lt;/code> file contains the object class number and the object coordinates, one line per object in the image (a small conversion sketch follows this list).&lt;/p>
&lt;p>Format:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">&amp;lt;object-class&amp;gt; &amp;lt;x_center&amp;gt; &amp;lt;y_center&amp;gt; &amp;lt;width&amp;gt; &amp;lt;height&amp;gt;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>&lt;code>&amp;lt;object-class&amp;gt;&lt;/code> : integer object number from &lt;code>0&lt;/code> to &lt;code>(classes-1)&lt;/code>&lt;/li>
&lt;li>&lt;code>&amp;lt;x_center&amp;gt; &amp;lt;y_center&amp;gt; &amp;lt;width&amp;gt; &amp;lt;height&amp;gt;&lt;/code> : float values &lt;strong>relative&lt;/strong> to the width and height of the image, in the range &lt;code>(0.0, 1.0]&lt;/code>
&lt;ul>
&lt;li>&lt;code>&amp;lt;x_center&amp;gt; &amp;lt;y_center&amp;gt;&lt;/code> are the center of the rectangle (not the top-left corner)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
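&lt;p>For illustration, a short sketch that builds one such label line from an absolute pixel box. The input box format (top-left corner plus width and height in pixels) and the function name are assumptions about your raw annotations, not part of the darknet format itself:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">def to_darknet_line(class_id, x_left, y_top, box_w, box_h, img_w, img_h):
    """Build one 'class x_center y_center width height' label line (all values relative)."""
    x_center = (x_left + box_w / 2) / img_w
    y_center = (y_top + box_h / 2) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {box_w / img_w:.6f} {box_h / img_h:.6f}"


# e.g. a class-0 object at (320, 180) with size 128x96 pixels in a 1280x720 image
print(to_darknet_line(0, 320, 180, 128, 96, 1280, 720))
# 0 0.300000 0.316667 0.100000 0.133333
&lt;/code>&lt;/pre>&lt;/div>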
&lt;h2 id="configure-files-for-training">Configure files for training&lt;/h2>
&lt;ol start="0">
&lt;li>
&lt;p>For training &lt;code>cfg/yolov4-custom.cfg&lt;/code> download the pre-trained weights-file &lt;a href="https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v3_optimal/yolov4.conv.137">yolov4.conv.137&lt;/a>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="nb">cd&lt;/span> darknet
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">wget https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v3_optimal/yolov4.conv.137
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>
&lt;p>In the folder &lt;code>./cfg&lt;/code>, create a custom config file (let&amp;rsquo;s call it &lt;code>custom-yolov4-detector.cfg&lt;/code>) with the same content as &lt;code>yolov4-custom.cfg&lt;/code>, and&lt;/p>
&lt;ul>
&lt;li>
&lt;p>change line &lt;strong>batch&lt;/strong> to &lt;a href="https://github.com/AlexeyAB/darknet/blob/0039fd26786ab5f71d5af725fc18b3f521e7acfd/cfg/yolov3.cfg#L3">&lt;code>batch=64&lt;/code>&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>change line &lt;strong>subdivisions&lt;/strong> to &lt;a href="https://github.com/AlexeyAB/darknet/blob/0039fd26786ab5f71d5af725fc18b3f521e7acfd/cfg/yolov3.cfg#L4">&lt;code>subdivisions=16&lt;/code>&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>change line &lt;strong>max_batches&lt;/strong> to &lt;code>classes*2000&lt;/code> but&lt;/p>
&lt;ul>
&lt;li>NOT less than number of training images&lt;/li>
&lt;li>NOT less than 6000&lt;/li>
&lt;/ul>
&lt;p>&lt;em>e.g. &lt;code>max_batches=6000&lt;/code> if you train for 3 classes (a small sketch that computes these numbers appears after this numbered list)&lt;/em>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>change line &lt;strong>steps&lt;/strong> to 80% and 90% of &lt;strong>max_batches&lt;/strong> (&lt;em>e.g. &lt;code>steps=4800, 5400&lt;/code>&lt;/em>)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>set network size &lt;code>width=416 height=416&lt;/code> or any value multiple of 32&lt;/p>
&lt;/li>
&lt;li>
&lt;p>change line &lt;code>classes=80&lt;/code> to your number of object classes in &lt;strong>each&lt;/strong> of the 3 &lt;code>[yolo]&lt;/code> layers&lt;/p>
&lt;/li>
&lt;li>
&lt;p>change [&lt;code>filters=255&lt;/code>] to $ \text{filters}=(\text{classes} + 5) \times 3$ in the 3 &lt;code>[convolutional]&lt;/code> before each &lt;code>[yolo]&lt;/code> layer, keep in mind that it only has to be the last &lt;code>[convolutional]&lt;/code> before each of the &lt;code>[yolo]&lt;/code> layers.&lt;/p>
&lt;blockquote>
&lt;p>Note: &lt;strong>Do not write in the cfg-file: &lt;code>filters=(classes + 5) x 3&lt;/code>&lt;/strong>!!!&lt;/p>
&lt;p>It has to be the specific number!&lt;/p>
&lt;p>E.g. &lt;code>classes=1&lt;/code> then should be &lt;code>filters=18&lt;/code>; &lt;code>classes=2&lt;/code> then should be &lt;code>filters=21&lt;/code>&lt;/p>
&lt;p>So for example, for 2 objects, your custom config file should differ from &lt;code>yolov4-custom.cfg&lt;/code> in such lines in &lt;strong>each&lt;/strong> of &lt;strong>3&lt;/strong> [yolo]-layers:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">[convolutional]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">filters=21
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">[region]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">classes=2
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/blockquote>
&lt;/li>
&lt;li>
&lt;p>when using &lt;a href="https://github.com/AlexeyAB/darknet/blob/6e5bdf1282ad6b06ed0e962c3f5be67cf63d96dc/cfg/Gaussian_yolov3_BDD.cfg#L608">&lt;code>[Gaussian_yolo]&lt;/code>&lt;/a> layers, change [&lt;code>filters=57&lt;/code>] to $ \text{filters}=(\text{classes} + 9) \times 3$ in the 3 &lt;code>[convolutional]&lt;/code> before each &lt;code>[Gaussian_yolo]&lt;/code> layer&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Create file &lt;code>obj.names&lt;/code> in the directory &lt;code>data/&lt;/code>, with the object names - each on a new line&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Create file &lt;code>obj.data&lt;/code> in the directory &lt;code>data/&lt;/code>, containing (where &lt;strong>classes = number of object classes&lt;/strong>):&lt;/p>
&lt;p>For example, if we have two object classes:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">classes = 2
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">train = data/train.txt
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">valid = data/test.txt
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">names = data/obj.names
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">backup = backup/
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>
&lt;p>Put image files (&lt;code>.jpg&lt;/code>) of your objects in the directory &lt;code>data/obj/&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Create &lt;code>train.txt&lt;/code> in the directory &lt;code>data/&lt;/code> with the filenames of your images, each filename on a new line, with paths relative to &lt;code>darknet&lt;/code>.&lt;/p>
&lt;p>For example containing:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">data/obj/img1.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">data/obj/img2.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">data/obj/img3.jpg
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>
&lt;p>Download pre-trained weights for the convolutional layers and put to the directory &lt;code>darknet&lt;/code> (root directory of the project)&lt;/p>
&lt;ul>
&lt;li>for &lt;code>yolov4.cfg&lt;/code>, &lt;code>yolov4-custom.cfg&lt;/code> (162 MB): &lt;a href="https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v3_optimal/yolov4.conv.137">yolov4.conv.137&lt;/a>&lt;/li>
&lt;li>for &lt;code>yolov4-tiny.cfg&lt;/code>, &lt;code>yolov4-tiny-3l.cfg&lt;/code>, &lt;code>yolov4-tiny-custom.cfg&lt;/code>(19 MB): &lt;a href="https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v4_pre/yolov4-tiny.conv.29">yolov4-tiny.conv.29&lt;/a>&lt;/li>
&lt;li>for &lt;code>csresnext50-panet-spp.cfg&lt;/code> (133 MB): &lt;a href="https://drive.google.com/file/d/16yMYCLQTY_oDlCIZPfn_sab6KD3zgzGq/view?usp=sharing">csresnext50-panet-spp.conv.112&lt;/a>&lt;/li>
&lt;li>for &lt;code>yolov3.cfg, yolov3-spp.cfg&lt;/code> (154 MB): &lt;a href="https://pjreddie.com/media/files/darknet53.conv.74">darknet53.conv.74&lt;/a>&lt;/li>
&lt;li>for &lt;code>yolov3-tiny-prn.cfg , yolov3-tiny.cfg&lt;/code> (6 MB): &lt;a href="https://drive.google.com/file/d/18v36esoXCh-PsOKwyP2GWrpYDptDY8Zf/view?usp=sharing">yolov3-tiny.conv.11&lt;/a>&lt;/li>
&lt;li>for &lt;code>enet-coco.cfg (EfficientNetB0-Yolov3)&lt;/code> (14 MB): &lt;a href="https://drive.google.com/file/d/1uhh3D6RSn0ekgmsaTcl-ZW53WBaUDo6j/view?usp=sharing">enetb0-coco.conv.132&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
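&lt;p>The arithmetic above (step 1) is easy to get wrong by hand, so here is a small sketch that computes the numbers for a custom config; the variable names and example values are placeholders:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">num_classes = 3          # e.g. 3 object classes
num_train_images = 1500  # size of your training set

max_batches = max(num_classes * 2000, num_train_images, 6000)
steps = (int(0.8 * max_batches), int(0.9 * max_batches))
filters = (num_classes + 5) * 3   # for the last [convolutional] before each [yolo] layer

print(f"max_batches={max_batches}")    # max_batches=6000
print(f"steps={steps[0]},{steps[1]}")  # steps=4800,5400
print(f"filters={filters}")            # filters=24
&lt;/code>&lt;/pre>&lt;/div>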
&lt;h2 id="start-training">Start training&lt;/h2>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">./darknet detector train data/obj.data custom-yolov4-detector.cfg yolov4.conv.137 -dont_show
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>
&lt;p>the file &lt;code>yolo-obj_last.weights&lt;/code> will be saved to the &lt;code>backup/&lt;/code> directory every 100 iterations&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>-dont_show&lt;/code>: disables the loss window; useful if you train on a computer without a monitor (e.g. a remote server)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>To see the mAP &amp;amp; loss chart during training on a remote server:&lt;/p>
&lt;ul>
&lt;li>use command &lt;code>./darknet detector train data/obj.data yolo-obj.cfg yolov4.conv.137 -dont_show -mjpeg_port 8090 -map&lt;/code>&lt;/li>
&lt;li>then open the URL &lt;code>http://ip-address:8090&lt;/code> in a Chrome/Firefox browser&lt;/li>
&lt;/ul>
&lt;p>After training is complete, you can get weights from &lt;code>backup/&lt;/code>&lt;/p>
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">&lt;p>If you want the training to output only main information (e.g loss, mAP, remaining training time) instead of full logging, you can use this command&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">./darknet detector train data/obj.data custom-yolov4-detector.cfg yolov4.conv.137 -dont_show -map 2&amp;gt;&lt;span class="p">&amp;amp;&lt;/span>&lt;span class="m">1&lt;/span> &lt;span class="p">|&lt;/span> tee log/train.log &lt;span class="p">|&lt;/span> grep -E &lt;span class="s2">&amp;#34;hours left|mean_average&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Then the output will look like the following:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl"> 1189: 1.874030, 2.934438 avg loss, 0.002610 rate, 2.930427 seconds, 76096 images, 3.905244 hours left
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/span>
&lt;/div>
&lt;h3 id="notes">Notes&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>If during training you see &lt;code>nan&lt;/code> values in the &lt;code>avg&lt;/code> (loss) field, then training is going wrong! 🤦‍♂️&lt;/p>
&lt;p>But if &lt;code>nan&lt;/code> appears in some other lines, training is going well.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If an &lt;code>Out of memory&lt;/code> error occurs, increase &lt;code>subdivisions&lt;/code> in the &lt;code>.cfg&lt;/code> file to 16, 32 or 64&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="train-tiny-yolo">Train tiny-YOLO&lt;/h2>
&lt;p>Do all the same steps as for the full YOLO model described above, with the following exceptions:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Download file with the first 29-convolutional layers of yolov4-tiny:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">wget https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v4_pre/yolov4-tiny.conv.29
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>(Or get this file from yolov4-tiny.weights file by using command: &lt;code>./darknet partial cfg/yolov4-tiny-custom.cfg yolov4-tiny.weights yolov4-tiny.conv.29 29&lt;/code>)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Make your custom model &lt;code>yolov4-tiny-obj.cfg&lt;/code> based on &lt;code>cfg/yolov4-tiny-custom.cfg&lt;/code> instead of &lt;code>yolov4.cfg&lt;/code>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">re&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># num_classes: number of object classes&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">max_batches&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">max&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">num_classes&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="mi">2000&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">num_train_images&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">6000&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">steps1&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mf">.8&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">max_batches&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">steps2&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mf">.9&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">max_batches&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">num_filters&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">num_classes&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="mi">5&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="mi">3&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Assuming that we have already defined the following hyperparameters:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># - TINY_CONFIG_FILE: config file we&amp;#39;re gonna use for training&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># - WIDTH, HEIGHT: width and height of image&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">with&lt;/span> &lt;span class="nb">open&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;cfg/yolov4-tiny-custom.cfg&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;r&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">reader&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nb">open&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">TINY_CONFIG_FILE&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;w&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">writer&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">content&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">reader&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">read&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">content&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">re&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sub&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;subdivisions=\d*&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;subdivisions=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">SUBDIVISION&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">content&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">content&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">re&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sub&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;width=\d*&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;width=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">WIDTH&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">content&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">content&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">re&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sub&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;height=\d*&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;height=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">HEIGHT&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">content&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">content&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">re&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sub&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;max_batches = \d*&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;max_batches = &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">max_batches&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">content&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">content&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">re&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sub&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;steps=\d*,\d*&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;steps=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">steps1&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">,&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">steps2&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">content&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">content&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">re&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sub&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;classes=\d*&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;classes=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">num_classes&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">content&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">content&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">re&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sub&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;pad=1&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="s2">filters=\d*&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;pad=1&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="s2">filters=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">num_filters&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">content&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">writer&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">write&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">content&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>
&lt;p>Start training:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">./darknet detector train data/obj.data yolov4-tiny-obj.cfg yolov4-tiny.conv.29
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;/ul>
&lt;h2 id="google-colab-notebook">Google Colab Notebook&lt;/h2>
&lt;p>&lt;a href="https://colab.research.google.com/drive/1aIc5xS8vVukVg-FiUA3aw0PUqYrXs8aO?authuser=1#scrollTo=Zz8v67_2kgWh">Colab Notebook&lt;/a>&lt;/p>
&lt;h3 id="small-hacks-to-keep-colab-notebook-training">Small hacks to keep colab notebook training&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>Open up the inspector view on Chrome&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Switch to the console window&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Paste the following code&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-javascript" data-lang="javascript">&lt;span class="line">&lt;span class="cl">&lt;span class="kd">function&lt;/span> &lt;span class="nx">ClickConnect&lt;/span>&lt;span class="p">(){&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nx">console&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="nx">log&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Working&amp;#34;&lt;/span>&lt;span class="p">);&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">document&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">.&lt;/span>&lt;span class="nx">querySelector&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;#top-toolbar &amp;gt; colab-connect-button&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">.&lt;/span>&lt;span class="nx">shadowRoot&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="nx">querySelector&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;#connect&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">.&lt;/span>&lt;span class="nx">click&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nx">setInterval&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nx">ClickConnect&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">60000&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>and hit &lt;strong>Enter&lt;/strong>.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>It will click the connect button every 60 seconds (the &lt;code>setInterval&lt;/code> value above) so that you don&amp;rsquo;t get kicked off for being idle!&lt;/p>
&lt;h2 id="convert-yolov4-to-tensorrt-through-onnx">Convert YOLOv4 to TensorRT through ONNX&lt;/h2>
&lt;p>To convert YOLOv4 to TensorRT engine through ONNX, I used the code from &lt;a href="https://github.com/jkjung-avt/tensorrt_demos">TensorRT_demos&lt;/a> following its &lt;a href="https://github.com/jkjung-avt/tensorrt_demos#demo-5-yolov4">step-by-step instructions&lt;/a>. For more details about the code, check out this &lt;a href="https://jkjung-avt.github.io/tensorrt-yolov4/">blog post&lt;/a>.&lt;/p>
&lt;p>Note that the Code in this repo was designed to run on &lt;a href="https://developer.nvidia.com/embedded-computing">Jetson platforms&lt;/a>. In my case, conversion from YOLOv4 to TensorRT engine was conducted on Jetson Nano.&lt;/p>
&lt;h3 id="convert-yolov4-for-custom-trained-models">Convert YOLOv4 for custom trained models&lt;/h3>
&lt;p>To apply the conversion for custom trained models, see &lt;a href="https://jkjung-avt.github.io/trt-yolov3-custom/">TensorRT YOLOv3 For Custom Trained Models&lt;/a>. You need to stick to the naming convention &lt;code>{yolo_version}-{custom_name}-{image_size}&lt;/code>. Otherwise you&amp;rsquo;ll get errors during conversion.&lt;/p>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Guide from &lt;a href="https://github.com/AlexeyAB">AlexeyAB&lt;/a>/&lt;strong>&lt;a href="https://github.com/AlexeyAB/darknet">darknet&lt;/a>&lt;/strong> repo: &lt;a href="https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects">How to train (to detect your custom objects)&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Tutorials&lt;/p>
&lt;ul>
&lt;li>
&lt;p>👨‍🏫 How to Train YOLOv4 on a Custom Dataset in Darknet&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="https://colab.research.google.com/drive/1mzL6WyY9BRx4xX476eQdhKDnd_eixBlG?authuser=0#scrollTo=QyMBDkaL-Aep">Colab Notebook&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Blog post: &lt;a href="https://blog.roboflow.com/training-yolov4-on-a-custom-dataset/">https://blog.roboflow.com/training-yolov4-on-a-custom-dataset/&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Video tutorial:&lt;/p>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/N-GS8cmDPog?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://blog.roboflow.com/yolov4-tactics/">YOLOv4 - Ten Tactics to Build a Better Model&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Train YOLOv4-tiny on custom dataset: &lt;a href="https://blog.roboflow.com/train-yolov4-tiny-on-custom-data-lighting-fast-detection/">Train YOLOv4-tiny on Custom Data - Lightning Fast Object Detection&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>YOLOv4 in the CLOUD: Build and Train Custom Object Detector (FREE GPU)&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="https://colab.research.google.com/drive/1_GdoqCJWXsChrOiY8sZMr_zbr_fH-0Fg#scrollTo=O2w9w1Ye_nk1">Colab Notebook&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Video tutorial:&lt;/p>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/mmj3nxGT2YQ?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://jkjung-avt.github.io/colab-yolov4/">Custom YOLOv4 Model on Google Colab&lt;/a>&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://colab.research.google.com/drive/1eoa2_v6wVlcJiDBh3Tb_umhm7a09lpIE?usp=sharing#scrollTo=J1oTF_YRoGSZ">Colab Notebook&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://jkjung-avt.github.io/tensorrt-yolov4/">TensorRT YOLOv4&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://jkjung-avt.github.io/yolov4/">YOLOv4 on Jetson Nano&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Annotation Conversion: COCO JSON to YOLO Txt</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/coco-json-to-yolo-txt/</link><pubDate>Wed, 02 Dec 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/coco-json-to-yolo-txt/</guid><description>&lt;h2 id="bounding-box-formats-comparison-and-conversion">Bounding box formats comparison and conversion&lt;/h2>
&lt;p>In COCO Json, the format of bounding box is:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-json" data-lang="json">&lt;span class="line">&lt;span class="cl">&lt;span class="s2">&amp;#34;bbox&amp;#34;&lt;/span>&lt;span class="err">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="err">&amp;lt;absolute_x_top_left&amp;gt;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="err">&amp;lt;absolute_y_top_left&amp;gt;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="err">&amp;lt;absolute_width&amp;gt;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="err">&amp;lt;absolute_height&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>However, the annotation is different in YOLO. For each &lt;code>.jpg&lt;/code> image, there&amp;rsquo;s a &lt;code>.txt&lt;/code> file (in the same directory and with the same name, but with &lt;code>.txt&lt;/code>-extension). This &lt;code>.txt&lt;/code> file holds the objects and their bounding boxes in this image (one line for each object), in the following format &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">&amp;lt;object-class&amp;gt; &amp;lt;relative_x_center&amp;gt; &amp;lt;relative_y_center&amp;gt; &amp;lt;relative_width&amp;gt; &amp;lt;relative_height&amp;gt;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>
&lt;p>&lt;code>&amp;lt;object-class&amp;gt;&lt;/code> : integer number of object from &lt;strong>&lt;code>0&lt;/code> to &lt;code>(classes-1)&lt;/code>&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>&amp;lt;relative_x_center&amp;gt; &amp;lt;relative_y_center&amp;gt; &amp;lt;relative_width&amp;gt; &amp;lt;relative_height&amp;gt;&lt;/code>&lt;/p>
&lt;p>float values relative to width and height of image (equal from (0.0 to 1.0])&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>For example, for &lt;code>img1.jpg&lt;/code> there should be &lt;code>img1.txt&lt;/code> containing something looks like followings:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">1 0.716797 0.395833 0.216406 0.147222
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">0 0.687109 0.379167 0.255469 0.158333
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">2 0.420312 0.395833 0.140625 0.166667
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The following figure illustrates the difference of bounding box annotation between COCO and YOLO:&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/annotation-convertion-COCO-and-YOLO.png">&lt;figcaption>
&lt;h4>Bounding box format: COCO vs YOLO&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>Convert the bounding box annotation format from COCO to YOLO:
&lt;/p>
$$
\begin{array}{ll}
x\_{yolo} &amp;= (x\_{coco} + \frac{w\_{coco}}{2}) / w\_{img} \\\\
y\_{yolo} &amp;= (y\_{coco} + \frac{h\_{coco}}{2}) / h\_{img} \\\\
w\_{yolo} &amp;= w\_{coco} / w\_{img} \\\\
h\_{yolo} &amp;= h\_{coco} / h\_{img}
\end{array}
$$
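&lt;p&gt;For example, for the 490 x 275 image &lt;code&gt;0001.jpg&lt;/code&gt; used later on this page, the COCO box &lt;code&gt;[45, 2, 85, 85]&lt;/code&gt; gives &lt;code&gt;x_yolo = (45 + 85/2) / 490 ≈ 0.178571&lt;/code&gt;, &lt;code&gt;y_yolo = (2 + 85/2) / 275 ≈ 0.161818&lt;/code&gt;, &lt;code&gt;w_yolo = 85 / 490 ≈ 0.173469&lt;/code&gt; and &lt;code&gt;h_yolo = 85 / 275 ≈ 0.309091&lt;/code&gt; - exactly the values in the first line of &lt;code&gt;0001.txt&lt;/code&gt; shown below.&lt;/p&gt;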
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">convert_bbox_coco2yolo&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">img_width&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">img_height&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">bbox&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> Convert bounding box from COCO format to YOLO format
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> Parameters
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> ----------
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> img_width : int
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> width of image
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> img_height : int
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> height of image
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> bbox : list[int]
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> bounding box annotation in COCO format:
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> [top left x position, top left y position, width, height]
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> Returns
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> -------
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> list[float]
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> bounding box annotation in YOLO format:
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> [x_center_rel, y_center_rel, width_rel, height_rel]
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># YOLO bounding box format: [x_center, y_center, width, height]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># (float values relative to width and height of image)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x_tl&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y_tl&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">w&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">h&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">bbox&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dw&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mf">1.0&lt;/span> &lt;span class="o">/&lt;/span> &lt;span class="n">img_width&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dh&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mf">1.0&lt;/span> &lt;span class="o">/&lt;/span> &lt;span class="n">img_height&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x_center&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">x_tl&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">w&lt;/span> &lt;span class="o">/&lt;/span> &lt;span class="mf">2.0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">y_center&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">y_tl&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">h&lt;/span> &lt;span class="o">/&lt;/span> &lt;span class="mf">2.0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">x_center&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">dw&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">y&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">y_center&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">dh&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">w&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">w&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">dw&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">h&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">h&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">dh&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">w&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">h&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="convert-coco-json-to-yolo-txt">Convert COCO JSON to YOLO txt&lt;/h2>
&lt;p>The structure of training set in COCO format is:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">- train
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- _annotations.coco.json
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- img_001.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- img_002.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- img_003.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ...
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>_annotations.coco.json&lt;/code> contains all information about the dataset, images, and annotations. (More see: &lt;a href="https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/coco-dataset-format/">COCO JSON Format for Object Detection&lt;/a>)&lt;/p>
&lt;p>The structure of training set in YOLO format is:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">- train
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- _darknet.labels
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- img_001.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- img_001.txt
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- img_002.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- img_002.txt
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- img_003.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- img_003.txt
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ...
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>&lt;code>_darknet.labels&lt;/code> contains objects names, each in new line&lt;/li>
&lt;li>For each &lt;code>.jpg&lt;/code> image there&amp;rsquo;s a corresponding &lt;code>.txt&lt;/code> file with the same name&lt;/li>
&lt;/ul>
&lt;p>Now we create &lt;code>.txt&lt;/code> file for each image based on &lt;code>_annotations.coco.json&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">os&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">json&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">tqdm&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">tqdm&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">shutil&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">make_folders&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;output&amp;#34;&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">exists&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">shutil&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">rmtree&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">makedirs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">path&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">convert_coco_json_to_yolo_txt&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">output_path&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">json_file&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">path&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">make_folders&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">output_path&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">with&lt;/span> &lt;span class="nb">open&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">json_file&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">json_data&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">json&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">load&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">f&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># write _darknet.labels, which holds names of all classes (one class per line)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">label_file&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">output_path&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;_darknet.labels&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">with&lt;/span> &lt;span class="nb">open&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">label_file&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;w&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">category&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">tqdm&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">json_data&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;categories&amp;#34;&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">desc&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;Categories&amp;#34;&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">category_name&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">category&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">f&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">write&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">category_name&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">image&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">tqdm&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">json_data&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;images&amp;#34;&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">desc&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;Annotation txt for each iamge&amp;#34;&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">img_id&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">image&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">img_name&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">image&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;file_name&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">img_width&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">image&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;width&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">img_height&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">image&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;height&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">anno_in_image&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="n">anno&lt;/span> &lt;span class="k">for&lt;/span> &lt;span class="n">anno&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">json_data&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;annotations&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="k">if&lt;/span> &lt;span class="n">anno&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;image_id&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="n">img_id&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">anno_txt&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">output_path&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">img_name&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">split&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;.&amp;#34;&lt;/span>&lt;span class="p">)[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="s2">&amp;#34;.txt&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">with&lt;/span> &lt;span class="nb">open&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">anno_txt&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;w&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">f&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">anno&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">anno_in_image&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">category&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">anno&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;category_id&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">bbox_COCO&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">anno&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;bbox&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">w&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">h&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">convert_bbox_coco2yolo&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">img_width&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">img_height&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">bbox_COCO&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">f&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">write&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">category&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">.6f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">.6f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">w&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">.6f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">h&lt;/span>&lt;span class="si">:&lt;/span>&lt;span class="s2">.6f&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Converting COCO Json to YOLO txt finished!&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="example">Example&lt;/h3>
&lt;p>Assuming we have a COCO Json file &lt;code>_annotations.coco.json&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-json" data-lang="json">&lt;span class="line">&lt;span class="cl">&lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;info&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;year&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;2020&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;version&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;1&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;description&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Exported from roboflow.ai&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;contributor&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Roboflow&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;url&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;https://app.roboflow.ai/datasets/hard-hat-sample/1&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;date_created&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;2000-01-01T00:00:00+00:00&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;licenses&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;url&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;https://creativecommons.org/publicdomain/zero/1.0/&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Public Domain&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;categories&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Workers&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;supercategory&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;none&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;head&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;supercategory&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Workers&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;helmet&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;supercategory&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Workers&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">3&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;person&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;supercategory&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Workers&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;images&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;license&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;file_name&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;0001.jpg&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;height&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">275&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;width&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">490&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;date_captured&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;2020-07-20T19:39:26+00:00&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;annotations&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;image_id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;category_id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;bbox&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">45&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">85&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">85&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;area&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">7225&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;segmentation&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;iscrowd&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">},&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;image_id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;category_id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;bbox&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">324&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">29&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">72&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="mi">81&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;area&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">5832&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;segmentation&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;iscrowd&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="mi">0&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">convert_coco_json_to_yolo_txt&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;output&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;_annotations.coco.json&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">Categories: 100%|██████████| 4/4 [00:00&amp;lt;00:00, 2471.24it/s]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Annotation txt for each iamge: 100%|██████████| 1/1 [00:00&amp;lt;00:00, 1800.13it/s]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Converting COCO Json to YOLO txt finished!
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>An folder named &lt;code>output&lt;/code> is created and has the structure:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">- output
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- 0001.txt
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- _darknet.labels
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Content of &lt;code>_darknet.labels&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">Workers
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">head
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">helmet
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">person
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Content of &lt;code>0001.txt&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">2 0.178571 0.161818 0.173469 0.309091
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">2 0.734694 0.252727 0.146939 0.294545
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Instruction from YOLO v4 repo: &lt;a href="https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects">https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://github.com/AlexeyAB/Yolo_mark/issues/60#issuecomment-401854885">Specific format of annotation&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://www.cnblogs.com/hejunlin1992/p/9925293.html">darknet训练yolov3时的一些注意事项&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://manivannan-ai.medium.com/how-to-train-yolov2-to-detect-custom-objects-9010df784f36">How to train YOLOv2 to detect custom objects&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://roboflow.com/formats">Computer Vision Annotation Formats&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>Reference: &lt;a href="https://github.com/AlexeyAB/Yolo_mark/issues/60">https://github.com/AlexeyAB/Yolo_mark/issues/60&lt;/a>&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>YOLOv4: Training Tips</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/yolov4-training-tips/</link><pubDate>Sat, 19 Dec 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/yolov4-training-tips/</guid><description>&lt;h2 id="model-zoo">Model zoo&lt;/h2>
&lt;p>&lt;a href="https://github.com/AlexeyAB/darknet/wiki/YOLOv4-model-zoo#yolov4-model-zoo">YOLOv4 model zoo&lt;/a>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Pretrained models&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Proper configuration based on GPU&lt;/p>
&lt;blockquote>
&lt;p>We do NOT suggest you train the model with subdivisions equal or larger than 32, it will takes very long training time.&lt;/p>
&lt;/blockquote>
&lt;/li>
&lt;/ul>
&lt;h2 id="faq">FAQ&lt;/h2>
&lt;h3 id="low-accuracy-1">Low accuracy &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h3>
&lt;h4 id="the-most-common-problem---you-do-not-follow-strictly-the-manual">The most common problem - you do NOT follow strictly the manual.&lt;/h4>
&lt;ul>
&lt;li>You must use
&lt;ul>
&lt;li>&lt;code>default anchors&lt;/code>&lt;/li>
&lt;li>&lt;code>learning_rate=0.001&lt;/code>&lt;/li>
&lt;li>&lt;code>batch=64&lt;/code>&lt;/li>
&lt;li>&lt;code>max_batches = max(6000, number_of_training_images, 2000*classes)&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>You can only change &lt;code>subdivisions&lt;/code>&lt;/li>
&lt;li>&lt;strong>Do not do anything that is not written in the manual.&lt;/strong> 🙅‍♂️&lt;/li>
&lt;/ul>
&lt;h4 id="your-datasets-are-wrong">Your datasets are wrong.&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>check the AP50 (average precision) for validation and training dataset by using &lt;code>./darknet detector map obj.data yolo.cfg yolo.weights&lt;/code>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>If you get high mAP for both Training and Validation datasets, but the network detects objects poorly in real life, then your training dataset is not representative &amp;ndash;&amp;gt; &lt;strong>add more images from real life to it&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If you get high mAP for Training dataset, but low for Validation dataset, then your Training dataset isn&amp;rsquo;t suitable for Validation dataset.&lt;/p>
&lt;p>For example&lt;/p>
&lt;ul>
&lt;li>Training dataset contains: cars (rear view) from distance 100m&lt;/li>
&lt;li>Test dataset contains: cars (side view) from distance 5m&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>if you get low mAP for both Training and Validation datasets, then labels in your Training dataset are wrong&lt;/p>
&lt;ul>
&lt;li>Run training with flag &lt;code>-show_imgs&lt;/code>, i.e. &lt;code>./darknet detector train ... -show_imgs&lt;/code> , do you see correct bounded boxes?&lt;/li>
&lt;li>Or check your dataset by using &lt;a href="https://github.com/AlexeyAB/Yolo_mark">Yolo_mark&lt;/a> tool&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="darknet-trainingdetection-crashes-with-an-error-2">Darknet training/detection crashes with an error &lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/h3>
&lt;ul>
&lt;li>If &lt;code>CUDA Out of memory&lt;/code> error occurs, then increase &lt;code>subdivisions=&lt;/code> 2 times in cfg-file, but not higher than &lt;code>batch=&lt;/code> (don&amp;rsquo;t change batch)!
&lt;ul>
&lt;li>If it doesn&amp;rsquo;t help - set &lt;code>random=0&lt;/code> and &lt;code>width=416 height=416&lt;/code> in cfg-file.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Check content of files &lt;code>bad.list&lt;/code> and &lt;code>bad_label.list&lt;/code> if they exist near with &lt;code>./darknet&lt;/code> executable file.&lt;/li>
&lt;li>Do not move some files from Darknet folder - you may forget the necessary files.&lt;/li>
&lt;li>Download libraries CUDA, cuDNN, OpenCV, &amp;hellip; only from official sources. Don&amp;rsquo;t download libs from other sites.&lt;/li>
&lt;li>Make sure that you do everything in accordance with the manual, and do not do anything that is not written in the manual.&lt;/li>
&lt;/ul>
&lt;h2 id="train-with-multiple-gpus-3">Train with multiple GPUs &lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>Train it first on 1 GPU for like 1000 iterations:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">./darknet detector train cfg/coco.data cfg/yolov4.cfg yolov4.conv.137
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>
&lt;p>Then stop and by using partially-trained model &lt;code>/backup/yolov4_1000.weights&lt;/code>. Run training with multigpu (up to 4 GPUs): &lt;code>./darknet detector train cfg/coco.data cfg/yolov4.cfg /backup/yolov4_1000.weights -gpus 0,1,2,3&lt;/code>&lt;/p>
&lt;blockquote>
&lt;p>If you get a Nan, then for some datasets better to decrease learning rate, for 4 GPUs set &lt;code>learning_rate = 0,00065&lt;/code> (i.e. learning_rate = 0.00261 / GPUs). In this case also increase 4x times &lt;code>burn_in =&lt;/code> in your cfg-file. I.e. use &lt;code>burn_in = 4000&lt;/code> instead of &lt;code>1000&lt;/code>.&lt;/p>
&lt;/blockquote>
&lt;/li>
&lt;/ol>
&lt;h2 id="train-custom-datasets">Train custom datasets&lt;/h2>
&lt;p>Configuration setup see: &lt;a href="https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/train-yolo-v4-custom-dataset/">Train YOLO v4 on Custom Dataset&lt;/a>&lt;/p>
&lt;p>Start training:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">./darknet detector train data/obj.data &amp;lt;custom-cfg&amp;gt; yolov4.conv.137
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>
&lt;p>File &lt;code>&amp;lt;custom-cfg&amp;gt;_last.weights&lt;/code> will be saved to &lt;code>backup/&lt;/code> for each 100 iterations&lt;/p>
&lt;/li>
&lt;li>
&lt;p>File &lt;code>&amp;lt;custom-cfg&amp;gt;_xxxx.weights&lt;/code> will be saved to &lt;code>backup/&lt;/code> for each 1000 iterations&lt;/p>
&lt;/li>
&lt;li>
&lt;p>if you train on server without monitor, disable Loss-window by using argument &lt;code>--dont_show&lt;/code>. I.e.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">./darknet detector train data/obj.data &amp;lt;custom-cfg&amp;gt; yolov4.conv.137 -dont_show
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>
&lt;p>To see the mAP &amp;amp; Loss-chart during training on remote server without GUI, use&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">./darknet detector train data/obj.data &amp;lt;custom-cfg&amp;gt; yolov4.conv.137 -dont_show -mjpeg_port &lt;span class="m">8090&lt;/span> -map
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Then open URL &lt;code>http://ip-address:8090&lt;/code> in browser&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For training with mAP calculation for each 4 Epochs, you need to&lt;/p>
&lt;ul>
&lt;li>
&lt;p>set &lt;code>valid=valid.txt&lt;/code> or &lt;code>train.txt&lt;/code> in &lt;code>obj.data&lt;/code> file&lt;/p>
&lt;/li>
&lt;li>
&lt;p>run training with &lt;code>-map&lt;/code> argument&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">./darknet detector train data/obj.data &amp;lt;custom-cfg&amp;gt; yolov4.conv.137 -map
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>After training is complete - get result &lt;code>yolo-obj_final.weights&lt;/code> from &lt;code>backup/&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>After each 100 iterations you can stop and later start training from this point. For example, after 2000 iterations you can stop training, and later just start training using:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">./darknet detector train data/obj.data &amp;lt;custom-cfg&amp;gt; backup/yolo-obj_2000.weights
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>
&lt;p>You can get result earlier than all 45000 iterations.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="notes-">Notes 📝&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>If during training you see &lt;code>nan&lt;/code> values for &lt;code>avg&lt;/code> (loss) field, then training goes wrong. 😭&lt;/p>
&lt;p>But if &lt;code>nan&lt;/code> is in some other lines, then training goes well. 🙏&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If you changed &lt;code>width=&lt;/code> or &lt;code>height=&lt;/code> in your cfg-file, then new width and height must be &lt;strong>divisible by 32&lt;/strong>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If error &lt;code>Out of memory&lt;/code> occurs then in &lt;code>.cfg&lt;/code>-file you should increase &lt;code>subdivisions=16&lt;/code>, 32 or 64&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="when-should-i-stop-training-4">When should I stop training &lt;sup id="fnref:4">&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref">4&lt;/a>&lt;/sup>&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Usually sufficient 2000 iterations for each class(object),&lt;/p>
&lt;ul>
&lt;li>but NOT less than number of training images and&lt;/li>
&lt;li>NOT less than 6000 iterations in total.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>During training, you will see varying indicators of error, and you should stop when no longer decreases &lt;strong>0.XXXXXXX avg&lt;/strong>&lt;/p>
&lt;blockquote>
&lt;p>For example&lt;/p>
&lt;p>&lt;strong>9002&lt;/strong>: 0.211667, &lt;strong>0.60730 avg&lt;/strong>, 0.001000 rate, 3.868000 seconds, 576128 images Loaded: 0.000000 seconds&lt;/p>
&lt;ul>
&lt;li>&lt;strong>9002&lt;/strong> - iteration number (number of batch)&lt;/li>
&lt;li>&lt;strong>0.60730 avg&lt;/strong> - average loss (error) - &lt;strong>the lower, the better&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/blockquote>
&lt;p>he final avgerage loss can be from &lt;code>0.05&lt;/code> (for a small model and easy dataset) to &lt;code>3.0&lt;/code> (for a big model and a difficult dataset).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>if you train with flag &lt;code>-map&lt;/code> then you will see mAP indicator like &lt;code>Last accuracy mAP@0.5 = 18.50%&lt;/code> in the console. This indicator is better than Loss, so keep training while mAP increases.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="choose-the-best-weights">Choose the best weights&lt;/h2>
&lt;p>Once training is stopped, you should take some of last &lt;code>.weights&lt;/code>-files from &lt;code>backup/&lt;/code> and choose the best of them.&lt;/p>
&lt;p>&lt;em>For example, you stopped training after 9000 iterations, but the best result can give one of previous weights (7000, 8000, 9000). It can happen due to overfitting.&lt;/em>&lt;/p>
&lt;p>In order to choose best weight, just train with &lt;code>-map&lt;/code> flag&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">./darknet detector train data/obj.data &amp;lt;custom-cfg&amp;gt; yolov4.conv.137 -dont_show -map
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>So you will see mAP-chart (red-line) in the Loss-chart Window looks like the following figure. mAP will be calculated for each 4 Epochs using &lt;code>valid=valid.txt&lt;/code> file that is specified in &lt;code>obj.data&lt;/code> file (&lt;code>1 Epoch = images_in_train_txt / batch&lt;/code> iterations)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/68747470733a2f2f6873746f2e6f72672f776562742f79642f766c2f61672f7964766c616775746f66327a636e6a6f64737467726f656e3861632e6a706567.jpeg" alt="loss_chart_map_chart">&lt;/p>
&lt;h2 id="how-to-improve-object-detection-5">How to improve object detection&lt;sup id="fnref:5">&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref">5&lt;/a>&lt;/sup>&lt;/h2>
&lt;p>Before training&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Set flag &lt;code>random=1&lt;/code> in your &lt;code>.cfg&lt;/code>-file - it will increase precision by training Yolo for different resolutions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>increase network resolution in your &lt;code>.cfg&lt;/code>-file (&lt;code>height=608&lt;/code>, &lt;code>width=608&lt;/code> or any value multiple of 32) - it will increase precision&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Check that each object that you want to detect is mandatory labeled in your dataset - no one object in your data set should not be without label.&lt;/p>
&lt;ul>
&lt;li>In the most training issues, there are wrong labels in your dataset. Always check your dataset by using: &lt;a href="https://github.com/AlexeyAB/Yolo_mark">https://github.com/AlexeyAB/Yolo_mark&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>My Loss is very high and mAP is very low, is training wrong?&lt;/p>
&lt;p>&amp;ndash;&amp;gt; Run training with &lt;code>-show_imgs&lt;/code> flag at the end of training command, do you see correct bounded boxes of objects? If no, your training dataset is wrong.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For each object which you want to detect - there must be &lt;strong>at least 1 similar object&lt;/strong> in the Training dataset with about the same: shape, side of object, relative size, angle of rotation, tilt, illumination.&lt;/p>
&lt;ul>
&lt;li>So desirable that your training dataset include images with objects at diffrent: scales, rotations, lightings, from different sides, on different backgrounds&lt;/li>
&lt;li>You should preferably have 2000 different images for each class or more, and you should train &lt;code>2000*classes&lt;/code> iterations or more&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Desirable that your training dataset include images with non-labeled objects that you do not want to detect, i.e. negative samples without bounded box (empty &lt;code>.txt&lt;/code> files). Use as many images of negative samples as there are images with objects.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>More see: &lt;a href="https://github.com/AlexeyAB/darknet#how-to-improve-object-detection">https://github.com/AlexeyAB/darknet#how-to-improve-object-detection&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>After training, for detection:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Increase network-resolution by set in your &lt;code>.cfg&lt;/code>-file (&lt;code>height=608&lt;/code> and &lt;code>width=608&lt;/code>) or (&lt;code>height=832&lt;/code> and &lt;code>width=832&lt;/code>) or (any value multiple of 32). This increases the precision and makes it possible to detect small objects.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>It is not necessary to train the network again, just use &lt;code>.weights&lt;/code>-file already trained for 416x416 resolution&lt;/p>
&lt;/li>
&lt;li>
&lt;p>To get even greater accuracy you should train with higher resolution 608x608 or 832x832.&lt;/p>
&lt;ul>
&lt;li>Note: if error &lt;code>Out of memory&lt;/code> occurs then in &lt;code>.cfg&lt;/code>-file you should increase &lt;code>subdivisions=16&lt;/code>, 32 or 64&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="other-questions">Other questions&lt;/h2>
&lt;h3 id="will-darknet-automaticly-resize-the-image-size">Will darknet automaticly resize the image size?&lt;/h3>
&lt;p>Yes (see: &lt;a href="https://github.com/AlexeyAB/darknet/issues/5842">https://github.com/AlexeyAB/darknet/issues/5842&lt;/a>)&lt;/p>
&lt;h3 id="does-the-network-have-to-be-perfectly-square">Does the network have to be perfectly square?&lt;/h3>
&lt;blockquote>
&lt;p>No.&lt;/p>
&lt;p>The default network sizes in the common template configuration files are defined as 416x416 or 608x608, but &lt;em>those are only examples!&lt;/em>&lt;/p>
&lt;p>Choose a size that works for you and your images. The only restrictions are:&lt;/p>
&lt;ul>
&lt;li>the width has to be evenly divisible by 32&lt;/li>
&lt;li>the height has to be evenly divisible by 32&lt;/li>
&lt;li>you must have enough video memory to train a network of that size&lt;/li>
&lt;/ul>
&lt;p>Whatever size you choose, Darknet will stretch (without preserving the aspect ratio!) your images to be exactly that size prior to processing the image. This includes both training and inference. So use a size that makes sense for you and the images you need to process, but remember that there are important speed and memory limitations. The larger the size, the slower it will be to train and run, and the more GPU memory will be required.&lt;/p>
&lt;/blockquote>
&lt;p>See:&lt;/p>
&lt;p>&lt;a href="https://www.ccoderun.ca/programming/2020-09-25_Darknet_FAQ/#square_network">https://www.ccoderun.ca/programming/2020-09-25_Darknet_FAQ/#square_network&lt;/a>&lt;/p>
&lt;h3 id="detection-with-aspect-ratio-change">Detection with aspect ratio change&lt;/h3>
&lt;ol>
&lt;li>First of all, a high network resolution is important (the higher, the better). E.g. 800 x 800 will be better than 736 x 416, even if your input image is 1600 x 900.&lt;/li>
&lt;li>The aspect ratio is only of secondary importance.&lt;/li>
&lt;/ol>
&lt;p>See: &lt;a href="https://github.com/AlexeyAB/darknet/issues/131">https://github.com/AlexeyAB/darknet/issues/131&lt;/a>&lt;/p>
&lt;h2 id="useful-resources">Useful resources&lt;/h2>
&lt;ul>
&lt;li>Tips from Roboflow: &lt;a href="https://blog.roboflow.com/yolov4-tactics/">YOLOv4 - Ten Tactics to Build a Better Model&lt;/a>&lt;/li>
&lt;li>Articles from Aleksey Bochkovskiy (author of YOLOv4)
&lt;ul>
&lt;li>&lt;strong>&lt;a href="https://alexeyab84.medium.com/yolov4-the-most-accurate-real-time-neural-network-on-ms-coco-dataset-73adfd3602fe">YOLOv4 — the most accurate real-time neural network on MS COCO dataset.&lt;/a>&lt;/strong>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://alexeyab84.medium.com/scaled-yolo-v4-is-the-best-neural-network-for-object-detection-on-ms-coco-dataset-39dfa22fa982">Scaled YOLO v4 is the best neural network for object detection on MS COCO dataset&lt;/a>&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="https://www.ccoderun.ca/programming/2020-09-25_Darknet_FAQ/#how_to_get_started">DARKNET FAQ&lt;/a>&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>&lt;a href="https://github.com/AlexeyAB/darknet/wiki/FAQ---frequently-asked-questions#1-i-get-low-accuracy">FAQ: I get low accuracy&lt;/a>&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>&lt;a href="https://github.com/AlexeyAB/darknet/wiki/FAQ---frequently-asked-questions#2-darknet-trainingdetection-crashes-with-an-error">FAQ: Darknet training/detection crashes with an error&lt;/a>&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3">
&lt;p>&lt;a href="https://github.com/AlexeyAB/darknet#how-to-train-with-multi-gpu">How to train with multi-GPU&lt;/a>&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:4">
&lt;p>&lt;a href="https://github.com/AlexeyAB/darknet#when-should-i-stop-training">When should I stop training&lt;/a>&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:5">
&lt;p>&lt;a href="https://github.com/AlexeyAB/darknet#how-to-improve-object-detection">How to improve object detection&lt;/a>&amp;#160;&lt;a href="#fnref:5" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>YOLOv5: Train Custom Dataset</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/yolo-v5/</link><pubDate>Fri, 25 Dec 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/yolo-v5/</guid><description>&lt;p>We will learn&lt;/p>
&lt;ul>
&lt;li>training YOLOv5 on our custom dataset&lt;/li>
&lt;li>visualizing training logs&lt;/li>
&lt;li>using trained YOLOv5 for inference&lt;/li>
&lt;li>exporting trained YOLOv5 from PyTorch to other formats.&lt;/li>
&lt;/ul>
&lt;br>
&lt;h2 id="clone-yolov5-and-install-dependencies">Clone YOLOv5 and install dependencies&lt;/h2>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">git clone https://github.com/ultralytics/yolov5
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">cd&lt;/span> yolov5
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">pip install -r requirements.txt
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="prepare-custom-datasets">Prepare custom datasets&lt;/h2>
&lt;h3 id="yolo-darknet-format">YOLO darknet format&lt;/h3>
&lt;p>Dataset in &lt;a href="https://github.com/AlexeyAB/Yolo_mark/issues/60#issuecomment-401854885">YOLO darknet format&lt;/a> has the following structure:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>There&amp;rsquo;s a file named &lt;code>_darknet.labels&lt;/code> containing the object names (one name per line).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For each image file, there is a corresponding &lt;code>.txt&lt;/code> file (same name, but with the &lt;code>.txt&lt;/code> extension) in the same directory. I.e.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">dataset
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">|- train
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- _darknet.labels
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- train_img_001.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- train_img_001.txt
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ...
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- train_img_xxx.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- train_img_xxx.txt
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">|- valid # similar structure as train
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">|- test # similar structure as train
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>
&lt;p>The &lt;code>*.txt&lt;/code> file specifications are:&lt;/p>
&lt;ul>
&lt;li>One row per object&lt;/li>
&lt;li>Each row is &lt;code>class x_center y_center width height&lt;/code> format.&lt;/li>
&lt;li>Box coordinates must be in &lt;strong>normalized xywh&lt;/strong> format (from 0 to 1). If your boxes are in pixels, divide &lt;code>x_center&lt;/code> and &lt;code>width&lt;/code> by the image width, and &lt;code>y_center&lt;/code> and &lt;code>height&lt;/code> by the image height (see the conversion sketch after this list).&lt;/li>
&lt;li>Class numbers are zero-indexed (start from 0).&lt;/li>
&lt;/ul>
&lt;p>For example &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/91506361-c7965000-e886-11ea-8291-c72b98c25eec.jpg" alt="Image Labels">&lt;/p>
&lt;p>The label file corresponding to the above image contains 2 persons (class &lt;code>0&lt;/code>) and a tie (class &lt;code>27&lt;/code>):&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/98809572-0bc4d580-241e-11eb-844e-eee756f878c2.png" alt="img" style="zoom: 67%;" />
&lt;/li>
&lt;/ul>
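&lt;p>As referenced in the list above, converting a pixel-coordinate box to the normalized xywh format can be sketched as follows (a minimal example, not tied to any particular labeling tool; the numbers are made up):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">def to_normalized_xywh(x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to normalized (x_center, y_center, width, height)."""
    x_center = (x_min + x_max) / 2.0 / img_w
    y_center = (y_min + y_max) / 2.0 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return x_center, y_center, width, height

# e.g. a box from (100, 50) to (300, 350) in a 640x480 image:
print(to_normalized_xywh(100, 50, 300, 350, 640, 480))
# roughly (0.3125, 0.4167, 0.3125, 0.625)
&lt;/code>&lt;/pre>&lt;/div>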
&lt;h3 id="yolov5-format">YOLOv5 format&lt;/h3>
&lt;p>&lt;a href="https://github.com/ultralytics/yolov5/wiki/Train-Custom-Data">YOLOv5 format&lt;/a>:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>If there are no objects in an image, no &lt;code>*.txt&lt;/code> file is required&lt;/p>
&lt;/li>
&lt;li>
&lt;p>YOLOv5 locates labels automatically for each image by replacing the last instance of &lt;strong>/images/&lt;/strong> in the image path with &lt;strong>/labels/&lt;/strong> (a small path-mapping sketch follows this list). Therefore, the folder structure of the dataset should look like below:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">dataset
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">|- images
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- train
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- train_img_001.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ...
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- train_img_xxx.jpg
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- valid
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- test
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">|- labels
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- train
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- train_img_001.txt
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ...
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- train_img_xxx.txt
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- valid
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> |- test
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;/ul>
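&lt;p>As noted above, the image-to-label mapping is a pure path substitution. A simplified sketch of what YOLOv5 does (the example path is made up; YOLOv5 itself replaces only the last occurrence of &lt;strong>/images/&lt;/strong>):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">img_path = "custom-dataset/images/train/train_img_001.jpg"  # hypothetical example
label_path = img_path.replace("/images/", "/labels/").rsplit(".", 1)[0] + ".txt"
print(label_path)  # custom-dataset/labels/train/train_img_001.txt
&lt;/code>&lt;/pre>&lt;/div>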
&lt;h3 id="yolo-darknet-format----yolov5-format">YOLO darknet format &amp;ndash;&amp;gt; YOLOv5 format&lt;/h3>
&lt;p>Assuming we have a dataset in YOLO darknet format, we want to convert it to YOLOv5 format.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">pathlib&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">Path&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">shutil&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">rmtree&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">copy2&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">tqdm&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">tqdm&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">copy_files&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">src_dir&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">dest_dir&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ext&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;jpg&amp;#34;&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> Copy files with the same extension from source directory to destination directory
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> Parameters
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> ----------
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> src_dir : str
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> source directory
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> dest_dir : str
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> destination directory
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> ext : str, optional
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> extension of files to be moved, by default &amp;#34;jpg&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">file&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">tqdm&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">Path&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">src_dir&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">glob&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;*.&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">ext&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">desc&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;Copying .&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">ext&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> files from &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">src_dir&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> to &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">dest_dir&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">copy2&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">file&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">dest_dir&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">convert_dataset_darknet_to_yolov5&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">src_dir_darknet&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">dest_dir_yolov5&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">dataset_types&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;train&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;valid&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;test&amp;#34;&lt;/span>&lt;span class="p">]):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> Convert dataset from YOLO darknet format to scaled YOLOv4 format
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> Parameters
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> ----------
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> src_dir_darknet : str
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> source dataset in YOLO darknet format
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> dest_dir_scaled_yolov4 : str
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> destination dataset in scaled YOLOv4 format
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> dataset_types : list, optional
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> types of dataset, by default [&amp;#34;train&amp;#34;, &amp;#34;valid&amp;#34;]
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dest_dir_yolov5&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">Path&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">dest_dir_yolov5&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">dest_dir_yolov5&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">exists&lt;/span>&lt;span class="p">():&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">rmtree&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">dest_dir_yolov5&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dest_dir_yolov5&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">mkdir&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="nb">dir&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;images&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;labels&amp;#34;&lt;/span>&lt;span class="p">]:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">dataset_type&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">dataset_types&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dest_dir&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">dest_dir_yolov5&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">joinpath&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="nb">dir&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">dataset_type&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dest_dir&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">mkdir&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">parents&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">src_dir&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">Path&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">src_dir_darknet&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">joinpath&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">dataset_type&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">ext&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;jpg&amp;#34;&lt;/span> &lt;span class="k">if&lt;/span> &lt;span class="nb">dir&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s2">&amp;#34;images&amp;#34;&lt;/span> &lt;span class="k">else&lt;/span> &lt;span class="s2">&amp;#34;txt&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">copy_files&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">src_dir&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">dest_dir&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ext&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">ext&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;Copy &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="nb">dir&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> from &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">src_dir&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> to &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">dest_dir&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> done!&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="define-training-configuration">Define training configuration&lt;/h2>
&lt;p>For training we need to configure a &lt;code>.yaml&lt;/code> file which specifies&lt;/p>
&lt;ul>
&lt;li>
&lt;p>download commands/URL for auto-downloading (optional)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>the paths of the training and validation folders&lt;/p>
&lt;/li>
&lt;li>
&lt;p>number of classes&lt;/p>
&lt;/li>
&lt;li>
&lt;p>class names&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>and &lt;strong>put this &lt;code>.yaml&lt;/code> file in &lt;code>yolov5/data/&lt;/code>.&lt;/strong>&lt;/p>
&lt;p>For example, let&amp;rsquo;s say we have a &lt;code>custom-dataset&lt;/code> folder in YOLOv5 format next to &lt;code>yolov5&lt;/code>. This custom dataset contains 3 object classes: &amp;ldquo;cat&amp;rdquo;, &amp;ldquo;dog&amp;rdquo;, &amp;ldquo;monkey&amp;rdquo;.&lt;/p>
&lt;p>Then &lt;code>yolov5/data/custom-dataset.yaml&lt;/code> should look like:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-yaml" data-lang="yaml">&lt;span class="line">&lt;span class="cl">&lt;span class="nt">train&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">../custom-dataset/images/train&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">valid&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="l">../custom-dataest/images/valid&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">nc&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="m">3&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="nt">names&lt;/span>&lt;span class="p">:&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;cat&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;dog&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="s2">&amp;#34;monkey&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="select-a-model">Select a model&lt;/h2>
&lt;p>Select a pretrained model to start training from &lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/97808084-edfcb100-1c64-11eb-83eb-ffed43a0859f.png" alt="YOLOv5 Models" style="zoom: 50%;" />
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Model&lt;/th>
&lt;th>APval&lt;/th>
&lt;th>APtest&lt;/th>
&lt;th>AP50&lt;/th>
&lt;th>SpeedGPU&lt;/th>
&lt;th>FPSGPU&lt;/th>
&lt;th>&lt;/th>
&lt;th>params&lt;/th>
&lt;th>GFLOPS&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;a href="https://github.com/ultralytics/yolov5/releases">YOLOv5s&lt;/a>&lt;/td>
&lt;td>37.0&lt;/td>
&lt;td>37.0&lt;/td>
&lt;td>56.2&lt;/td>
&lt;td>&lt;strong>2.4ms&lt;/strong>&lt;/td>
&lt;td>&lt;strong>416&lt;/strong>&lt;/td>
&lt;td>&lt;/td>
&lt;td>7.5M&lt;/td>
&lt;td>17.5&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;a href="https://github.com/ultralytics/yolov5/releases">YOLOv5m&lt;/a>&lt;/td>
&lt;td>44.3&lt;/td>
&lt;td>44.3&lt;/td>
&lt;td>63.2&lt;/td>
&lt;td>3.4ms&lt;/td>
&lt;td>294&lt;/td>
&lt;td>&lt;/td>
&lt;td>21.8M&lt;/td>
&lt;td>52.3&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;a href="https://github.com/ultralytics/yolov5/releases">YOLOv5l&lt;/a>&lt;/td>
&lt;td>47.7&lt;/td>
&lt;td>47.7&lt;/td>
&lt;td>66.5&lt;/td>
&lt;td>4.4ms&lt;/td>
&lt;td>227&lt;/td>
&lt;td>&lt;/td>
&lt;td>47.8M&lt;/td>
&lt;td>117.2&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;a href="https://github.com/ultralytics/yolov5/releases">YOLOv5x&lt;/a>&lt;/td>
&lt;td>&lt;strong>49.2&lt;/strong>&lt;/td>
&lt;td>&lt;strong>49.2&lt;/strong>&lt;/td>
&lt;td>&lt;strong>67.7&lt;/strong>&lt;/td>
&lt;td>6.9ms&lt;/td>
&lt;td>145&lt;/td>
&lt;td>&lt;/td>
&lt;td>89.0M&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>For example, we select YOLOv5s, the smallest and fastest model available. (YOLOv5m, YOLOv5l, YOLOv5x work similarly.)&lt;/p>
&lt;p>In order to use YOLOv5s for training on a custom dataset, we need to adjust &lt;code>models/yolov5s.yaml&lt;/code>: &lt;strong>change the number of classes &lt;code>nc&lt;/code> according to our custom dataset.&lt;/strong> Following the example above, the value of &lt;code>nc&lt;/code> is 3.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">models_dir&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="s2">&amp;#34;yolov5/models&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">yolov5s&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">models_dir&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;yolov5s.yaml&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">yolov5s_custom&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">join&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">models_dir&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;yolov5s_custom.yaml&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">num_class&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">3&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">with&lt;/span> &lt;span class="nb">open&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yolov5s&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;r&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">reader&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nb">open&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yolov5s_custom&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;w&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="n">writer&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">lines&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">reader&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">readlines&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># change number of classes according to custom dataset&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">lines&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;nc: &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">num_class&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2"> # number of classes&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">writer&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">writelines&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">lines&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="train">Train&lt;/h2>
&lt;p>Now we&amp;rsquo;re ready for training YOLOv5 on our custom dataset.&lt;/p>
&lt;p>To kick off training, we execute &lt;code>train.py&lt;/code> with the following options:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>img:&lt;/strong> define input image size&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>batch:&lt;/strong> determine batch size&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>epochs:&lt;/strong> define the number of training epochs. (Note: 3000 or more epochs are common here!)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>data:&lt;/strong> set the path to our yaml file&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>cfg:&lt;/strong> specify our model configuration&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>weights:&lt;/strong> specify a custom path to weights.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Use pretrained weights (recommended): &lt;code>--weights yolov5s.pt&lt;/code>&lt;/p>
&lt;p>(Pretrained weights are auto-downloaded from the latest YOLOv5 release.)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use randomly initialized weights (NOT recommended!): &lt;code>--weights ''&lt;/code>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>name:&lt;/strong> result names&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>nosave:&lt;/strong> only save the final checkpoint&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>cache:&lt;/strong> cache images for faster training&lt;/p>
&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">python train.py --img &lt;span class="m">416&lt;/span> --batch &lt;span class="m">16&lt;/span> --epochs &lt;span class="m">1000&lt;/span> --data ./data/masks.yaml --cfg ./models/yolov5s_masks.yaml --weights yolov5s.pt --cache-images
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="training-logging">Training logging&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>All training results are saved to &lt;code>runs/train/&lt;/code> with incrementing run directories, i.e. &lt;code>runs/train/exp&lt;/code>, &lt;code>runs/train/exp1&lt;/code>, &lt;code>runs/train/exp2&lt;/code>, etc.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>We can view training losses and performance metrics using &lt;strong>Tensorboard&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>If training on Google Colab:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">%load_ext tensorboard
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">%tensorboard --logdir runs
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Training losses and performance metrics are also saved to a logfile.&lt;/p>
&lt;ul>
&lt;li>If given no name, it defaults to &lt;code>results.txt&lt;/code>. We can also specify the name with &lt;code>--name&lt;/code> flag when we train.&lt;/li>
&lt;li>&lt;code>results.png&lt;/code> contains plotting of different metrics&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="run-inference-with-trained-weights">Run inference with trained weights&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Trained weights are saved by default in &lt;code>runs/train/exp/weights&lt;/code> folder.&lt;/p>
&lt;ul>
&lt;li>The best weights &lt;code>best.pt&lt;/code> and the last weights &lt;code>last.pt&lt;/code> are saved&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>For inference we use &lt;code>detect.py&lt;/code>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">python detect.py --weights ./runs/train/exp/weights/best.pt --img &lt;span class="m">416&lt;/span> --conf-thres 0.5 --source &amp;lt;path-to-test-set&amp;gt;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;/ul>
&lt;h2 id="export-a-trained-yolov5-model">Export a trained YOLOv5 model&lt;/h2>
&lt;ul>
&lt;li>Install dependencies&lt;/li>
&lt;li>Use &lt;code>models/export.py&lt;/code> to export to ONNX, TorchScript and CoreML formats&lt;/li>
&lt;/ul>
&lt;h2 id="google-colab-notebook">Google Colab Notebook&lt;/h2>
&lt;p>Open in &lt;a href="https://colab.research.google.com/drive/1lu3sSPWUzuxJTMqcwdFTC-iXvagIAKk2">Colab&lt;/a>&lt;/p>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>YOLOv5 repo: &lt;a href="https://github.com/ultralytics">ultralytics&lt;/a>/&lt;strong>&lt;a href="https://github.com/ultralytics/yolov5">yolov5&lt;/a>&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Developed actively&lt;/li>
&lt;li>&lt;a href="https://github.com/ultralytics/yolov5/wiki">Tutorials&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Tutorials&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="https://github.com/ultralytics/yolov5/wiki">Official tutorials&lt;/a> from YOLOv5 repo&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/ultralytics/yolov5/wiki/Train-Custom-Data">Train Custom Data&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/ultralytics/yolov5/issues/251">ONNX and TorchScript Export&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Roboflow tutorials&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Blog post: &lt;a href="https://blog.roboflow.com/how-to-train-yolov5-on-a-custom-dataset/">How to Train YOLOv5 On a Custom Dataset&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://colab.research.google.com/drive/1gDZ2xcTOgR39tGGs-EZ6i3RTs16wmzZQ">Google Colab Notebook&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Video tutorial&lt;/p>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/MdF6x6ZmLAY?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Very detailed tutorial and explanation: &lt;a href="https://blog.csdn.net/g11d111/article/details/108872076">Yolov5 系列2&amp;mdash; 如何使用Yolov5训练你自己的数据集&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>YOLOv5 explanation: &lt;a href="https://www.xiaoheidiannao.com/211455.html">深入浅出Yolo系列之Yolov5核心基础知识完整讲解&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>&lt;a href="https://github.com/ultralytics/yolov5/wiki/Train-Custom-Data#2-create-labels">https://github.com/ultralytics/yolov5/wiki/Train-Custom-Data#2-create-labels&lt;/a>&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>&lt;a href="https://github.com/ultralytics/yolov5#pretrained-checkpoints">https://github.com/ultralytics/yolov5#pretrained-checkpoints&lt;/a>&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Scaled YOLOv4</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/scaled-yolo-v4/</link><pubDate>Tue, 05 Jan 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/scaled-yolo-v4/</guid><description>&lt;p>Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao (more commonly known by their GitHub monikers, &lt;a href="https://github.com/WongKinYiu">WongKinYiu&lt;/a> and &lt;a href="https://github.com/AlexeyAB">AlexyAB&lt;/a>) have propelled the YOLOv4 model forward by efficiently scaling the network&amp;rsquo;s design and scale, surpassing the previous state-of-the-art EfficientDet published earlier this year by the Google Research/Brain team.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image.png" alt="img" style="zoom:80%;" />
&lt;h2 id="train-scaled-yolov4-pytorch">Train scaled YOLOv4 (PyTorch)&lt;/h2>
&lt;p>The Scaled-YOLOv4 implementation is written in the YOLOv5 PyTorch framework. Training scaled YOLOv4 is similar to &lt;a href="https://haobin-tan.netlify.app/tags/yolov5/">training YOLOv5&lt;/a>.&lt;/p>
&lt;blockquote>
&lt;p>Here is &lt;a href="https://github.com/WongKinYiu/ScaledYOLOv4/blob/yolov4-large/models/yolov4-csp.yaml">the Scaled-YOLOv4 repo&lt;/a>, though you will notice that &lt;a href="https://github.com/WongKinYiu">WongKinYiu&lt;/a> has provided it there predominantly for research replication purposes and there are not many instructions for training on your own dataset. To train on your own data, our guide on &lt;a href="https://blog.roboflow.com/how-to-train-yolov5-on-a-custom-dataset/">training YOLOv5 in PyTorch on custom data&lt;/a> will be useful, as it is a very similar training procedure.&lt;/p>
&lt;/blockquote>
&lt;p>Tutorials from Roboflow:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Video tutorial:&lt;/p>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/rEbpKxZbvIo?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://blog.roboflow.com/how-to-train-scaled-yolov4/">Blog post&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://colab.research.google.com/drive/1LDmg0JRiC2N7_tx8wQoBzTB0jUZhywQr?usp=sharing">Google Colab Notebook&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>My Colab Notebook: &lt;a href="https://colab.research.google.com/drive/1GfOzuMCpIcg1luILv7rehfY3Hk4p4SWc">yolov4_scaled.ipynb&lt;/a>&lt;/p>
&lt;h2 id="train-scaled-yolov4-darknet">Train scaled YOLOv4 (Darknet)&lt;/h2>
&lt;p>YOLOv4-csp training is also supported by &lt;a href="https://github.com/AlexeyAB/darknet#pre-trained-models">Darknet&lt;/a>. Training yolov4-csp is similar to training yolov4 and yolov4-tiny, with slight differences:&lt;/p>
&lt;ul>
&lt;li>For config file, use &lt;a href="https://raw.githubusercontent.com/AlexeyAB/darknet/master/cfg/yolov4-csp.cfg">yolov4-csp.cfg&lt;/a>&lt;/li>
&lt;li>For pretrained weights, use &lt;a href="https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v4_pre/yolov4-csp.weights">yolov4-csp.weights&lt;/a>&lt;/li>
&lt;li>For pretrained convolutional layer weights, use &lt;a href="https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v4_pre/yolov4-csp.conv.142">yolov4-csp.conv.142&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Scaled YOLOv4 &lt;a href="https://arxiv.org/abs/2011.08036">paper&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Github repo: &lt;a href="https://github.com/WongKinYiu">WongKinYiu&lt;/a>/&lt;strong>&lt;a href="https://github.com/WongKinYiu/ScaledYOLOv4">ScaledYOLOv4&lt;/a>&lt;/strong> (Different size of model in different branch)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Blog post from AlexeyAB: &lt;a href="https://alexeyab84.medium.com/scaled-yolo-v4-is-the-best-neural-network-for-object-detection-on-ms-coco-dataset-39dfa22fa982">Scaled YOLO v4 is the best neural network for object detection on MS COCO dataset&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Tutorials blog posts:&lt;/p>
&lt;ul>
&lt;li>Roboflow: &lt;a href="https://blog.roboflow.com/scaled-yolov4-tops-efficientdet/">Scaled-YOLOv4 is Now the Best Model for Object Detection&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://bbs.cvmart.net/articles/3674">YOLOv4 团队最新力作！1774fps、COCO 最佳精度，分别适合高低端 GPU 的 YOLO&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://zhuanlan.zhihu.com/p/299385758">上达最高精度，下到最快速度，Scaled-YOLOv4：模型缩放显神威&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>YOLOv3: Train on Custom Dataset</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/train-yolo-v3/</link><pubDate>Tue, 05 Jan 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/train-yolo-v3/</guid><description>&lt;p>Training YOLOv3 as well as YOLOv3 tiny on custom dataset is similar to &lt;a href="https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/train-yolo-v4-custom-dataset/">training YOLOv4 and YOLOv4 tiny&lt;/a>. Only some steps need to be adjusted for YOLOv3 and YOLOv3 tiny:&lt;/p>
&lt;ul>
&lt;li>In step 1, we create our custom config file based on &lt;strong>cfg/yolov3.cfg&lt;/strong> (YOLOv3) and &lt;strong>cfg/yolov3-tiny.cfg&lt;/strong> (YOLOv3 tiny). Then adjust &lt;code>batch&lt;/code>, &lt;code>subdivisions&lt;/code>, &lt;code>steps&lt;/code>, &lt;code>width&lt;/code>, &lt;code>height&lt;/code>, &lt;code>classes&lt;/code>, and &lt;code>filters&lt;/code> just as for YOLOv4.&lt;/li>
&lt;li>In step 6, download different pretrained weights for the convolutional layers
&lt;ul>
&lt;li>for &lt;code>yolov3.cfg, yolov3-spp.cfg&lt;/code> (154 MB): &lt;a href="https://pjreddie.com/media/files/darknet53.conv.74">darknet53.conv.74&lt;/a>&lt;/li>
&lt;li>for &lt;code>yolov3-tiny-prn.cfg , yolov3-tiny.cfg&lt;/code> (6 MB): &lt;a href="https://drive.google.com/file/d/18v36esoXCh-PsOKwyP2GWrpYDptDY8Zf/view?usp=sharing">yolov3-tiny.conv.11&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Tutorial from darknet repo: &lt;a href="https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects">How to train (to detect your custom objects)&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://thebinarynotes.com/how-to-train-yolov3-custom-dataset/">How to train YOLOv3 on the custom dataset&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul></description></item><item><title>Histogram of Oriented Gradients (HOG)</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/hog/</link><pubDate>Sat, 20 Feb 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/hog/</guid><description>&lt;h2 id="what-is-a-feature-descriptor">What is a Feature Descriptor&lt;/h2>
&lt;p>A feature descriptor is a &lt;strong>representation of an image or an image patch that simplifies the image by extracting useful information and throwing away extraneous information&lt;/strong>.&lt;/p>
&lt;p>Typically, a feature descriptor converts an image of size $\text{width} \times \text{height} \times 3 \text{(channels)}$ to a feature vector / array of length $n$. In the case of the HOG feature descriptor, the input image is of size $64 \times 128 \times 3$ and the output feature vector is of length $3780$.&lt;/p>
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">HOG descriptor can be calculated for other sizes. Here we just stick to numbers presented in the original paper for the sake of simplicity.&lt;/span>
&lt;/div>
&lt;h2 id="how-to-calculate-histogram-of-oriented-gradients">How to calculate Histogram of Oriented Gradients?&lt;/h2>
&lt;p>In this section, we will go into the details of calculating the HOG feature descriptor. To illustrate each step, we will use a patch of an image.&lt;/p>
&lt;h3 id="1-preprocessing">1. Preprocessing&lt;/h3>
&lt;p>Typically patches at multiple scales are analyzed at many image locations. The only constraint is that the patches being analyzed have a fixed aspect ratio.&lt;/p>
&lt;p>In our case, the patches need to have an aspect ratio of 1:2. For example, they can be 100×200, 128×256, or 1000×2000 but not 101×205.&lt;/p>
&lt;p>For the example image of size 720x475 below, we select a patch of size 100x200 for calculating HOG feature descriptor. This patch is then cropped out of an image and resized to 64×128.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/hog-preprocessing.jpg" alt="HOG Preprocessing">&lt;/p>
&lt;h3 id="2-calculate-the-gradient-images">2. Calculate the Gradient Images&lt;/h3>
&lt;p>To calculate a HOG descriptor, we need to first calculate the horizontal and vertical gradients; after all, we want to calculate the histogram of gradients.&lt;/p>
&lt;p>Calculating the horizontal and vertical gradients is easily achieved by filtering the image with the following kernels (&lt;strong>Sobel&lt;/strong> operator).&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/gradient-kernels.jpg">&lt;figcaption>
&lt;h4>Kernels for gradient calculation (left: $x$-gradient, right: $y$-gradient).&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>Next, we can find the magnitude and direction of gradient using the following formula:
&lt;/p>
$$
\begin{array}{l}
g=\sqrt{g\_{x}^{2}+g\_{y}^{2}} \\\\
\theta=\arctan \frac{g\_{y}}{g\_{x}}
\end{array}
$$
&lt;p>
The figure below shows the gradients:&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/gradients.png">&lt;figcaption>
&lt;h4>Left : Absolute value of x-gradient. Center : Absolute value of y-gradient. Right : Magnitude of gradient.&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>At every pixel, the gradient has a magnitude and a direction. For color images, the gradients of the three channels are evaluated ( as shown in the figure above ). The magnitude of gradient at a pixel is the maximum of the magnitude of gradients of the three channels, and the angle is the angle corresponding to the maximum gradient.&lt;/p>
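&lt;p>A hedged sketch of this gradient step (using OpenCV on a grayscale image for simplicity; the file name is hypothetical):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">import cv2
import numpy as np

img = cv2.imread("patch.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32) / 255.0

# Horizontal and vertical gradients with the 1-D kernels [-1, 0, 1] (ksize=1 gives exactly these kernels)
gx = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=1)
gy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=1)

# Magnitude and direction per pixel (angles in degrees, 0-360; HOG later folds them into 0-180)
mag, angle = cv2.cartToPolar(gx, gy, angleInDegrees=True)
&lt;/code>&lt;/pre>&lt;/div>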
&lt;h3 id="3-calculate-histogram-of-gradients-in-88-cells">3. Calculate Histogram of Gradients in 8×8 cells&lt;/h3>
&lt;p>In this step, the image is divided into 8×8 cells, and a histogram of gradients is calculated for each 8×8 cell.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/hog-cells.png" alt="8x8 cells of HOG" style="zoom:80%;" />
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">&lt;ul>
&lt;li>
&lt;p>&lt;strong>Why divide into patches?&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Using a feature descriptor to describe a &lt;strong>patch&lt;/strong> of an image provides a compact representation.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Not only is the representation more compact, calculating a histogram over a patch also makes this representation more robust to noise. Individual gradients may be noisy, but a histogram over an 8×8 patch makes the representation much less sensitive to noise.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Why 8x8 patches?&lt;/strong>&lt;/p>
&lt;p>It is a design choice informed by the scale of features we are looking for. HOG was initially used for pedestrian detection. 8×8 cells in a photo of a pedestrian scaled to 64×128 are big enough to capture interesting features (e.g. the face, the top of the head, etc.).&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/span>
&lt;/div>
&lt;p>Let&amp;rsquo;s look at one 8×8 patch in the image and see how the gradients look.&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/hog-cell-gradients.png">&lt;figcaption>
&lt;h4>Center : The RGB patch and gradients represented using arrows. Right : The gradients in the same patch represented as numbers&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;ul>
&lt;li>
&lt;p>The image in the center shows the patch of the image overlaid with arrows showing the gradient — the arrow shows the direction of gradient and its length shows the magnitude. The direction of arrows points to the direction of change in intensity and the magnitude shows how big the difference is.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>On the right, gradient direction is represented by angles between 0 and 180 degrees instead of 0 to 360 degrees. These are called &lt;strong>“unsigned” gradients&lt;/strong> because a gradient and its negative are represented by the same numbers. Empirically it has been shown that unsigned gradients work better than signed gradients for pedestrian detection.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>The next step is to create a histogram of gradients in these 8×8 cells. The histogram contains 9 bins corresponding to angles 0, 20, 40 … 160 degrees (in this representation, 0 degrees corresponds to the $y$-axis).&lt;/p>
&lt;p>The following figure illustrates the process. We are looking at magnitude and direction of the gradient of the same 8×8 patch as in the previous figure. A bin is selected based on the direction, and the vote ( the value that goes into the bin ) is selected based on the magnitude.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/hog-histogram-1.png" alt="Histogram computation in HOG">&lt;/p>
&lt;ul>
&lt;li>For the pixel encircled in blue: It has an angle ( direction ) of 80 degrees and magnitude of 2. So it adds 2 to the 5th bin (bin for angle 80).&lt;/li>
&lt;li>For the pixel encircled in red: It has an angle of 10 degrees and magnitude of 4. Since 10 degrees is half way between 0 and 20, the vote by the pixel splits evenly into the two bins.&lt;/li>
&lt;/ul>
&lt;p>One more detail to be aware of: If the angle is greater than 160 degrees, it is between 160 and 180, and we know the angle wraps around making 0 and 180 equivalent. So in the example below, the pixel with angle 165 degrees contributes &lt;em>&lt;strong>proportionally&lt;/strong>&lt;/em> to the 0 degree bin and the 160 degree bin.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/hog-histogram-2.png" alt="Histogram computation in HOG">&lt;/p>
&lt;p>The contributions of all the pixels in the 8×8 cells are added up to create the 9-bin histogram. For the patch above, it looks like this&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/histogram-8x8-cell.png" alt="Histogram of 8x8 cell ">&lt;/p>
&lt;p>As mentioned above, the $y$-axis corresponds to 0 degrees. We can see the histogram has a lot of weight near 0 and 180 degrees, which is just another way of saying that in this patch the gradients point either up or down.&lt;/p>
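&lt;p>The binning with proportional vote splitting described above can be sketched as follows (a simplified NumPy version for a single 8×8 cell of unsigned gradients, not the exact reference implementation):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">import numpy as np

def cell_histogram(mag, angle, bin_width=20.0, n_bins=9):
    """9-bin gradient histogram of one 8x8 cell; each vote is split proportionally between the two nearest bins."""
    hist = np.zeros(n_bins)
    for m, a in zip(mag.ravel(), angle.ravel()):
        a = a % 180.0                        # unsigned gradients: fold 0-360 into 0-180
        lo = int(a // bin_width) % n_bins    # lower bin (angles 0, 20, ..., 160)
        hi = (lo + 1) % n_bins               # next bin, wrapping 180 back to the 0-degree bin
        frac = (a - lo * bin_width) / bin_width
        hist[lo] += m * (1.0 - frac)
        hist[hi] += m * frac
    return hist
&lt;/code>&lt;/pre>&lt;/div>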
&lt;h3 id="4-1616-block-normalization">4. 16×16 Block Normalization&lt;/h3>
&lt;p>In the last step, we created a histogram based on the gradient of the image. However, gradients of an image are sensitive to overall lighting. Ideally, we want our descriptor to be independent of lighting variations. In other words, we would like to “normalize” the histogram so they are not affected by lighting variations.&lt;/p>
&lt;p>Instead of normalizing just a single 8x8 cell, we&amp;rsquo;ll normalize over a bigger block of 16×16 (i.e. 2×2 cells). A 16×16 block contains 4 histograms, which can be concatenated to form a 36 x 1 element vector and then normalized. The window is then moved by 8 pixels (see animation), a normalized 36×1 vector is calculated over this window, and the process is repeated.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/hog-16x16-block-normalization.gif" alt="HOG 16x16 Block Normalization">&lt;/p>
&lt;h3 id="5-calculate-the-hog-feature-vector">5. Calculate the HOG feature vector&lt;/h3>
&lt;p>To calculate the final feature vector for the entire image patch, the 36×1 vectors are concatenated into one giant vector:&lt;/p>
&lt;ul>
&lt;li>Number of 16x16 blocks: $7 \times 15 = 105$&lt;/li>
&lt;li>Each 16x16 block is represented by a $36\times1$ vector&lt;/li>
&lt;/ul>
&lt;p>Therefore, the giant vector has the dimension $36 \times 105 = 3780$&lt;/p>
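&lt;p>As a quick sanity check of the 3780 figure, the block count and feature length can be derived directly from the 64×128 window (a small sketch; the numbers simply follow the steps above):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">cell = 8       # cell size in pixels
block = 2      # block size in cells (16x16 pixels)
stride = 1     # block stride in cells (8 pixels)
n_bins = 9

cells_x, cells_y = 64 // cell, 128 // cell      # 8 x 16 cells
blocks_x = (cells_x - block) // stride + 1      # 7
blocks_y = (cells_y - block) // stride + 1      # 15
feature_length = blocks_x * blocks_y * block * block * n_bins
print(blocks_x * blocks_y, feature_length)      # 105 3780
&lt;/code>&lt;/pre>&lt;/div>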
&lt;h2 id="hog-visualization">HOG visualization&lt;/h2>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">skimage&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">io&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">skimage.feature&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">hog&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">skimage&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">data&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">exposure&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">google.colab.patches&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">cv2_imshow&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">matplotlib.pyplot&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">plt&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">image&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">io&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imread&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;https://pic4.zhimg.com/80/v2-2ccc671e60031942dca8a129410a0383_720w.jpg&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fd&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">hog_image&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">hog&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">orientations&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">8&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">pixels_per_cell&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">16&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">16&lt;/span>&lt;span class="p">),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">cells_per_block&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">visualize&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">multichannel&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fig&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">ax1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ax2&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">subplots&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">figsize&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">10&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">sharex&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">sharey&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax1&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;off&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax1&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imshow&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">cmap&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">cm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">gray&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax1&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Input image&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Rescale histogram for better display&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">hog_image_rescaled&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">exposure&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">rescale_intensity&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">hog_image&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">in_range&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">10&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax2&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;off&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax2&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imshow&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">hog_image_rescaled&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">cmap&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">cm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">gray&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax2&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;HOG&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">show&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Input_vs_HOG.png" alt="Input_vs_HOG" style="zoom: 25%;" />
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://learnopencv.com/histogram-of-oriented-gradients/#disqus_thread">Histogram of Oriented Gradients&lt;/a>: clear and detailed explanation 👍&lt;/li>
&lt;li>&lt;a href="https://shartoo.github.io/2019/03/04/HOG-feature/">HOG特征详解&lt;/a>: HOG visualization&lt;/li>
&lt;li>Video explanation:
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
&lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/0Zib1YEE4LU?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
>&lt;/iframe>
&lt;/div>
&lt;/li>
&lt;/ul></description></item><item><title>Overview of Region-based Object Detectors</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/overview-region-based-detectors/</link><pubDate>Sat, 20 Feb 2021 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/object-detection/overview-region-based-detectors/</guid><description>&lt;h2 id="sliding-window-detectors">Sliding-window detectors&lt;/h2>
&lt;p>A brute-force approach to object detection is to &lt;strong>slide windows from left to right and from top to bottom&lt;/strong> and identify objects using classification. To detect different object types at various viewing distances, we use windows of varied sizes and aspect ratios.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*-GaZ8hGBKsbtGfRJqvOVHQ.jpeg" alt="Image for post">&lt;/p>
&lt;p>We cut out patches from the picture according to the sliding windows. The patches are warped, since many classifiers accept fixed-size images only.&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*A7DE4HKukbXpQqwvCaLOEQ.jpeg">&lt;figcaption>
&lt;h4>Warp an image to a fixed size image&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>The warped image patch is fed into a CNN to extract a 4096-dimensional feature vector. Then we apply an SVM classifier to identify the class and a linear regressor to refine the bounding box.&lt;/p>
&lt;p>System flow:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*BYSA3iip3Cdr0L_x5r468A.png" alt="Image for post">&lt;/p>
&lt;p>Pseudo-code:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">window&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">windows&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">patch&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">get_patch&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">window&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">results&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">detector&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">patch&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>We create many windows to detect different object shapes at different locations. To improve performance, one obvious solution is to &lt;strong>reduce the number of &lt;em>windows&lt;/em>&lt;/strong>.&lt;/p>
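&lt;p>A minimal sketch of how such windows could be enumerated (the window sizes, aspect ratios, and stride below are arbitrary choices for illustration; &lt;code>get_patch&lt;/code> and &lt;code>detector&lt;/code> stay as placeholders):&lt;/p>
&lt;pre>&lt;code class="language-python">def sliding_windows(img_w, img_h, sizes=((64, 64), (128, 64), (64, 128)), stride=32):
    # Yield (x, y, w, h) boxes for every window size at every valid position
    for (w, h) in sizes:
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                yield (x, y, w, h)

windows = list(sliding_windows(640, 480))
print(len(windows))  # already hundreds of windows for a single small image&lt;/code>&lt;/pre>
&lt;p>Even for one small image and a coarse stride, this already produces hundreds of windows, each of which would have to be classified.&lt;/p>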
&lt;h2 id="selective-search">Selective Search&lt;/h2>
&lt;p>Instead of a brute force approach, we use a region proposal method to create &lt;strong>regions of interest (ROIs)&lt;/strong> for object detection.&lt;/p>
&lt;p>In &lt;strong>selective search&lt;/strong> (&lt;strong>SS&lt;/strong>)&lt;/p>
&lt;ol>
&lt;li>We start with each individual pixel as its own group&lt;/li>
&lt;li>We calculate the texture for each group and combine the two that are closest (to avoid a single region gobbling up all the others, we prefer merging smaller groups first).&lt;/li>
&lt;li>We continue merging regions until everything is combined together.&lt;/li>
&lt;/ol>
&lt;p>The figure below illustrates this process:&lt;/p>
&lt;figure>&lt;img src="https://miro.medium.com/max/700/1*_8BNWWwyod1LWUdzcAUr8w.png">&lt;figcaption>
&lt;h4>In the first row, we show how we grow the regions, and the blue rectangles in the second row show all possible ROIs we made during the merging.&lt;/h4>
&lt;/figcaption>
&lt;/figure>
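&lt;p>OpenCV&amp;rsquo;s contrib module ships an implementation of selective search. A minimal sketch, assuming &lt;code>opencv-contrib-python&lt;/code> is installed and &lt;code>image.jpg&lt;/code> is a local test image:&lt;/p>
&lt;pre>&lt;code class="language-python">import cv2

# Requires the opencv-contrib-python package; image.jpg is an assumed local file
image = cv2.imread("image.jpg")

ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()   # trade some recall for speed

rects = ss.process()               # each rect is (x, y, w, h)
print(len(rects), "region proposals")&lt;/code>&lt;/pre>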
&lt;h2 id="r-cnn-1">R-CNN &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/h2>
&lt;p>&lt;strong>Region-based Convolutional Neural Networks (R-CNN)&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>Uses a region proposal method to create about 2000 &lt;strong>ROI&lt;/strong>s (regions of interest).&lt;/li>
&lt;li>The regions are warped into fixed-size images and fed into a CNN individually.&lt;/li>
&lt;li>Uses fully connected layers to classify the object and to refine the bounding box.&lt;/li>
&lt;/ol>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*Wmw21tBUez37bj-1ws7XEw.jpeg">&lt;figcaption>
&lt;h4>R-CNN uses &lt;strong>region proposals&lt;/strong>, a &lt;strong>CNN&lt;/strong>, and &lt;strong>FC layers&lt;/strong> to locate objects.&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-21%2022.08.28.png" alt="截屏2021-02-21 22.08.28">&lt;/p>
&lt;p>&lt;strong>System flow&lt;/strong>:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*ciyhZpgEvxDm1YxZd1SJWg.png" alt="Image for post">&lt;/p>
&lt;p>&lt;strong>Pseudo-code&lt;/strong>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">ROIs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">region_proposal&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># RoI from a proposal method (~2k)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">ROI&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">ROIs&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">patch&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">get_patch&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ROI&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">results&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">detector&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">patch&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>With far fewer but higher-quality ROIs, R-CNN runs faster and is more accurate than sliding windows. However, R-CNN is still very slow, because it needs to do about 2000 independent forward passes for each image! 🤪&lt;/p>
&lt;h2 id="fast-r-cnn-2">Fast R-CNN &lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/h2>
&lt;p>How does Fast R-CNN work?&lt;/p>
&lt;ul>
&lt;li>Instead of extracting features for each image patch from scratch, we use a &lt;strong>feature extractor&lt;/strong> (a CNN) to extract features for the whole image first.&lt;/li>
&lt;li>We also use an &lt;strong>external region proposal method&lt;/strong>, like the selective search, to create ROIs which later combine with the corresponding feature maps to form patches for object detection.&lt;/li>
&lt;li>We warp the patches to a fixed size using &lt;strong>ROI pooling&lt;/strong> and feed them to fully connected layers for classification and &lt;strong>localization&lt;/strong> (detecting the location of the object).&lt;/li>
&lt;/ul>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*Dd3-sugNKInTIv12u8cWkw.jpeg">&lt;figcaption>
&lt;h4>Fast R-CNN applies region proposals &lt;strong>on feature maps&lt;/strong> and forms fixed-size patches using &lt;strong>ROI pooling&lt;/strong>.&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-21%2022.40.17.png">&lt;figcaption>
&lt;h4>Fast R-CNN vs. R-CNN &lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>&lt;strong>System flow&lt;/strong>:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*fLMNHfe_QFxW569s4eR7Dg.jpeg" alt="Image for post">&lt;/p>
&lt;p>&lt;strong>Pseudo-code&lt;/strong>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">feature_maps&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">process&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ROIs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">region_proposal&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">ROI&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">ROIs&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">patch&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">roi_pooling&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">feature_maps&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ROI&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">results&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">detector2&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">patch&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>The expensive feature extraction is moved out of the for-loop. This is a significant speed improvement, since it was previously executed for each of the 2000 ROIs. &amp;#x1f44f;&lt;/li>
&lt;/ul>
&lt;p>One major takeaway for Fast R-CNN is that the whole network (the feature extractor, the classifier, and the bounding box regressor) is trained end-to-end with &lt;strong>multi-task losses&lt;/strong> (classification loss and localization loss). This improves accuracy.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/20180502185247910.png" alt="img">&lt;/p>
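&lt;p>A minimal sketch of such a multi-task loss, assuming PyTorch-style tensors and the convention that class 0 is the background (the function and variable names are illustrative, not Fast R-CNN&amp;rsquo;s actual code):&lt;/p>
&lt;pre>&lt;code class="language-python">import torch
import torch.nn.functional as F

def multi_task_loss(class_scores, box_deltas, gt_classes, gt_deltas, lam=1.0):
    # Classification loss over all sampled ROIs
    cls_loss = F.cross_entropy(class_scores, gt_classes)
    # Localization loss only for foreground ROIs (class index 0 = background)
    fg = gt_classes > 0
    if fg.any():
        loc_loss = F.smooth_l1_loss(box_deltas[fg], gt_deltas[fg])
    else:
        loc_loss = box_deltas.new_zeros(())
    return cls_loss + lam * loc_loss

# Toy shapes: 8 ROIs, 21 classes (20 + background), 4 box offsets per ROI
scores, deltas = torch.randn(8, 21), torch.randn(8, 4)
gt_cls, gt_del = torch.randint(0, 21, (8,)), torch.randn(8, 4)
print(multi_task_loss(scores, deltas, gt_cls, gt_del))&lt;/code>&lt;/pre>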
&lt;h3 id="roi-pooling">ROI pooling&lt;/h3>
&lt;p>Because Fast R-CNN uses fully connected layers, we apply &lt;strong>ROI pooling&lt;/strong> to warp the variable-size ROIs into a predefined fixed-size shape.&lt;/p>
&lt;p>&lt;em>Let&amp;rsquo;s take a look at a simple example: transforming 8 × 8 feature maps into a predefined 2 × 2 shape.&lt;/em>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*LLP4tKGsYGgAx3uPfmGdsw.png" alt="Image for post" style="zoom:80%;" />
&lt;ul>
&lt;li>
&lt;p>Top left: feature maps&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Top right: we overlap the ROI (blue) with the feature maps.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Bottom left: we split the ROI according to the target dimensions. For example, with our 2×2 target, we split the ROI into 4 sections of similar or equal size.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Bottom right: find the &lt;strong>maximum&lt;/strong> for each section (i.e., max-pool within each section), and the result is our warped feature map.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Now we get a 2 × 2 feature patch that we can feed into the classifier and box regressor.&lt;/p>
&lt;p>&lt;em>Another gif example&lt;/em>:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*5V5mycIRNu-mK-rPywL57w.gif" alt="Image for post">&lt;/p>
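&lt;p>A small NumPy sketch of the same idea, assuming a single-channel feature map and an ROI given as (row_start, col_start, row_end, col_end) in feature-map coordinates:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    # roi = (row_start, col_start, row_end, col_end), end-exclusive (assumed convention)
    r0, c0, r1, c1 = roi
    region = feature_map[r0:r1, c0:c1]
    out_h, out_w = output_size
    # Split the ROI into roughly equal sections along each axis
    row_edges = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    col_edges = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    pooled = np.zeros(output_size)
    for i in range(out_h):
        for j in range(out_w):
            section = region[row_edges[i]:row_edges[i + 1], col_edges[j]:col_edges[j + 1]]
            pooled[i, j] = section.max()   # max-pool within each section
    return pooled

feature_map = np.arange(64).reshape(8, 8)       # toy 8x8 feature map
print(roi_max_pool(feature_map, (0, 3, 7, 8)))  # 7x5 ROI warped into a 2x2 patch&lt;/code>&lt;/pre>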
&lt;h3 id="problems-of-fast-r-cnn">Problems of Fast R-CNN&lt;/h3>
&lt;p>Fast R-CNN depends on an external region proposal method like selective search. &lt;strong>However, those algorithms run on the CPU and they are slow&lt;/strong>. At test time, Fast R-CNN takes 2.3 seconds to make a prediction, of which 2 seconds are spent generating the 2000 ROIs!!!&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">feature_maps&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">process&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ROIs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">region_proposal&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># Expensive!&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">ROI&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">ROIs&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">patch&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">roi_pooling&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">feature_maps&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ROI&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">results&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">detector2&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">patch&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="faster-r-cnn-3-make-cnn-do-proposals">&lt;strong>Faster R-CNN&lt;/strong> &lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>: Make CNN do proposals&lt;/h2>
&lt;p>Faster R-CNN adopts a similar design to Fast R-CNN, &lt;strong>except&lt;/strong> that&lt;/p>
&lt;ul>
&lt;li>&lt;strong>it replaces the region proposal method by an internal deep network called Region Proposal Network (RPN)&lt;/strong>&lt;/li>
&lt;li>&lt;strong>the ROIs are derived from the feature maps instead&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;p>System flow: (same as Fast R-CNN)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*F-WbcUMpWSE1tdKRgew2Ug.png" alt="Image for post">&lt;/p>
&lt;p>The network flow is similar, but the region proposal is now produced by an internal convolutional network, the Region Proposal Network (RPN).&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*0cxB2pAxQ0A7AhTl-YT2JQ.jpeg">&lt;figcaption>
&lt;h4>The external region proposal is replaced by an internal Region Proposal Network (RPN).&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*JQfhkHK6V8NRuh-97Pg4lQ.png" alt="Image for post" style="zoom:80%;" />
&lt;p>&lt;strong>Pseudo-code&lt;/strong>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">feature_maps&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">process&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ROIs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">region_proposal&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">feature_maps&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># use RPN&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">ROI&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">ROIs&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">patch&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">roi_pooling&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">feature_maps&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ROI&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">class_scores&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">box&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">detector&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">patch&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">class_probabilities&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">softmax&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">class_scores&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="region-proposal-network-rpn">Region proposal network (RPN)&lt;/h3>
&lt;p>The region proposal network (&lt;strong>RPN&lt;/strong>)&lt;/p>
&lt;ul>
&lt;li>
&lt;p>takes the output feature maps from the first convolutional network as input&lt;/p>
&lt;/li>
&lt;li>
&lt;p>slides 3 × 3 filters over the feature maps to make class-agnostic region proposals using a convolutional network such as the ZF network&lt;/p>
&lt;figure>&lt;img src="https://miro.medium.com/max/1000/1*z0OHn89t0bOIHwoIOwNDtg.jpeg">&lt;figcaption>
&lt;h4>ZF network&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>Other deep networks like VGG or ResNet can be used for more comprehensive feature extraction at the cost of speed.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The ZF network outputs 256 values, which are fed into 2 separate fully connected (FC) layers to predict a bounding box and 2 objectness scores.&lt;/p>
&lt;ul>
&lt;li>The &lt;strong>objectness&lt;/strong> measures whether the box contains an object. We can use a regressor to compute a single objectness score but for simplicity, Faster R-CNN uses a classifier with 2 possible classes: one for the “have an object” category and one without (i.e. the background class).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>For each location in the feature maps, RPN makes &lt;strong>$k$&lt;/strong> guesses&lt;/p>
&lt;p>$\Rightarrow$ RPN outputs $4 \times k$ coordinates (top-left and bottom-right $(x, y)$ coordinates) for bounding box and $2 \times k$ scores for objectness (with vs. without object) per location&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Example: $8 \times 8$ feature maps with a $3 \times 3$ filter, and it outputs a total of $8 \times 8 \times 3$ ROIs (for $k = 3$)&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*smu6PiCx4LaPwGIo3HG0GQ.jpeg" alt="Image for post">&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Here we get 3 guesses, which we will refine later. Since we just need one to be correct, we will be better off if our initial guesses have different shapes and sizes.&lt;/p>
&lt;p>Therefore, Faster R-CNN does not make random bounding box proposals. Instead, it predicts offsets like $\delta_x, \delta_y$ relative to the top-left corner of some reference boxes called &lt;strong>anchors&lt;/strong>. We constrain the values of those offsets so that our guesses still resemble the anchors.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*yF_FrZAkXA3XKFA-sf7XZw.png" alt="Image for post">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>To make $k$ predictions per location, we need $k$ anchors centered at each location. Each prediction is associated with a specific anchor but different locations share the &lt;strong>same&lt;/strong> anchor shapes.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*RJoauxGwUTF17ZANQmL8jw.png" alt="Image for post">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Those anchors are carefully pre-selected so they are diverse and cover real-life objects at different scales and aspect ratios reasonably well.&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>This guides the initial training with better guesses and allows each prediction to specialize in a certain shape. This strategy makes early training more stable and easier. 👍&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Faster R-CNN uses far more anchors. It deploys 9 anchor boxes: &lt;strong>3 different scales at 3 different aspect ratios.&lt;/strong> Using 9 anchors per location, it generates 2 × 9 objectness scores and 4 × 9 coordinates per location (a sketch of generating such anchors follows the note below).&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*PszFnq3rqa_CAhBrI94Eeg.png" alt="Image for post">&lt;/p>
&lt;/li>
&lt;/ul>
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">&lt;strong>Anchors&lt;/strong> are also called &lt;strong>priors&lt;/strong> or &lt;strong>default boundary boxes&lt;/strong> in different papers.&lt;/span>
&lt;/div>
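&lt;p>A minimal sketch of generating the $3 \times 3 = 9$ anchors for every feature-map location (the stride, scales, and aspect ratios are the defaults reported in the Faster R-CNN paper; the function itself is illustrative):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    # 9 anchors (3 scales x 3 aspect ratios) centered at every feature-map location
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride   # center in image coords
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)     # keep area ~ s**2, w/h = r
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)   # (feat_h * feat_w * 9, 4) boxes as (x1, y1, x2, y2)

print(generate_anchors(2, 2).shape)   # (36, 4): 4 locations x 9 anchors each&lt;/code>&lt;/pre>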
&lt;details>
&lt;summary>Nice example and explanation from Stanford cs231n slide&lt;/summary>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-22%2022.04.55.png" alt="截屏2021-02-22 22.04.55">&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-22%2022.09.09.png" alt="截屏2021-02-22 22.09.09">&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-22%2022.05.14.png" alt="截屏2021-02-22 22.05.14">&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-22%2022.05.22.png" alt="截屏2021-02-22 22.05.22">&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-02-22%2022.05.32.png" alt="截屏2021-02-22 22.05.32">&lt;/p>
&lt;/details>
&lt;h2 id="region-based-fully-convolutional-networks-r-fcn-4">Region-based Fully Convolutional Networks (R-FCN) &lt;sup id="fnref:4">&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref">4&lt;/a>&lt;/sup>&lt;/h2>
&lt;h3 id="-idea">💡 Idea&lt;/h3>
&lt;p>Let’s assume we only have a feature map detecting the right eye of a face. Can we use it to locate a face? We should be able to: since the right eye should be in the top-left corner of a facial picture, we can use that to locate the face.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*gqxSBKVla8dzwADKgADpWg-20210222160628867.jpeg" alt="Image for post">&lt;/p>
&lt;p>If we have other feature maps specialized in detecting the left eye, the nose or the mouth, we can combine the results together to locate the face better.&lt;/p>
&lt;h3 id="problem-of-faster-r-cnn">Problem of Faster R-CNN&lt;/h3>
&lt;p>In Faster R-CNN, the &lt;em>detector&lt;/em> applies multiple fully connected layers to make predictions. With 2,000 ROIs, it can be expensive.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">feature_maps&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">process&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ROIs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">region_proposal&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">feature_maps&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">ROI&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">ROIs&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">patch&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">roi_pooling&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">feature_maps&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ROI&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">class_scores&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">box&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">detector&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">patch&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># Expensive!&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">class_probabilities&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">softmax&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">class_scores&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="r-fcn-reduce-the-amount-of-work-needed-for-each-roi">R-FCN: reduce the amount of work needed for each ROI&lt;/h3>
&lt;p>R-FCN improves speed by &lt;strong>reducing the amount of work needed for each ROI.&lt;/strong> The region-based feature maps above are independent of ROIs and can be computed outside each ROI. The remaining work is then much simpler and therefore R-FCN is faster than Faster R-CNN.&lt;/p>
&lt;p>&lt;strong>Pseudo-code&lt;/strong>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">feature_maps&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">process&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ROIs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">region_proposal&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">feature_maps&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">score_maps&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">compute_score_map&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">feature_maps&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">ROI&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">ROIs&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">V&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">region_roi_pool&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">score_maps&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ROI&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">class_scores&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">box&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">average&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">V&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># Much simpler!&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">class_probabilities&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">softmax&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">class_scores&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="position-sensitive-score-mapping">&lt;strong>Position-sensitive score mapping&lt;/strong>&lt;/h3>
&lt;p>Let’s consider a 5 × 5 feature map &lt;strong>M&lt;/strong> with a blue square object inside. We divide the square object equally into 3 × 3 regions.&lt;/p>
&lt;p>Now, we create a new feature map from M to detect the top left (TL) corner of the square only. The new feature map looks like the one on the right below. &lt;strong>Only the yellow grid cell [2, 2] is activated.&lt;/strong>&lt;/p>
&lt;figure>&lt;img src="https://miro.medium.com/max/700/1*S0enLblW1t7VK19E1Fs4lw.png">&lt;figcaption>
&lt;h4>Create a new feature map from the left to detect the top left corner of an object.&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>Since we divide the square into 9 parts, we can create 9 feature maps each detecting the corresponding region of the object. These feature maps are called &lt;strong>position-sensitive score maps&lt;/strong> because each map detects (scores) a sub-region of the object.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*HaOHsDYAf8LU2YQ7D3ymOg.png" alt="Image for post">&lt;/p>
&lt;p>Let’s say the dotted red rectangle below is the proposed ROI. We divide it into 3 × 3 regions and ask &lt;strong>how likely it is that each region contains the corresponding part of the object&lt;/strong>.&lt;/p>
&lt;p>For example, how likely is it that the top-left ROI region contains the corresponding top-left part of the object (in the face analogy, the left eye)? We store the results in a 3 × 3 vote array, shown in the right diagram below. For example, &lt;code>vote_array[0][0]&lt;/code> contains the score on whether we find the top-left region of the square object.&lt;/p>
&lt;figure>&lt;img src="https://miro.medium.com/max/700/1*Ym6b1qS0pXpeRVMysvvukg.jpeg">&lt;figcaption>
&lt;h4>Apply ROI onto the feature maps to output a 3 x 3 array.&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>This process of mapping score maps and ROIs to the vote array is called &lt;strong>position-sensitive ROI pooling&lt;/strong>.&lt;/p>
&lt;figure>&lt;img src="https://miro.medium.com/max/700/1*K4brSqensF8wL5i6JV1Eig.png">&lt;figcaption>
&lt;h4>Overlay a portion of the ROI onto the corresponding score map to calculate &lt;code>V[i][j]&lt;/code>&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;p>After calculating all the values for the position-sensitive ROI pool, &lt;strong>the class score is the average of all its elements.&lt;/strong>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/1*ZJiWcIl2DUyx1-ZqArw33A.png" alt="Image for post" style="zoom:80%;" />
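&lt;p>A toy NumPy sketch of position-sensitive ROI pooling for a single class, assuming the $k \times k$ score maps are stacked along the first axis and the ROI is given in feature-map coordinates:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def ps_roi_pool(score_maps, roi, k=3):
    # score_maps: (k*k, H, W) array, one score map per sub-region (assumed layout)
    # roi: (row_start, col_start, row_end, col_end), end-exclusive
    r0, c0, r1, c1 = roi
    row_edges = np.linspace(r0, r1, k + 1).astype(int)
    col_edges = np.linspace(c0, c1, k + 1).astype(int)
    votes = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            # Sub-region (i, j) of the ROI is read from its own score map
            m = score_maps[i * k + j]
            region = m[row_edges[i]:row_edges[i + 1], col_edges[j]:col_edges[j + 1]]
            votes[i, j] = region.mean()    # average-pool within the sub-region
    return votes

score_maps = np.random.rand(9, 5, 5)            # k = 3 gives 9 position-sensitive maps
votes = ps_roi_pool(score_maps, (1, 1, 5, 5))   # the proposed ROI
class_score = votes.mean()                      # class score = average of all votes&lt;/code>&lt;/pre>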
&lt;h3 id="data-flow">Data flow&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Let’s say we have &lt;strong>$C$&lt;/strong> classes to detect.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>We expand it to $C + 1$ classes so we include a new class for the background (non-object). Each class will have its own $3 \times 3$ score maps and therefore a total of $(C+1) \times 3 \times 3$ score maps.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Using its own set of score maps, we predict a class score for each class.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Then we apply a softmax on those scores to compute the probability for each class.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;figure>&lt;img src="https://miro.medium.com/max/1000/1*Gv45peeSM2wRQEdaLG_YoQ.png">&lt;figcaption>
&lt;h4>Data flow of R-FCN ($k=3$)&lt;/h4>
&lt;/figcaption>
&lt;/figure>
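&lt;p>Putting the numbers together for an assumed setting of $C = 20$ classes and $k = 3$ (so that there are $(C+1) \times k \times k$ score maps in total):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

C, k = 20, 3                          # 20 object classes + 1 background, 3x3 grid
num_score_maps = (C + 1) * k * k      # 189 position-sensitive score maps

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# One averaged vote per class (including background), e.g. from a
# position-sensitive ROI pool as sketched earlier
class_scores = np.random.rand(C + 1)
class_probabilities = softmax(class_scores)
print(num_score_maps, class_probabilities.sum())  # 189, 1.0&lt;/code>&lt;/pre>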
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://jonathan-hui.medium.com/what-do-we-learn-from-region-based-object-detectors-faster-r-cnn-r-fcn-fpn-7e354377a7c9">What do we learn from region based object detectors (Faster R-CNN, R-FCN, FPN)?&lt;/a> - A nice and clear comprehensive tutorial for region-based object detectors&lt;/li>
&lt;li>&lt;a href="http://cs231n.stanford.edu/slides/2020/lecture_12.pdf">Stanford CS231n slides&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://medium.com/cubo-ai/%E7%89%A9%E9%AB%94%E5%81%B5%E6%B8%AC-object-detection-740096ec4540">關於影像辨識，所有你應該知道的深度學習模型&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://blog.csdn.net/v_JULY_v/article/details/80170182">一文读懂目标检测：R-CNN、Fast R-CNN、Faster R-CNN、YOLO、SSD&lt;/a>&lt;/li>
&lt;li>RoI pooling: &lt;a href="https://towardsdatascience.com/understanding-region-of-interest-part-1-roi-pooling-e4f5dd65bb44">Understanding Region of Interest — (RoI Pooling)&lt;/a>&lt;/li>
&lt;/ul>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>Girshick, R., Donahue, J., Darrell, T., &amp;amp; Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. &lt;em>Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition&lt;/em>, 580–587. &lt;a href="https://doi.org/10.1109/CVPR.2014.81">https://doi.org/10.1109/CVPR.2014.81&lt;/a>&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2">
&lt;p>Girshick, R. (2015). Fast R-CNN. &lt;em>Proceedings of the IEEE International Conference on Computer Vision&lt;/em>, &lt;em>2015 International Conference on Computer Vision&lt;/em>, &lt;em>ICCV 2015&lt;/em>, 1440–1448. &lt;a href="https://doi.org/10.1109/ICCV.2015.169">https://doi.org/10.1109/ICCV.2015.169&lt;/a>&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3">
&lt;p>Ren, S., He, K., Girshick, R., &amp;amp; Sun, J. (2017). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. &lt;em>IEEE Transactions on Pattern Analysis and Machine Intelligence&lt;/em>, &lt;em>39&lt;/em>(6), 1137–1149. &lt;a href="https://doi.org/10.1109/TPAMI.2016.2577031">https://doi.org/10.1109/TPAMI.2016.2577031&lt;/a>&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:4">
&lt;p>Dai, J., Li, Y., He, K., &amp;amp; Sun, J. (2016). R-FCN: Object detection via region-based fully convolutional networks. &lt;em>Advances in Neural Information Processing Systems&lt;/em>, 379–387.&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item></channel></rss>