<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Segmentation | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/segmentation/</link><atom:link href="https://haobin-tan.netlify.app/tags/segmentation/index.xml" rel="self" type="application/rss+xml"/><description>Segmentation</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sat, 19 Dec 2020 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>Segmentation</title><link>https://haobin-tan.netlify.app/tags/segmentation/</link></image><item><title>Segmentation</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/segmentation/</link><pubDate>Sat, 19 Dec 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/segmentation/</guid><description/></item><item><title>Semantic Segmentation Overview</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/segmentation/semantic-segmentation-overview/</link><pubDate>Sat, 19 Dec 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/segmentation/semantic-segmentation-overview/</guid><description>&lt;h2 id="what-is-semantic-segmentation">What is Semantic Segmentation?&lt;/h2>
&lt;p>Image segmentation is a computer vision task in which we label specific regions of an image according to what&amp;rsquo;s being shown.&lt;/p>
&lt;p>The goal of semantic image segmentation is to &lt;strong>label &lt;em>each pixel&lt;/em> of an image with a corresponding &lt;em>class&lt;/em> of what is being represented&lt;/strong>. Because we&amp;rsquo;re predicting for every pixel in the image, this task is commonly referred to as &lt;strong>dense prediction&lt;/strong>.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Screen-Shot-2018-05-17-at-7.42.16-PM.png" alt="Screen-Shot-2018-05-17-at-7.42.16-PM">&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-01-22%2018.18.37.png" alt="截屏2021-01-22 18.18.37">&lt;/p>
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">&lt;p>&lt;strong>Don’t differentiate instances, only care about pixels&lt;/strong>&lt;/p>
&lt;p>We&amp;rsquo;re NOT separating &lt;em>instances&lt;/em> of the same class; we only care about the &lt;strong>category&lt;/strong> of each pixel. In other words, if you have two objects of the same category in your input image, the segmentation map does not inherently distinguish these as separate objects.&lt;/p>
&lt;/span>
&lt;/div>
&lt;h2 id="use-case">Use case&lt;/h2>
&lt;h3 id="autonomous-vehicles">Autonomous vehicles&lt;/h3>
&lt;p>&lt;img src="https://www.jeremyjordan.me/content/images/2018/05/deeplabcityscape.gif" alt="deeplabcityscape">&lt;/p>
&lt;h3 id="medical-image-diagnostics">Medical image diagnostics&lt;/h3>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Screen-Shot-2018-05-23-at-7.17.43-PM.png" alt="Screen-Shot-2018-05-23-at-7.17.43-PM">&lt;/p>
&lt;h2 id="task">Task&lt;/h2>
&lt;p>Our goal is to take either an RGB color image ($height×width×3$) or a grayscale image ($height×width×1$) and output a segmentation map in which each pixel contains a class label represented as an integer ($height×width×1$).&lt;/p>
&lt;p>Example:&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Screen-Shot-2018-05-17-at-9.02.15-PM.png">&lt;figcaption>
&lt;h4>Note: This is a labeled low-resolution prediction map for visual clarity. In reality, the segmentation label resolution should match the original input&amp;#39;s resolution.&lt;/h4>
&lt;/figcaption>
&lt;/figure>
&lt;h3 id="how-to-make-prediction-output">How to make prediction output?&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>Create our &lt;strong>target&lt;/strong> by one-hot encoding the class labels, essentially creating an output channel for each possible class.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Screen-Shot-2018-05-16-at-9.36.00-PM.png" alt="Screen-Shot-2018-05-16-at-9.36.00-PM">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>A prediction can be collapsed into a segmentation map by taking the &lt;code>argmax&lt;/code> of each depth-wise pixel vector.&lt;/p>
&lt;p>We can easily inspect a target by overlaying it onto the observation.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Screen-Shot-2018-05-16-at-9.36.38-PM.png" alt="Screen-Shot-2018-05-16-at-9.36.38-PM">&lt;/p>
&lt;p>When we overlay a &lt;em>single channel&lt;/em> of our target (or prediction), we refer to this as a &lt;strong>mask&lt;/strong> which illuminates the regions of an image where a specific class is present.&lt;/p>
&lt;/li>
&lt;/ol>
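The two steps above can be sketched in NumPy (a minimal illustration; the label map and class count are made up):

```python
import numpy as np

# Toy 3x3 label map with 3 possible classes (hypothetical values)
labels = np.array([[0, 0, 1],
                   [0, 2, 1],
                   [2, 2, 1]])
num_classes = 3

# 1. One-hot encode the labels: one channel per class -> (H, W, C)
target = np.eye(num_classes)[labels]
print(target.shape)  # (3, 3, 3)

# 2. Collapse back to a segmentation map by taking the argmax of
#    each depth-wise pixel vector (here applied to the target itself)
seg_map = target.argmax(axis=-1)
print(np.array_equal(seg_map, labels))  # True
```

For a model's prediction, the same `argmax` collapse applies to its per-pixel class scores.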
&lt;h2 id="architecture">Architecture&lt;/h2>
&lt;p>One popular approach for image segmentation models is to follow an &lt;strong>encoder/decoder structure&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>We &lt;em>&lt;strong>downsample&lt;/strong>&lt;/em> the spatial resolution of the input, developing lower-resolution feature maps that are learned to be highly effective at discriminating between classes,&lt;/li>
&lt;li>then &lt;em>&lt;strong>upsample&lt;/strong>&lt;/em> the feature representations into a full-resolution segmentation map.&lt;/li>
&lt;/ol>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Screen-Shot-2018-05-16-at-10.33.29-PM.png" alt="Screen-Shot-2018-05-16-at-10.33.29-PM">&lt;/p>
&lt;h3 id="methods-for-upsampling">Methods for upsampling&lt;/h3>
&lt;h4 id="unpooling-operations">Unpooling operations&lt;/h4>
&lt;p>Whereas pooling operations downsample the resolution by summarizing a local area with a single value (e.g. average or max pooling), &amp;ldquo;&lt;strong>unpooling&lt;/strong>&amp;rdquo; operations upsample the resolution by distributing a single value into a higher resolution.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Screen-Shot-2018-05-19-at-12.54.50-PM.png" alt="Screen-Shot-2018-05-19-at-12.54.50-PM">&lt;/p>
&lt;h4 id="transpose-convolutions">Transpose convolutions&lt;/h4>
&lt;p>&lt;strong>Transpose convolutions&lt;/strong> are by far the most popular approach, as they allow us to develop a &lt;em>learned upsampling&lt;/em>.&lt;/p>
&lt;p>A typical convolution operation:&lt;/p>
&lt;ol>
&lt;li>take the dot product of the values currently in the filter&amp;rsquo;s view&lt;/li>
&lt;li>produce a single value for the corresponding output position.&lt;/li>
&lt;/ol>
&lt;p>A transpose convolution essentially does the opposite:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>take a single value from the low-resolution feature map&lt;/p>
&lt;/li>
&lt;li>
&lt;p>multiply all of the weights in our filter by this value, projecting those weighted values into the output feature map.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>For filter sizes which produce an overlap in the output feature map (e.g. a 3×3 filter with stride 2), the overlapping values are simply added together.&lt;/p>
&lt;p>(Unfortunately, this tends to produce a checkerboard artifact in the output and is undesirable, so it&amp;rsquo;s best to ensure that your filter size does not produce an overlap.)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Screen-Shot-2018-05-21-at-11.01.29-PM.png" alt="Screen-Shot-2018-05-21-at-11.01.29-PM">&lt;/p>
&lt;/li>
&lt;/ol>
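The two-step procedure above can be written out directly in NumPy (a minimal sketch with a made-up 2×2 input and a 3×3 all-ones filter at stride 2, so the overlapping contributions described above are visible):

```python
import numpy as np

def transpose_conv2d(x, w, stride):
    """Naive transposed convolution: each input value scales the whole
    filter, and the weighted copies are added into the output."""
    h, w_in = x.shape
    k = w.shape[0]
    out = np.zeros(((h - 1) * stride + k, (w_in - 1) * stride + k))
    for i in range(h):
        for j in range(w_in):
            # Project the scaled filter into the output window; where
            # windows overlap (k > stride), values are simply summed.
            out[i*stride:i*stride+k, j*stride:j*stride+k] += x[i, j] * w
    return out

x = np.array([[1., 2.],
              [3., 4.]])
w = np.ones((3, 3))  # 3x3 filter with stride 2 -> overlapping windows
y = transpose_conv2d(x, w, stride=2)
print(y.shape)   # (5, 5)
print(y[2, 2])   # 10.0: all four input windows overlap here (1+2+3+4)
```

The summed overlap at the center is exactly the source of the checkerboard artifact mentioned above.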
&lt;h3 id="fully-convolutional-networks">Fully convolutional networks&lt;/h3>
&lt;p>The approach of using a &amp;ldquo;&lt;strong>fully convolutional&lt;/strong>&amp;rdquo; network trained end-to-end, pixels-to-pixels for the task of image segmentation was introduced by &lt;a href="https://arxiv.org/abs/1411.4038">Long et al.&lt;/a> in late 2014.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2021-01-22%2018.20.02.png" alt="截屏2021-01-22 18.20.02">&lt;/p>
&lt;ul>
&lt;li>
&lt;p>adapting existing, well-studied &lt;em>image classification&lt;/em> networks (e.g. AlexNet) to serve as the &lt;strong>encoder&lt;/strong> module of the network&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Screen-Shot-2018-05-20-at-9.53.20-AM.png" alt="Screen-Shot-2018-05-20-at-9.53.20-AM">&lt;/p>
&lt;/li>
&lt;li>
&lt;p>appending a &lt;strong>decoder&lt;/strong> module with transpose convolutional layers to upsample the coarse feature maps into a full-resolution segmentation map.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>However, because the encoder module reduces the resolution of the input by a factor of 32, the decoder module &lt;strong>struggles to produce fine-grained segmentations&lt;/strong> 🤪&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Screen-Shot-2018-05-20-at-10.15.09-AM.png" alt="Screen-Shot-2018-05-20-at-10.15.09-AM">&lt;/p>
&lt;h3 id="adding-skip-connections">Adding skip connections&lt;/h3>
&lt;p>The authors address this tension by slowly upsampling (in stages) the encoded representation, adding &amp;ldquo;skip connections&amp;rdquo; from earlier layers, and summing these two feature maps. These skip connections from earlier layers in the network (prior to a downsampling operation) should provide the necessary detail in order to reconstruct accurate shapes for segmentation boundaries.&lt;/p>
&lt;p>Indeed, we can recover more fine-grained detail with the addition of these skip connections.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Screen-Shot-2018-05-20-at-12.10.25-PM.png" alt="Screen-Shot-2018-05-20-at-12.10.25-PM">&lt;/p>
&lt;h3 id="u-net">U-net&lt;/h3>
&lt;p>&lt;a href="https://arxiv.org/abs/1505.04597">Ronneberger et al.&lt;/a> improve upon the &amp;ldquo;fully convolutional&amp;rdquo; architecture primarily through &lt;em>&lt;strong>expanding the capacity of the decoder&lt;/strong>&lt;/em> module of the network.&lt;/p>
&lt;p>They propose the &lt;strong>U-Net architecture&lt;/strong> which &amp;ldquo;consists of a contracting path to capture context and a &lt;em>&lt;strong>symmetric&lt;/strong>&lt;/em> expanding path that enables precise localization.&amp;rdquo; This simpler architecture has grown to be very popular and has been adapted for a variety of segmentation problems.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Screen-Shot-2018-05-20-at-1.46.43-PM.png" alt="U Net">&lt;/p>
&lt;h2 id="metrics">Metrics&lt;/h2>
&lt;p>Intuitively, a successful prediction is one which maximizes the overlap between the predicted and true objects. Two related but different metrics for this goal are the &lt;a href="https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient">Dice&lt;/a> and &lt;a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard&lt;/a> coefficients (or indices):
&lt;/p>
$$
Dice(A, B) = \frac{2 \|A \cap B\|}{\|A\|+\|B\|}, \qquad Jaccard(A, B) = \frac{\|A \cap B\|}{\|A \cup B\|}
$$
&lt;ul>
&lt;li>$A, B$: two segmentation masks for a given class (but the formulas are general, that is, you could calculate this for anything, e.g. a circle and a square)&lt;/li>
&lt;li>$\|\cdot\|$: norm (for images, the area in pixels)&lt;/li>
&lt;li>$\cap, \cup$: intersection and union operators.&lt;/li>
&lt;/ul>
&lt;p>Both the Dice and Jaccard indices are bounded between 0 (when there is no overlap) and 1 (when A and B match perfectly). The Jaccard index is also known as &lt;strong>Intersection over Union (IoU)&lt;/strong>.&lt;/p>
&lt;p>Here is an illustration of the Dice and IoU metrics given two circles representing the ground truth and the predicted masks for an arbitrary object class:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/metrics_iou_dice.png" alt="IoU, Dice example">&lt;/p>
&lt;p>In terms of the &lt;a href="https://en.wikipedia.org/wiki/Confusion_matrix">confusion matrix&lt;/a>, the metrics can be rephrased in terms of true/false positives/negatives:
&lt;/p>
$$
Dice = \frac{2 TP}{2TP+FP+FN}, \qquad Jaccard = IoU = \frac{TP}{TP+FP+FN}
$$
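As a quick sanity check on the confusion-matrix forms, with hypothetical pixel counts the two metrics satisfy $Dice = 2 \cdot IoU / (1 + IoU)$:

```python
tp, fp, fn = 80, 10, 20  # hypothetical pixel counts

dice = 2 * tp / (2 * tp + fp + fn)
iou = tp / (tp + fp + fn)

print(round(dice, 4))  # 0.8421
print(round(iou, 4))   # 0.7273
print(abs(dice - 2 * iou / (1 + iou)) < 1e-12)  # True
```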
&lt;h3 id="intersection-over-union-iou">Intersection over Union (IoU)&lt;/h3>
&lt;p>The &lt;strong>Intersection over Union (IoU)&lt;/strong> metric, also referred to as the &lt;strong>Jaccard index&lt;/strong>, is essentially a method to quantify the percent overlap between the target mask and our prediction output.&lt;/p>
&lt;p>Quite simply, the IoU metric &lt;strong>measures the number of pixels common between the target and prediction masks divided by the total number of pixels present across &lt;em>both&lt;/em> masks.&lt;/strong>
&lt;/p>
$$
IoU = \frac{{target \cap prediction}}{{target \cup prediction}}
$$
&lt;p>
The IoU score is calculated for each class separately and then &lt;strong>averaged over all classes&lt;/strong> to provide a global, mean IoU score of our semantic segmentation prediction.&lt;/p>
&lt;h4 id="numpy-implementation">Numpy implementation&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">intersection&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">logical_and&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">target&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">prediction&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">union&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">logical_or&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">target&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">prediction&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">iou_score&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sum&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">intersection&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">/&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sum&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">union&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="example">Example&lt;/h4>
&lt;p>Let&amp;rsquo;s say we&amp;rsquo;re tasked with calculating the IoU score of the following prediction, given the ground truth labeled mask.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/target_prediction.png" alt="target_prediction">&lt;/p>
&lt;p>The &lt;strong>intersection&lt;/strong> ($A∩B$) is comprised of the pixels found in both the prediction mask &lt;em>and&lt;/em> the ground truth mask, whereas the &lt;strong>union&lt;/strong> ($A∪B$) is simply comprised of all pixels found in either the prediction &lt;em>or&lt;/em> target mask.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/intersection_union.png" alt="intersection_union">&lt;/p>
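Extending the single-mask NumPy snippet above to a per-class mean IoU over integer label maps might look like this (a sketch; the label maps are made up, and classes absent from both masks are skipped):

```python
import numpy as np

def mean_iou(target, prediction, num_classes):
    """Average IoU over classes, computed from integer label maps."""
    ious = []
    for c in range(num_classes):
        t, p = (target == c), (prediction == c)
        union = np.logical_or(t, p).sum()
        if union == 0:  # class absent from both masks: skip it
            continue
        ious.append(np.logical_and(t, p).sum() / union)
    return np.mean(ious)

target = np.array([[0, 0, 1],
                   [0, 1, 1],
                   [2, 2, 2]])
prediction = np.array([[0, 1, 1],
                       [0, 1, 1],
                       [2, 2, 2]])
print(round(mean_iou(target, prediction, num_classes=3), 4))  # 0.8056
```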
&lt;h3 id="pixel-accuracy">Pixel accuracy&lt;/h3>
&lt;p>Pixel accuracy simply reports the percentage of pixels in the image that were correctly classified. It is commonly reported for each class separately as well as globally across all classes.&lt;/p>
&lt;p>When considering the per-class pixel accuracy we&amp;rsquo;re essentially evaluating a binary mask:
&lt;/p>
$$
accuracy = \frac{{TP + TN}}{{TP + TN + FP + FN}}
$$
&lt;p>
However, this metric can sometimes provide misleading results when the class representation is small within the image, as the measure will be biased toward reporting how well you identify negative cases (i.e. where the class is not present).&lt;/p>
&lt;h2 id="loss-function">Loss function&lt;/h2>
&lt;h3 id="pixel-wise-cross-entropy-loss">Pixel-wise cross entropy loss&lt;/h3>
&lt;p>This loss examines &lt;em>each pixel individually&lt;/em>, comparing the class predictions (depth-wise pixel vector) to our one-hot encoded target vector.&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Screen-Shot-2018-05-24-at-10.46.16-PM.png" alt="cross entropy">&lt;/p>
&lt;p>However, because the cross entropy loss evaluates the class predictions for each pixel vector individually and then averages over all pixels, we&amp;rsquo;re essentially asserting equal learning to each pixel in the image. This can be a problem if your various classes have unbalanced representation in the image, as training can be dominated by the most prevalent class.&lt;/p>
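Written out per pixel, this loss might look as follows (a minimal NumPy sketch; `probs` stands in for softmax outputs of shape (H, W, C), and the values are made up):

```python
import numpy as np

def pixelwise_cross_entropy(probs, labels):
    """Mean of -log p(true class) over all pixels.
    probs: (H, W, C) softmax probabilities; labels: (H, W) ints."""
    h, w = labels.shape
    # Pick each pixel's predicted probability for its true class
    p_true = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    # Average over all pixels -> every pixel contributes equally,
    # which is exactly why class imbalance can dominate training
    return -np.log(p_true).mean()

probs = np.array([[[0.9, 0.1], [0.4, 0.6]],
                  [[0.2, 0.8], [0.7, 0.3]]])  # (2, 2, 2), made up
labels = np.array([[0, 1],
                   [1, 0]])
print(round(pixelwise_cross_entropy(probs, labels), 4))  # 0.299
```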
&lt;h3 id="dice-loss">Dice loss&lt;/h3>
&lt;p>Another popular loss function for image segmentation tasks is based on the &lt;a href="https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient">Dice coefficient&lt;/a>, which is essentially a measure of overlap between two samples.&lt;/p>
&lt;p>This measure ranges from 0 to 1 where a Dice coefficient of 1 denotes perfect and complete overlap. The Dice coefficient was originally developed for binary data, and can be calculated as:
&lt;/p>
$$
Dice = \frac{{2\left| {A \cap B} \right|}}{{\left| A \right| + \left| B \right|}}
$$
&lt;ul>
&lt;li>${\left| {A \cap B} \right|}$: common elements between sets $A$ and $B$&lt;/li>
&lt;li>$|A|$: number of elements in set $A$ (and likewise for set $B$).&lt;/li>
&lt;/ul>
&lt;p>To evaluate a Dice coefficient on predicted segmentation masks, we can approximate $|A ∩ B|$ as the element-wise multiplication between the prediction and target mask, and then sum the resulting matrix:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/intersection-1.png" alt="intersection">&lt;/p>
&lt;p>Because our target mask is binary, we effectively zero-out any pixels from our prediction which are not &amp;ldquo;activated&amp;rdquo; in the target mask. For the remaining pixels, we are essentially penalizing low-confidence predictions; &lt;strong>a higher value for this expression, which is in the numerator, leads to a better Dice coefficient.&lt;/strong>&lt;/p>
&lt;p>In order to formulate a loss function which can be minimized, we&amp;rsquo;ll simply use
&lt;/p>
$$
1 - Dice
$$
&lt;p>
This loss function is known as the &lt;strong>soft Dice loss&lt;/strong> because we directly use the predicted probabilities instead of thresholding and converting them into a binary mask.&lt;/p>
&lt;p>With respect to the neural network output, the numerator is concerned with the &lt;em>common activations&lt;/em> between our prediction and target mask, where as the denominator is concerned with the quantity of activations in each mask &lt;em>separately&lt;/em>. This has the effect of &lt;strong>normalizing&lt;/strong> our loss according to the size of the target mask such that the soft Dice loss does not struggle learning from classes with lesser spatial representation in an image.&lt;/p>
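Putting the pieces together, the soft Dice loss described above can be sketched in NumPy (the prediction is a probability mask, the target is binary; the small `eps` smoothing term guarding against empty masks is an added assumption, and the mask values are made up):

```python
import numpy as np

def soft_dice_loss(prediction, target, eps=1e-6):
    """1 - Dice, using raw probabilities (no thresholding).
    The numerator approximates |A ∩ B| via element-wise product."""
    intersection = (prediction * target).sum()
    dice = (2 * intersection + eps) / (prediction.sum() + target.sum() + eps)
    return 1 - dice

target = np.array([[1., 1., 0.],
                   [0., 1., 0.],
                   [0., 0., 0.]])
prediction = np.array([[0.9, 0.8, 0.1],
                       [0.2, 0.7, 0.1],
                       [0.0, 0.1, 0.0]])
print(round(soft_dice_loss(prediction, target), 4))  # 0.1864
```

Note how the denominator normalizes by the total activation in each mask, which is what keeps small classes from being drowned out.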
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>🔥 Overview: &lt;a href="https://www.jeremyjordan.me/semantic-segmentation/">An overview of semantic image segmentation.&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Segmentation metrics:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="https://ilmonteux.github.io/2019/05/10/segmentation-metrics.html">Metrics for semantic segmentation&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>🔥 &lt;a href="https://www.jeremyjordan.me/evaluating-image-segmentation-models/">Evaluating image segmentation models.&lt;/a>&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Semantic Segmentation with PyTorch</title><link>https://haobin-tan.netlify.app/docs/ai/computer-vision/segmentation/sem_seg_pytorch/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/computer-vision/segmentation/sem_seg_pytorch/</guid><description>&lt;h2 id="what-is-semantic-segmentation">What is Semantic Segmentation?&lt;/h2>
&lt;p>Semantic Segmentation is an image analysis task in which we classify each pixel in the image into a class.&lt;/p>
&lt;p>Let&amp;rsquo;s say we have the following image:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/index3.png" alt="img">&lt;/p>
&lt;p>Its semantically segmented image would be the following:&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/index4.png" alt="img">&lt;/p>
&lt;p>Each pixel in the image is classified into its respective class.&lt;/p>
&lt;h2 id="use-pytorch-for-semantic-segmentation">Use PyTorch for Semantic Segmentation&lt;/h2>
&lt;h3 id="input-and-output">Input and Output&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Segmentation models expect a 3-channel image which is normalized with the ImageNet mean and standard deviation, i.e.,
&lt;code>mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]&lt;/code>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Input is &lt;code>[Ni x Ci x Hi x Wi]&lt;/code>&lt;/p>
&lt;ul>
&lt;li>&lt;code>Ni&lt;/code> -&amp;gt; the batch size&lt;/li>
&lt;li>&lt;code>Ci&lt;/code> -&amp;gt; the number of channels (which is 3)&lt;/li>
&lt;li>&lt;code>Hi&lt;/code> -&amp;gt; the height of the image&lt;/li>
&lt;li>&lt;code>Wi&lt;/code> -&amp;gt; the width of the image&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Output of the model is &lt;code>[No x Co x Ho x Wo]&lt;/code>&lt;/p>
&lt;ul>
&lt;li>&lt;code>No&lt;/code> -&amp;gt; the batch size (same as &lt;code>Ni&lt;/code>)&lt;/li>
&lt;li>&lt;code>Co&lt;/code> -&amp;gt; &lt;strong>the number of classes that the dataset has!&lt;/strong>&lt;/li>
&lt;li>&lt;code>Ho&lt;/code> -&amp;gt; the height of the image (which is the same as &lt;code>Hi&lt;/code> in almost all cases)&lt;/li>
&lt;li>&lt;code>Wo&lt;/code> -&amp;gt; the width of the image (which is the same as &lt;code>Wi&lt;/code> in almost all cases)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>The &lt;code>torchvision&lt;/code> models output an &lt;code>OrderedDict&lt;/code>, not a &lt;code>torch.Tensor&lt;/code>.
In &lt;code>.eval()&lt;/code> mode it has just one key, &lt;code>out&lt;/code>, whose value holds the output with shape &lt;code>[No x Co x Ho x Wo]&lt;/code>.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="implementation">Implementation&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">collections&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">namedtuple&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">PIL&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">Image&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">matplotlib.pyplot&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">plt&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">numpy&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">np&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">torch&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">torchvision&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">models&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">torchvision.transforms&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">T&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># ImageNet mean and standard deviation&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">MEAN&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="mf">0.485&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mf">0.456&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mf">0.406&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">STD&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="mf">0.229&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mf">0.224&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mf">0.225&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># # Pascal VOC dataset segmentation&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">VocClass&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">namedtuple&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;VocClass&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;name&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;id&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;color&amp;#34;&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">classes&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">VocClass&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;background&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">)),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">VocClass&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;aeroplane&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">128&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">)),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">VocClass&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;bicycle&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">128&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">)),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">VocClass&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;bird&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">3&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">128&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">128&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">)),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">VocClass&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;boat&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">4&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">128&lt;/span>&lt;span class="p">)),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">VocClass&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;bottle&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">5&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">128&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">128&lt;/span>&lt;span class="p">)),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">VocClass&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;bus&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">6&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">128&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">128&lt;/span>&lt;span class="p">)),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">VocClass&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;car&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">7&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">128&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">128&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">128&lt;/span>&lt;span class="p">)),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">VocClass&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;cat&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">8&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">64&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">)),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">VocClass&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;chair&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">9&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">192&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">)),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">VocClass&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;cow&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">10&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">64&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">128&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">)),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">VocClass&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;dining table&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">11&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">192&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">128&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">)),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">VocClass&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;dog&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">12&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">64&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">128&lt;/span>&lt;span class="p">)),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">VocClass&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;horse&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">13&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">192&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">128&lt;/span>&lt;span class="p">)),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">VocClass&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;motorbike&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">14&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">64&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">128&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">128&lt;/span>&lt;span class="p">)),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">VocClass&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;person&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">15&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">192&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">128&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">128&lt;/span>&lt;span class="p">)),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">VocClass&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;potted plant&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">16&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">64&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">)),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">VocClass&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;sheep&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">17&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">128&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">64&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">)),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">VocClass&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;sofa&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">18&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">192&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">)),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">VocClass&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;train&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">19&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">128&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">192&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">)),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">VocClass&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;tv/monitor&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">20&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">64&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">128&lt;/span>&lt;span class="p">)),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">decode_seg_map&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> Convert a 2D segmentation map [H x W] of class indices into an [H x W x 3] RGB image
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># Create empty 2D matrices for all 3 channels of an image&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">r&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">zeros_like&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">astype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">uint8&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">g&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">zeros_like&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">astype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">uint8&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">b&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">zeros_like&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">astype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">uint8&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">class_&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="n">classes&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># Get the indexes in the image where that particular class label is present&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">idx&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="n">image&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="n">class_&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">id&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># Assign the corresponding color to those pixels&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">r&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">idx&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">g&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">idx&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">b&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">idx&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">class_&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">color&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># Stack the 3 separate channels to form an RGB image&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">rgb_mask&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">stack&lt;/span>&lt;span class="p">([&lt;/span>&lt;span class="n">r&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">g&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">b&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">axis&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">rgb_mask&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">show_img&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">img&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">imshow&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">img&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;off&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">show&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">segment&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">model&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">img_path&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">show_original_img&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">device&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;cuda&amp;#34;&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">img&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">Image&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">open&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">img_path&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">show_original_img&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">show_img&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">img&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">transform&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">T&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Compose&lt;/span>&lt;span class="p">([&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">T&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Resize&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">640&lt;/span>&lt;span class="p">),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># T.CenterCrop(224),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">T&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ToTensor&lt;/span>&lt;span class="p">(),&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">T&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Normalize&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">mean&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">MEAN&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">std&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">STD&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">input&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">transform&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">img&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">unsqueeze&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">to&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">device&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">output&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">model&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">to&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">device&lt;/span>&lt;span class="p">)(&lt;/span>&lt;span class="nb">input&lt;/span>&lt;span class="p">)[&lt;/span>&lt;span class="s2">&amp;#34;out&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">seg_map&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">torch&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">argmax&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">output&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">squeeze&lt;/span>&lt;span class="p">(),&lt;/span> &lt;span class="n">dim&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">detach&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">cpu&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">numpy&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">mask&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">decode_seg_map&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">seg_map&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">show_img&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">mask&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="err">!&lt;/span>&lt;span class="n">wget&lt;/span> &lt;span class="o">-&lt;/span>&lt;span class="n">nv&lt;/span> &lt;span class="s2">&amp;#34;https://www.learnopencv.com/wp-content/uploads/2021/01/person-segmentation.jpeg&amp;#34;&lt;/span> &lt;span class="o">-&lt;/span>&lt;span class="n">O&lt;/span> &lt;span class="n">person&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">png&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">fcn&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">models&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">segmentation&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">fcn_resnet101&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">pretrained&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">eval&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">segment&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">fcn&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;./person.png&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/person.png" alt="person">&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/person_sem_seg.png" alt="person_sem_seg">&lt;/p>
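&lt;p>The per-class loop in &lt;code>decode_seg_map&lt;/code> works, but the same id-to-color mapping can be done in a single NumPy fancy-indexing step. The sketch below is a hedged alternative, not the code used above: &lt;code>voc_colors&lt;/code> is abbreviated to three entries for illustration, and in practice it would hold all 21 colors from the &lt;code>classes&lt;/code> list in id order.&lt;/p>

```python
import numpy as np

# Abbreviated stand-in for the full VOC palette (first three classes only).
voc_colors = [
    (0, 0, 0),      # background, id 0
    (128, 0, 0),    # aeroplane, id 1
    (0, 128, 0),    # bicycle, id 2
]

def decode_seg_map_fast(seg_map, colors):
    """Map a 2D [H x W] array of class ids to an [H x W x 3] RGB image."""
    palette = np.array(colors, dtype=np.uint8)  # [num_class, 3] lookup table
    return palette[seg_map]  # fancy indexing maps every pixel id to its color

mask = decode_seg_map_fast(np.array([[0, 1], [2, 1]]), voc_colors)
```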
&lt;h2 id="reference">Reference&lt;/h2>
&lt;p>&lt;a href="https://colab.research.google.com/github/spmallick/learnopencv/blob/master/PyTorch-Segmentation-torchvision/intro-seg.ipynb#scrollTo=5GA_GNohUHnR&amp;amp;uniqifier=1">intro_seg.ipynb&lt;/a>&lt;/p></description></item></channel></rss>