<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Logistic Regression | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/logistic-regression/</link><atom:link href="https://haobin-tan.netlify.app/tags/logistic-regression/index.xml" rel="self" type="application/rss+xml"/><description>Logistic Regression</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Mon, 03 Aug 2020 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>Logistic Regression</title><link>https://haobin-tan.netlify.app/tags/logistic-regression/</link></image><item><title>Logistic Regression: Basics</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/logistic-regression/logistic-regression/</link><pubDate>Mon, 13 Jul 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/logistic-regression/logistic-regression/</guid><description>&lt;p>💡 &lt;strong>Use regression algorithm for classification&lt;/strong>&lt;/p>
&lt;p>Logistic regression: &lt;strong>estimate the probability that an instance belongs to a particular class&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>If the estimated probability is &lt;strong>greater than 50%&lt;/strong>, then the model predicts that the instance belongs to that class (called the &lt;strong>positive&lt;/strong> class, labeled “1”),&lt;/li>
&lt;li>or else it predicts that it does not (i.e., it belongs to the &lt;strong>negative&lt;/strong> class, labeled “0”).&lt;/li>
&lt;/ul>
&lt;p>This makes it a &lt;strong>binary&lt;/strong> classifier.&lt;/p>
&lt;h2 id="logistic--sigmoid-function">Logistic / Sigmoid function&lt;/h2>
&lt;img src="https://upload.wikimedia.org/wikipedia/commons/5/53/Sigmoid-function-2.svg" style="zoom:60%; background-color:white">
&lt;p>$\sigma(t)=\frac{1}{1+\exp (-t)}$&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Bounded: $\sigma(t) \in (0, 1)$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Symmetric: $1 - \sigma(t) = \sigma(-t)$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Derivative: $\sigma^{\prime}(t)=\sigma(t)(1-\sigma(t))$&lt;/p>
&lt;/li>
&lt;/ul>
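&lt;p>These three properties can be checked numerically. A minimal Python sketch (the sigmoid helper below is our own, written from the definition above):&lt;/p>

```python
import math

def sigmoid(t):
    """Logistic / sigmoid function: sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

# Bounded: sigma(t) lies strictly between 0 and 1
assert sigmoid(-10) > 0.0
assert 1.0 > sigmoid(10)

# Symmetric: 1 - sigma(t) = sigma(-t)
assert math.isclose(1 - sigmoid(2.0), sigmoid(-2.0))

# Derivative: sigma'(t) = sigma(t) * (1 - sigma(t)), checked by central differences
t, h = 0.7, 1e-6
numeric = (sigmoid(t + h) - sigmoid(t - h)) / (2 * h)
analytic = sigmoid(t) * (1 - sigmoid(t))
assert math.isclose(numeric, analytic, abs_tol=1e-8)
```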
&lt;h2 id="estimating-probabilities-and-making-prediction">Estimating probabilities and making prediction&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>Computes a weighted sum of the input features (plus a bias term)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Outputs the logistic of this result&lt;/p>
&lt;p>$\hat{p}=h_{\theta}(\mathbf{x})=\sigma\left(\mathbf{x}^{\mathrm{T}} \boldsymbol{\theta}\right)$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Prediction:&lt;/p>
$$
\hat{y} = \begin{cases} 0 &amp; \text{ if } \hat{p}&lt;0.5\left(\Leftrightarrow \mathbf{x}^{\mathrm{T}} \boldsymbol{\theta}&lt;0\right) \\\\
1 &amp; \text{ if } \hat{p} \geq 0.5\left(\Leftrightarrow \mathbf{x}^{\mathrm{T}} \boldsymbol{\theta} \geq 0\right)\end{cases}
$$
&lt;/li>
&lt;/ol>
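&lt;p>The three steps above can be sketched in Python (hypothetical helper names; the bias term is folded into theta as the weight of a constant feature $x_0 = 1$):&lt;/p>

```python
import math

def predict_proba(x, theta):
    """Steps 1 + 2: weighted sum x . theta (bias folded in as x_0 = 1),
    then the logistic of the result."""
    z = sum(x_j * t_j for x_j, t_j in zip(x, theta))
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, theta):
    """Step 3: y-hat = 1 iff p-hat >= 0.5, i.e. iff x . theta >= 0."""
    return 1 if predict_proba(x, theta) >= 0.5 else 0
```

For example, `predict([1.0, 2.0], [0.5, -1.0])` has weighted sum $z = -1.5$, so $\hat{p} \approx 0.18$ and the prediction is the negative class.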
&lt;h2 id="train-and-cost-function">Train and cost function&lt;/h2>
&lt;p>Objective of training: to set the parameter vector $\boldsymbol{\theta}$ so that the model estimates:&lt;/p>
&lt;ul>
&lt;li>high probabilities ($\geq 0.5$) for positive instances ($y=1$)&lt;/li>
&lt;li>low probabilities ($&lt; 0.5$) for negative instances ($y=0$)&lt;/li>
&lt;/ul>
&lt;h3 id="cost-function-of-a-single-training-instance">Cost function of a single training instance:&lt;/h3>
$$
c(\boldsymbol{\theta}) = \begin{cases} -\log (\hat{p}) &amp; \text{ if } y=1 \\\\
-\log (1-\hat{p}) &amp; \text{ if } y=0\end{cases}
$$
&lt;blockquote>
&lt;img src="https://miro.medium.com/max/1621/1*_NeTem-yeZ8Pr9cVUoi_HA.png" style="zoom:30%; background-color:white">
&lt;ul>
&lt;li>Actual label: $y=1$, Misclassification: $\hat{y} = 0 \Leftrightarrow$ $\hat{p} = h_{\boldsymbol{\theta}}(x)$ close to 0 $\Leftrightarrow c(\boldsymbol{\theta})$ large&lt;/li>
&lt;li>Actual label: $y=0$, Misclassification: $\hat{y} = 1 \Leftrightarrow$ $\hat{p} = h_{\boldsymbol{\theta}}(x)$ close to 1 $\Leftrightarrow c(\boldsymbol{\theta})$ large&lt;/li>
&lt;/ul>
&lt;/blockquote>
&lt;h3 id="the-cost-function-over-the-whole-training-set">The cost function over the whole training set&lt;/h3>
&lt;p>Simply the average cost over all training instances (combining the two cases above into a single expression):&lt;/p>
&lt;p>$\begin{aligned} J(\boldsymbol{\theta}) &amp;=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \log \left(\hat{p}^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-\hat{p}^{(i)}\right)\right] \\\\ &amp;=\frac{1}{m} \sum_{i=1}^{m}\left[-y^{(i)} \log \left(\hat{p}^{(i)}\right)-\left(1-y^{(i)}\right) \log \left(1-\hat{p}^{(i)}\right)\right] \end{aligned}$&lt;/p>
&lt;blockquote>
&lt;ul>
&lt;li>$y^{(i)} =1:-y^{(i)} \log \left(\hat{p}^{(i)}\right)-\left(1-y^{(i)}\right) \log \left(1-\hat{p}^{(i)}\right)=-\log \left(\hat{p}^{(i)}\right)$&lt;/li>
&lt;li>$y^{(i)} =0:-y^{(i)} \log \left(\hat{p}^{(i)}\right)-\left(1-y^{(i)}\right) \log \left(1-\hat{p}^{(i)}\right)=-\log \left(1-\hat{p}^{(i)}\right)$
(Exactly the same as $c(\boldsymbol{\theta})$ for a single instance above 👏)&lt;/li>
&lt;/ul>
&lt;/blockquote>
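&lt;p>A minimal Python sketch of $J(\boldsymbol{\theta})$ (our own helper names; each row of the input is assumed to carry the bias feature $x_0 = 1$):&lt;/p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(X, y, theta):
    """Average cross-entropy J(theta) over the m training instances."""
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        p_hat = sigmoid(sum(a * b for a, b in zip(x_i, theta)))
        total += -y_i * math.log(p_hat) - (1 - y_i) * math.log(1 - p_hat)
    return total / m
```

With $\boldsymbol{\theta} = \mathbf{0}$ every $\hat{p}^{(i)} = 0.5$, so $J = \log 2 \approx 0.693$ regardless of the labels.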
&lt;h3 id="training">Training&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>No closed-form equation 🤪&lt;/p>
&lt;/li>
&lt;li>
&lt;p>But it is convex so Gradient Descent (or any other optimization algorithm) is guaranteed to find the global minimum&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Partial derivatives of the cost function with regards to the $j$-th model parameter $\theta_j$:&lt;/p>
$$
\frac{\partial}{\partial \theta_{j}} J(\boldsymbol{\theta})=\frac{1}{m} \displaystyle \sum_{i=1}^{m}\left(\sigma\left(\boldsymbol{\theta}^{T} \mathbf{x}^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}
$$
&lt;/li>
&lt;/ul></description></item><item><title>Logistic Regression: Probabilistic view</title><link>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/logistic-regression/logistic-regression-in-probabilistic-view/</link><pubDate>Mon, 13 Jul 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/machine-learning/classification/logistic-regression/logistic-regression-in-probabilistic-view/</guid><description>&lt;p>Class label:&lt;/p>
$$
y_i \in \\{0, 1\\}
$$
&lt;p>Conditional probability distribution of the class label is&lt;/p>
$$
\begin{aligned}
p(y=1|\boldsymbol{x}) &amp;= \sigma(\boldsymbol{w}^T\boldsymbol{x}+b) \\\\
p(y=0|\boldsymbol{x}) &amp;= 1 - \sigma(\boldsymbol{w}^T\boldsymbol{x}+b)
\end{aligned}
$$
&lt;p>
with&lt;/p>
$$
\sigma(x) = \frac{1}{1+\operatorname{exp}(-x)}
$$
&lt;p>This is a &lt;strong>conditional Bernoulli distribution&lt;/strong>. Therefore, the probability can be represented as&lt;/p>
$$
\begin{array}{ll}
p(y|\boldsymbol{x}) &amp;= p(y=1|\boldsymbol{x})^y p(y=0|\boldsymbol{x})^{1-y} \\\\
&amp; = \sigma(\boldsymbol{w}^T\boldsymbol{x}+b)^y (1 - \sigma(\boldsymbol{w}^T\boldsymbol{x}+b))^{1-y}
\end{array}
$$
&lt;p>The &lt;strong>conditional Bernoulli log-likelihood&lt;/strong> is (assuming training data is i.i.d)&lt;/p>
$$
\begin{aligned}
\operatorname{loglik}(\boldsymbol{w}, \mathcal{D})
&amp;= \log(\operatorname{lik}(\boldsymbol{w}, \mathcal{D})) \\\\
&amp;= \log(\displaystyle\prod_i p(y_i|\boldsymbol{x}_i)) \\\\
&amp;= \log\left(\displaystyle\prod_i \sigma(\boldsymbol{w}^T\boldsymbol{x}_i+b)^{y_i} \left(1 - \sigma(\boldsymbol{w}^T\boldsymbol{x}_i+b)\right)^{1-y_i}\right) \\\\
&amp;= \displaystyle\sum_i y_i\log\left(\sigma(\boldsymbol{w}^T\boldsymbol{x}_i+b)\right)+ (1-y_i)\log\left(1 - \sigma(\boldsymbol{w}^T\boldsymbol{x}_i+b)\right)
\end{aligned}
$$
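&lt;p>A minimal Python sketch of this conditional Bernoulli log-likelihood (hypothetical helper names, following the formula above):&lt;/p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(X, y, w, b):
    """Conditional Bernoulli log-likelihood of the labels under the model
    p(y=1|x) = sigma(w . x + b), assuming i.i.d. training data."""
    ll = 0.0
    for x_i, y_i in zip(X, y):
        p = sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b)
        ll += y_i * math.log(p) + (1 - y_i) * math.log(1 - p)
    return ll
```

With $\boldsymbol{w} = \mathbf{0}$ and $b = 0$ every instance contributes $\log 0.5$, so two instances give $2\log 0.5 \approx -1.386$.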
&lt;p>Let&lt;/p>
$$
\tilde{\boldsymbol{w}}=\left(\begin{array}{c}1 \\\\ \boldsymbol{w} \end{array}\right), \quad \tilde{\boldsymbol{x}_i}=\left(\begin{array}{c}b \\\\ \boldsymbol{x}_i \end{array}\right)
$$
&lt;p>Then:&lt;/p>
$$
\operatorname{loglik}(\boldsymbol{w}, \mathcal{D}) = \operatorname{loglik}(\tilde{\boldsymbol{w}}, \mathcal{D}) = \displaystyle\sum_i y_i\log\left(\sigma(\tilde{\boldsymbol{w}}^T\tilde{\boldsymbol{x}_i})\right)+ (1-y_i)\log\left(1 - \sigma(\tilde{\boldsymbol{w}}^T\tilde{\boldsymbol{x}_i})\right)
$$
&lt;p>Our objective is to find the $\tilde{\boldsymbol{w}}^*$ that &lt;strong>maximize the log-likelihood&lt;/strong>, i.e.&lt;/p>
$$
\begin{array}{cl}
\tilde{\boldsymbol{w}}^* &amp;= \underset{\tilde{\boldsymbol{w}}}{\arg \max} \quad \operatorname{loglik}(\tilde{\boldsymbol{w}}, \mathcal{D}) \\\\
&amp;= \underset{\tilde{\boldsymbol{w}}}{\arg \min} \quad -\operatorname{loglik}(\tilde{\boldsymbol{w}}, \mathcal{D})\\\\
&amp;= \underset{\tilde{\boldsymbol{w}}}{\arg \min} \quad \underbrace{-\left(\displaystyle\sum_i y_i\log\left(\sigma(\tilde{\boldsymbol{w}}^T\tilde{\boldsymbol{x}_i})\right) + (1-y_i)\log\left(1 - \sigma(\tilde{\boldsymbol{w}}^T\tilde{\boldsymbol{x}_i})\right)\right)}_{\text{cross-entropy loss}}
\end{array}
$$
&lt;p>In other words, &lt;strong>maximizing the (log-)likelihood is the same as minimizing the cross entropy.&lt;/strong>&lt;/p></description></item><item><title>Generative and Discriminative Classifiers</title><link>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/logistic-regression/generative-discriminative-classifier/</link><pubDate>Mon, 03 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/logistic-regression/generative-discriminative-classifier/</guid><description>&lt;p>The most important difference between naive Bayes and logistic regression is that&lt;/p>
&lt;ul>
&lt;li>logistic regression is a &lt;strong>discriminative&lt;/strong> classifier while&lt;/li>
&lt;li>naive Bayes is a &lt;strong>generative&lt;/strong> classifier.&lt;/li>
&lt;/ul>
&lt;p>Consider a visual metaphor: imagine we’re trying to distinguish dog images from cat images.&lt;/p>
&lt;ul>
&lt;li>Generative model
&lt;ul>
&lt;li>Try to understand what dogs look like and what cats look like&lt;/li>
&lt;li>You might literally ask such a model to ‘generate’, i.e. draw, a dog&lt;/li>
&lt;li>Given a test image, the system then asks whether it’s the cat model or the dog model that better fits (is less surprised by) the image, and chooses that as its label.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Discriminative model
&lt;ul>
&lt;li>only trying to learn to distinguish the classes&lt;/li>
&lt;li>So maybe all the dogs in the training data are wearing collars and the cats aren’t. If that one feature neatly separates the classes, the model is satisfied. If you ask such a model what it knows about cats all it can say is that they don’t wear collars.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>More formally, recall that the &lt;a href="https://haobin-tan.netlify.app/docs/ai/natural-language-processing/naive-bayes-classification/naive-bayes-classifiers/">naive Bayes&lt;/a> classifier assigns a class $c$ to a document $d$ NOT by directly computing $p(c|d)$, but by computing a likelihood and a prior.
&lt;/p>
$$
\hat{c}=\underset{c \in C}{\operatorname{argmax}} \overbrace{P(d | c)}^{\text { likelihood }} \overbrace{P(c)}^{\text { prior }}
$$
&lt;ul>
&lt;li>
&lt;p>Generative model (like naive Bayes)&lt;/p>
&lt;ul>
&lt;li>Makes use of the likelihood term
&lt;ul>
&lt;li>Expresses how to generate the features of a document &lt;em>if we knew it was of class&lt;/em> $c$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Discriminative model&lt;/p>
&lt;ul>
&lt;li>attempts to directly compute $P(c|d)$&lt;/li>
&lt;li>It will learn to assign a high weight to document features that directly improve its ability to &lt;em>discriminate&lt;/em> between possible classes, even if it couldn’t generate an example of one of the classes.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="components-of-a-probabilistic-machine-learning-classifier">Components of a probabilistic machine learning classifier&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Training corpus of $M$ input/output pairs $(x^{(i)}, y^{(i)})$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>A &lt;strong>feature representation&lt;/strong> of the input&lt;/p>
&lt;ul>
&lt;li>For each input observation $x^{(i)}$, this will be a vector of features $[x_1, x_2, \dots, x_n]$
&lt;ul>
&lt;li>$x_{i}^{(j)}$: feature $i$ for input $x^{(j)}$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>A &lt;strong>classification function&lt;/strong> that computes $\hat{y}$, the estimated class, via $p(y|x)$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>An objective function for learning, usually involving minimizing error on training examples&lt;/p>
&lt;/li>
&lt;li>
&lt;p>An algorithm for optimizing the objective function.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Logistic regression has two phases:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>training&lt;/strong>: we train the system (specifically the weights $w$ and $b$) using stochastic gradient descent and the cross-entropy loss.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>test&lt;/strong>: Given a test example $x$ we compute $p(y|x)$ and return the higher probability label $y=1$ or $y=0$.&lt;/p>
&lt;/li>
&lt;/ul></description></item><item><title>Sigmoid</title><link>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/logistic-regression/sigmoid/</link><pubDate>Mon, 03 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/logistic-regression/sigmoid/</guid><description>&lt;h2 id="sigmoid-to-logistic-regression">Sigmoid to Logistic Regression&lt;/h2>
&lt;p>Consider a single input observation $x = [x_1, x_2, \dots, x_n]$&lt;/p>
&lt;p>The classifier output $y$ can be&lt;/p>
&lt;ul>
&lt;li>$1$: the observation is a member of the class&lt;/li>
&lt;li>$0$: the observation is NOT a member of the class&lt;/li>
&lt;/ul>
&lt;p>We want to know the &lt;strong>probability&lt;/strong> $P(y=1|x)$ that this observation is a member of the class.&lt;/p>
&lt;p>&lt;em>E.g.:&lt;/em>&lt;/p>
&lt;ul>
&lt;li>&lt;em>The decision is “positive sentiment” versus “negative sentiment”&lt;/em>&lt;/li>
&lt;li>&lt;em>the features represent counts of words in a document&lt;/em>&lt;/li>
&lt;li>&lt;em>$P(y=1|x)$ is the probability that the document has positive sentiment, and $P(y=0|x)$ is the probability that the document has negative sentiment.&lt;/em>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Logistic regression solves this task by learning, from a training set, a vector of weights and a bias term.&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Each &lt;strong>weight&lt;/strong> $w_i$ is a real number, and is associated with one of the input features $x_i$. The weight represents how important that input feature is to the classification decision; it can be&lt;/p>
&lt;ul>
&lt;li>positive (meaning the feature is associated with the class)&lt;/li>
&lt;li>negative (meaning the feature is NOT associated with the class).&lt;/li>
&lt;/ul>
&lt;p>&lt;em>E.g.: we might expect in a sentiment task the word &lt;u>awesome&lt;/u> to have a high positive weight, and &lt;u>abysmal&lt;/u> to have a very negative weight.&lt;/em>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Bias term $b$, also called the &lt;strong>intercept&lt;/strong>, is another real number that’s added to the weighted inputs.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>To make a decision on a test instance, the resulting single number $z$ expresses the weighted sum of the evidence for the class:
&lt;/p>
$$
\begin{array}{ll}
z &amp;=\left(\sum_{i=1}^{n} w_{i} x_{i}\right)+b \\\\
&amp; = w \cdot x + b \\\\
&amp; \in (-\infty, \infty)
\end{array}
$$
&lt;p>
(Note that $z$ is NOT a legal probability, since $z \notin [0, 1]$)&lt;/p>
&lt;p>To create a probability, we’ll pass $z$ through the &lt;strong>sigmoid&lt;/strong> function (also called &lt;strong>logistic function&lt;/strong>):
&lt;/p>
$$
y=\sigma(z)=\frac{1}{1+e^{-z}}
$$
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-05-28%2010.12.24-20200803141941368.png" alt="截屏2020-05-28 10.12.24" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="-advantages-of-sigmoid">👍 &lt;strong>Advantages of sigmoid&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>It takes a real-valued number and maps it into the range [0,1] (which is just what we want for a probability)&lt;/li>
&lt;li>It is nearly linear around 0 but has a sharp slope toward the ends, so it tends to squash outlier values toward 0 or 1.&lt;/li>
&lt;li>Differentiable $\Rightarrow$ handy for learning&lt;/li>
&lt;/ul>
&lt;p>To make it a probability, we just need to make sure that the two cases, $P(y=1)$ and $P(y=0)$, sum to 1:
&lt;/p>
$$
\begin{aligned}
P(y=1) &amp;=\sigma(w \cdot x+b) \\\\
&amp;=\frac{1}{1+e^{-(w \cdot x+b)}} \\\\
P(y=0) &amp;=1-\sigma(w \cdot x+b) \\\\
&amp;=1-\frac{1}{1+e^{-(w \cdot x+b)}} \\\\
&amp;=\frac{e^{-(w \cdot x+b)}}{1+e^{-(w \cdot x+b)}}
\end{aligned}
$$
&lt;p>
Now we have an algorithm that, given an instance $x$, computes the probability $P(y=1|x)$. For a test instance $x$, we say yes if the probability $P(y=1|x)$ is more than 0.5, and no otherwise. We call 0.5 the &lt;strong>decision boundary&lt;/strong>:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200803142100666.png" alt="image-20200803142100666" style="zoom:18%;" />
&lt;h2 id="example-sentiment-classification">Example: sentiment classification&lt;/h2>
&lt;p>Suppose we are doing binary sentiment classification on movie review text, and we would like to know whether to assign the sentiment class + or − to a review document $doc$.&lt;/p>
&lt;p>We’ll represent each input observation by the 6 features $x_1,...,x_6$ of the input shown in the following table&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-28%2010.32.22.png" alt="截屏2020-05-28 10.32.22" style="zoom:80%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-28%2010.32.56.png" alt="截屏2020-05-28 10.32.56" style="zoom:80%;" />
&lt;p>Assume for the moment that we’ve already learned a real-valued weight for each of these features, and that the 6 weights corresponding to the 6 features are $w= [2.5,−5.0,−1.2,0.5,2.0,0.7]$, while $b = 0.1$.&lt;/p>
&lt;ul>
&lt;li>The weight $w_1$, for example, indicates how important a feature the number of positive lexicon words (&lt;em>great&lt;/em>, &lt;em>nice&lt;/em>, &lt;em>enjoyable&lt;/em>, etc.) is to a positive sentiment decision, while $w_2$ tells us the importance of negative lexicon words. Note that $w_1 = 2.5$ is positive, while $w_2 = −5.0$, meaning that negative words are negatively associated with a positive sentiment decision, and are &lt;strong>about twice as important as positive words&lt;/strong>.&lt;/li>
&lt;/ul>
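&lt;p>This worked example is small enough to reproduce numerically. A Python sketch using the feature values and weights given above:&lt;/p>

```python
import math

# Feature values and weights from the worked example above
x = [3, 2, 1, 3, 0, 4.19]
w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]
b = 0.1

z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b  # weighted sum of the evidence
p_pos = 1.0 / (1.0 + math.exp(-z))                # P(+|x) = sigma(z)
p_neg = 1.0 - p_pos                               # P(-|x)

assert math.isclose(z, 0.833)
assert round(p_pos, 2) == 0.70 and round(p_neg, 2) == 0.30
```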
&lt;p>Given these 6 features and the input review $x$, $P(+|x)$ and $P(-|x)$ can be computed:
&lt;/p>
$$
\begin{aligned}
p(+| x)=P(Y=1 | x) &amp;=\sigma(w \cdot x+b) \\\\
&amp;=\sigma([2.5,-5.0,-1.2,0.5,2.0,0.7] \cdot[3,2,1,3,0,4.19]+0.1) \\\\
&amp;=\sigma(0.833) \\\\
&amp;=0.70 \\\\
p(-| x)=P(Y=0 | x) &amp;=1-\sigma(w \cdot x+b) \\\\
&amp;=0.30
\end{aligned}
$$
&lt;p>
$0.70 > 0.50 \Rightarrow$ This sentiment is positive ($+$).&lt;/p></description></item><item><title>The Cross-Entropy Loss Function</title><link>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/logistic-regression/cross-entropy/</link><pubDate>Mon, 03 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/logistic-regression/cross-entropy/</guid><description>&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>We need a loss function that expresses, &lt;strong>for an observation $x$, how close the classifier output ($\hat{y}=\sigma(w \cdot x+b)$) is to the correct output ($y$, which is $0$ or $1$)&lt;/strong>:
&lt;/p>
$$
L(\hat{y}, y)= \text{How much } \hat{y} \text{ differs from the true } y
$$
&lt;p>
This loss function should prefer the correct class labels of the training examples to be &lt;em>more likely&lt;/em>.&lt;/p>
&lt;p>👆 This is called &lt;strong>conditional maximum likelihood estimation&lt;/strong>: we choose the parameters $w, b$ that maximize the log probability of the true $y$ labels in the training data given the observations $x$. The resulting loss function is the &lt;em>negative&lt;/em> log likelihood loss, generally called the &lt;strong>cross-entropy loss&lt;/strong>.&lt;/p>
&lt;h2 id="derivation">Derivation&lt;/h2>
&lt;p>Task: for a single observation $x$, learn weights that maximize $p(y|x)$, the probability of the correct label&lt;/p>
&lt;p>There are only two discrete outcomes ($1$ or $0$)&lt;/p>
&lt;p>$\Rightarrow$ This is a &lt;strong>Bernoulli distribution&lt;/strong>. The probability $p(y|x)$ for one observation can be expressed as:
&lt;/p>
$$
p(y | x)=\hat{y}^{y}(1-\hat{y})^{1-y}
$$
&lt;ul>
&lt;li>$y=1, p(y|x)=\hat{y}$&lt;/li>
&lt;li>$y=0, p(y|x)=1-\hat{y}$&lt;/li>
&lt;/ul>
&lt;p>Now we take the log of both sides. This will turn out to be handy mathematically, and doesn’t hurt us (whatever values maximize a probability will also maximize the log of the probability):
&lt;/p>
$$
\begin{aligned}
\log p(y | x) &amp;=\log \left[\hat{y}^{y}(1-\hat{y})^{1-y}\right] \\\\
&amp;=y \log \hat{y}+(1-y) \log (1-\hat{y})
\end{aligned}
$$
&lt;p>
👆 This is the log likelihood that should be maximized.&lt;/p>
&lt;p>In order to turn this into a loss function (something that we need to minimize), we’ll just flip the sign. The result is the &lt;strong>cross-entropy loss&lt;/strong>:
&lt;/p>
$$
L_{C E}(\hat{y}, y)=-\log p(y | x)=-[y \log \hat{y}+(1-y) \log (1-\hat{y})]
$$
&lt;p>
Recall that $\hat{y}=\sigma(w \cdot x+b)$:
&lt;/p>
$$
L_{C E}(w, b)=-[y \log \sigma(w \cdot x+b)+(1-y) \log (1-\sigma(w \cdot x+b))]
$$
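&lt;p>The loss above translates directly into Python (a minimal sketch, with our own helper names):&lt;/p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy_loss(x, y, w, b):
    """L_CE(w, b) for a single observation x with gold label y (0 or 1)."""
    y_hat = sigmoid(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
```

When the model is maximally unsure ($\hat{y} = 0.5$) the loss is $-\log 0.5 = \log 2 \approx 0.693$ for either gold label.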
&lt;h2 id="example">Example&lt;/h2>
&lt;p>Let’s see if this loss function does the right thing for the example above.&lt;/p>
&lt;p>We want the loss to be&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>smaller&lt;/strong> if the model’s estimate is &lt;strong>close to correct&lt;/strong>, and&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>bigger&lt;/strong> if the model is &lt;strong>confused&lt;/strong>.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Let’s suppose the correct gold label for the sentiment example above is positive, i.e.: $y=1$.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>In this case our model is doing well 👏, since it gave the example a higher probability of being positive ($0.70$) than negative ($0.30$).&lt;/p>
&lt;p>If we plug $\sigma(w \cdot x+b)=0.70$ and $y=1$ into the cross-entropy loss, we get&lt;/p>
&lt;/li>
&lt;/ul>
$$
\begin{aligned}
L_{C E}(w, b) &amp;=-[y \log \sigma(w \cdot x+b)+(1-y) \log (1-\sigma(w \cdot x+b))] \\\\
&amp;=-[\log \sigma(w \cdot x+b)] \\\\
&amp;=-\log (0.70) \\\\
&amp;=0.36
\end{aligned}
$$
&lt;p>By contrast, let&amp;rsquo;s pretend instead that the example was negative, i.e.: $y=0$.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>In this case our model is confused 🤪, and we’d want the loss to be higher.&lt;/p>
&lt;p>If we plug $y=0$ and $1-\sigma(w \cdot x+b)=0.30$ into the cross-entropy loss, we get
&lt;/p>
$$
\begin{aligned}
L_{C E}(w, b) &amp;=-[y \log \sigma(w \cdot x+b)+(1-y) \log (1-\sigma(w \cdot x+b))] \\\\
&amp;= -[\log (1-\sigma(w \cdot x+b))] \\\\
&amp;=-\log (0.30) \\\\
&amp;= 1.20
\end{aligned}
$$
&lt;/li>
&lt;/ul>
&lt;p>The loss for the first classifier ($0.36$) is clearly smaller than the loss for the second classifier ($1.20$).&lt;/p>
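&lt;p>Both loss values can be checked in two lines of Python:&lt;/p>

```python
import math

y_hat = 0.70  # the model's P(y=1|x) for the sentiment example

loss_if_gold_positive = -math.log(y_hat)      # y = 1: loss = -log(0.70)
loss_if_gold_negative = -math.log(1 - y_hat)  # y = 0: loss = -log(0.30)

assert round(loss_if_gold_positive, 2) == 0.36
assert round(loss_if_gold_negative, 2) == 1.20
```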
&lt;h2 id="why-minimizing-this-negative-log-probability-works">Why minimizing this negative log probability works?&lt;/h2>
&lt;p>A perfect classifier would assign probability $1$ to the correct outcome and probability $0$ to the incorrect outcome. That means:&lt;/p>
&lt;ul>
&lt;li>the higher $\hat{y}$ (the closer it is to 1), the better the classifier;&lt;/li>
&lt;li>the lower $\hat{y}$ is (the closer it is to 0), the worse the classifier.&lt;/li>
&lt;/ul>
&lt;p>The negative log of this probability is a convenient loss metric since it goes from 0 (negative log of 1, no loss) to infinity (negative log of 0, infinite loss). This loss function also ensures that as the probability of the correct answer is maximized, the probability of the incorrect answer is minimized; since the two sum to one, any increase in the probability of the correct answer is coming at the expense of the incorrect answer.&lt;/p></description></item><item><title>Learning in Logistic Regression</title><link>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/logistic-regression/learning-in-log-reg/</link><pubDate>Mon, 03 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/logistic-regression/learning-in-log-reg/</guid><description>&lt;p>Logistic regression is an instance of supervised classification in which we know the correct label $y$ (either 0 or 1) for each observation $x$.&lt;/p>
&lt;p>The system produces/predicts $\hat{y}$, the estimate for the true $y$. We want to learn parameters ($w$ and $b$) that make $\hat{y}$ for each training observation &lt;strong>as close as possible&lt;/strong> to the true $y$. 💪&lt;/p>
&lt;p>This requires two components:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>loss function&lt;/strong>: also called &lt;strong>cost function&lt;/strong>, a metric that measures the distance between the system output and the gold output
&lt;ul>
&lt;li>The loss function that is commonly used for logistic regression and also for neural networks is &lt;strong>&lt;a href="https://haobin-tan.netlify.app/docs/ai/natural-language-processing/logistic-regression/cross-entropy/">cross-entropy loss&lt;/a>&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Optimization algorithm&lt;/strong> for iteratively updating the weights so as to minimize this loss function
&lt;ul>
&lt;li>Standard algorithm: &lt;a href="https://haobin-tan.netlify.app/docs/ai/natural-language-processing/logistic-regression/gradient-descent/">&lt;strong>gradient descent&lt;/strong>&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Gradient Descent</title><link>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/logistic-regression/gradient-descent/</link><pubDate>Mon, 03 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/logistic-regression/gradient-descent/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>🎯 &lt;strong>Goal with gradient descent: find the optimal weights that minimize the loss function we&amp;rsquo;ve defined for the model.&lt;/strong>&lt;/p>
&lt;p>From now on, we’ll explicitly represent the fact that the loss function $L$ is parameterized by the weights $\theta$ (in the case of logistic regression $\theta=(w, b)$):
&lt;/p>
$$
\hat{\theta}=\underset{\theta}{\operatorname{argmin}} \frac{1}{m} \sum_{i=1}^{m} L_{C E}\left(y^{(i)}, x^{(i)} ; \theta\right)
$$
&lt;p>
Gradient descent finds a minimum of a function by figuring out in which direction (in the space of the parameters $\theta$) the function’s slope is rising the most steeply, and moving in the &lt;em>&lt;strong>opposite&lt;/strong>&lt;/em> direction.&lt;/p>
&lt;blockquote>
&lt;p>💡 Intuition&lt;/p>
&lt;p>if you are hiking in a canyon and trying to descend most quickly down to the river at the bottom, you might look around yourself 360 degrees, find the direction where the ground is sloping the steepest, and walk downhill in that direction.&lt;/p>
&lt;/blockquote>
&lt;p>For logistic regression, this loss function is conveniently &lt;strong>convex&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Just one minimum&lt;/li>
&lt;li>No local minima to get stuck in&lt;/li>
&lt;/ul>
&lt;p>$\Rightarrow$ Gradient descent starting from any point is guaranteed to find the minimum. 👏&lt;/p>
&lt;p>Visualization:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-28%2022.32.50.png" alt="截屏2020-05-28 22.32.50" style="zoom:70%;" />
&lt;p>The magnitude of the amount to move in gradient descent is the value of the slope $\frac{d}{d w} f(x ; w)$ weighted by a &lt;strong>learning rate&lt;/strong> $\eta$. A higher (faster) learning rate means that we should move &lt;em>w&lt;/em> more on each step.&lt;/p>
&lt;p>In the single-variable example above, the change we make in our parameter is
&lt;/p>
$$
w^{t+1}=w^{t}-\eta \frac{d}{d w} f(x ; w)
$$
&lt;p>
In $N$-dimensional space, the gradient is a vector that expresses the directional components of the sharpest slope along each of those $N$ dimensions.&lt;/p>
&lt;p>Visualization (E.g., $N=2$):&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-28%2022.41.23.png" alt="截屏2020-05-28 22.41.23" style="zoom:80%;" />
&lt;p>In each dimension $w_i$, we express the slope as a &lt;strong>partial derivative&lt;/strong> $\frac{\partial}{\partial w_i}$ of the loss function. The gradient is defined as a vector of these partials:
&lt;/p>
$$
\nabla_{\theta} L(f(x ; \theta), y)=\left[\begin{array}{c}
\frac{\partial}{\partial w_{1}} L(f(x ; \theta), y) \\\\
\frac{\partial}{\partial w_{2}} L(f(x ; \theta), y) \\\\
\vdots \\\\
\frac{\partial}{\partial w_{n}} L(f(x ; \theta), y)
\end{array}\right]
$$
&lt;p>
Thus, the change of $\theta$ is:
&lt;/p>
$$
\theta_{t+1}=\theta_{t}-\eta \nabla L(f(x ; \theta), y)
$$
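&lt;p>The update rule in action, on a toy single-variable problem (our own illustrative function $f(w) = (w-3)^2$, not from the post):&lt;/p>

```python
import math

# Minimize f(w) = (w - 3)^2 with gradient descent; df/dw = 2 * (w - 3)
eta = 0.1  # learning rate
w = 0.0    # starting point
for _ in range(100):
    grad = 2 * (w - 3)
    w = w - eta * grad  # step in the opposite direction of the gradient

# w has converged to the minimum at w = 3
assert math.isclose(w, 3.0, abs_tol=1e-6)
```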
&lt;h2 id="the-gradient-for-logistic-regression">The gradient for Logistic Regression&lt;/h2>
&lt;p>For logistic regression, the cross-entropy loss function is
&lt;/p>
$$
L_{C E}(w, b)=-[y \log \sigma(w \cdot x+b)+(1-y) \log (1-\sigma(w \cdot x+b))]
$$
&lt;p>
The derivative of this loss function is:
&lt;/p>
$$
\frac{\partial L_{C E}(w, b)}{\partial w_{j}}=[\sigma(w \cdot x+b)-y] x_{j}
$$
&lt;blockquote>
&lt;p>For the derivation of the derivative above we need:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>derivative of $\ln(x)$:
&lt;/p>
$$
\frac{d}{d x} \ln (x)=\frac{1}{x}
$$
&lt;/li>
&lt;li>
&lt;p>derivative of the sigmoid:
&lt;/p>
$$
\frac{d \sigma(z)}{d z}=\sigma(z)(1-\sigma(z))
$$
&lt;/li>
&lt;li>
&lt;p>Chain rule of derivative: for $f(x)=u(v(x))$,
&lt;/p>
$$
\frac{d f}{d x}=\frac{d u}{d v} \cdot \frac{d v}{d x}
$$
&lt;/li>
&lt;/ul>
&lt;p>Now compute the derivative:
&lt;/p>
$$
\begin{aligned}
\frac{\partial L_{C E}(w, b)}{\partial w_{j}} &amp;=\frac{\partial}{\partial w_{j}}-[y \log \sigma(w \cdot x+b)+(1-y) \log (1-\sigma(w \cdot x+b))] \\\\
&amp;=-\frac{\partial}{\partial w_{j}} y \log \sigma(w \cdot x+b) - \frac{\partial}{\partial w_{j}}(1-y) \log [1-\sigma(w \cdot x+b)] \\\\
&amp;\overset{\text{chain rule}}{=} -\frac{y}{\sigma(w \cdot x+b)} \frac{\partial}{\partial w_{j}} \sigma(w \cdot x+b)-\frac{1-y}{1-\sigma(w \cdot x+b)} \frac{\partial}{\partial w_{j}}[1-\sigma(w \cdot x+b)] \\\\
&amp;= -\left[\frac{y}{\sigma(w \cdot x+b)}-\frac{1-y}{1-\sigma(w \cdot x+b)}\right] \frac{\partial}{\partial w_{j}} \sigma(w \cdot x+b)
\end{aligned}
$$
&lt;p>Now plug in the derivative of the sigmoid, and use the chain rule one more time:
&lt;/p>
$$
\begin{aligned}
\frac{\partial L_{C E}(w, b)}{\partial w_{j}} &amp;=-\left[\frac{y-\sigma(w \cdot x+b)}{\sigma(w \cdot x+b)[1-\sigma(w \cdot x+b)]}\right] \sigma(w \cdot x+b)[1-\sigma(w \cdot x+b)] \frac{\partial(w \cdot x+b)}{\partial w_{j}} \\\\
&amp;=-\left[\frac{y-\sigma(w \cdot x+b)}{\sigma(w \cdot x+b)[1-\sigma(w \cdot x+b)]}\right] \sigma(w \cdot x+b)[1-\sigma(w \cdot x+b)] x_{j} \\\\
&amp;=-[y-\sigma(w \cdot x+b)] x_{j} \\\\
&amp;=[\sigma(w \cdot x+b)-y] x_{j}
\end{aligned}
$$
&lt;/blockquote>
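&lt;p>The analytic gradient $[\sigma(w \cdot x+b)-y] x_{j}$ can be checked against a finite-difference approximation of the loss (a minimal Python sketch with arbitrary example values):&lt;/p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Cross-entropy loss L_CE for a single observation."""
    p = sigmoid(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Arbitrary example values for the check
x, y = [2.0, -1.0], 1
w, b = [0.3, -0.2], 0.1

# Analytic gradient for w_0 from the formula above: [sigma(w.x+b) - y] * x_0
p = sigmoid(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
analytic = (p - y) * x[0]

# Central-difference approximation of the same partial derivative
h = 1e-6
numeric = (loss([w[0] + h, w[1]], b, x, y) - loss([w[0] - h, w[1]], b, x, y)) / (2 * h)

assert math.isclose(analytic, numeric, abs_tol=1e-6)
```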
&lt;h2 id="stochastic-gradient-descent">Stochastic Gradient descent&lt;/h2>
&lt;p>Stochastic gradient descent is an online algorithm that minimizes the loss function by&lt;/p>
&lt;ul>
&lt;li>computing its gradient after each training example, and&lt;/li>
&lt;li>nudging $\theta$ in the right direction (the opposite direction of the gradient).&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-28%2023.01.53.png" alt="截屏2020-05-28 23.01.53" style="zoom:80%;" />
&lt;p>The learning rate $\eta$ is a (hyper-)parameter that must be adjusted.&lt;/p>
&lt;ul>
&lt;li>If it’s too high, the learner will take steps that are too large, overshooting the minimum of the loss function.&lt;/li>
&lt;li>If it’s too low, the learner will take steps that are too small, and take too long to get to the minimum.&lt;/li>
&lt;/ul>
&lt;p>It is common to begin the learning rate at a higher value, and then slowly decrease it, so that it is a function of the iteration $k$ of training.&lt;/p>
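&lt;p>A minimal sketch of such an SGD loop, with a decaying rate $\eta_k = \eta_0/(1+k)$ (the schedule, the starting rate, and the epoch count are illustrative choices, not prescribed ones):&lt;/p>

```python
# Sketch of stochastic gradient descent for logistic regression with a
# decaying learning rate (eta0 and the 1/(1+k) schedule are assumptions).
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd(data, n_features, eta0=0.5, epochs=100, seed=0):
    rng = random.Random(seed)
    w, b = [0.0] * n_features, 0.0
    k = 0
    for _ in range(epochs):
        rng.shuffle(data)                # visit examples in random order
        for x, y in data:
            eta = eta0 / (1.0 + k)       # decrease the rate as iteration k grows
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - y
            # move each weight opposite to its gradient component err * x_j
            w = [wj - eta * err * xj for wj, xj in zip(w, x)]
            b = b - eta * err
            k += 1
    return w, b
```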
&lt;h2 id="mini-batch-training">Mini-batch training&lt;/h2>
&lt;p>&lt;strong>Stochastic&lt;/strong> gradient descent: chooses a &lt;strong>single&lt;/strong> random example at a time, moving the weights so as to improve performance on that single example.&lt;/p>
&lt;ul>
&lt;li>Can result in very choppy movements&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Batch&lt;/strong> gradient descent: compute the gradient over the &lt;strong>entire&lt;/strong> dataset.&lt;/p>
&lt;ul>
&lt;li>Offers a superb estimate of which direction to move the weights&lt;/li>
&lt;li>Spends a lot of time processing every single example in the training set to compute this perfect direction.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Mini-batch&lt;/strong> gradient descent&lt;/p>
&lt;ul>
&lt;li>we train on a group of $m$ examples (perhaps 512, or 1024) that is less than the whole dataset.&lt;/li>
&lt;li>Has the advantage of computational efficiency
&lt;ul>
&lt;li>The mini-batches can easily be vectorized, with the mini-batch size chosen to fit the available computational resources.&lt;/li>
&lt;li>This allows us to process all the examples in one mini-batch in parallel and then accumulate the loss&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Define the mini-batch version of the cross-entropy loss function (assuming the training examples are independent):
&lt;/p>
$$
\begin{aligned}
\log p(\text {training labels}) &amp;=\log \prod\_{i=1}^{m} p\left(y^{(i)} | x^{(i)}\right) \\\\
&amp;=\sum\_{i=1}^{m} \log p\left(y^{(i)} | x^{(i)}\right) \\\\
&amp;=-\sum\_{i=1}^{m} L\_{C E}\left(\hat{y}^{(i)}, y^{(i)}\right)
\end{aligned}
$$
&lt;p>
The cost function for the mini-batch of $m$ examples is the &lt;strong>average loss&lt;/strong> for each example:
&lt;/p>
$$
\begin{aligned}
\operatorname{cost}(w, b) &amp;=\frac{1}{m} \sum\_{i=1}^{m} L_{C E}\left(\hat{y}^{(i)}, y^{(i)}\right) \\\\
&amp;=-\frac{1}{m} \sum\_{i=1}^{m} y^{(i)} \log \sigma\left(w \cdot x^{(i)}+b\right)+\left(1-y^{(i)}\right) \log \left(1-\sigma\left(w \cdot x^{(i)}+b\right)\right)
\end{aligned}
$$
&lt;p>
The mini-batch gradient is the average of the individual gradients:
&lt;/p>
$$
\frac{\partial \operatorname{cost}(w, b)}{\partial w_{j}}=\frac{1}{m} \sum_{i=1}^{m}\left[\sigma\left(w \cdot x^{(i)}+b\right)-y^{(i)}\right] x_{j}^{(i)}
$$
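&lt;p>The mini-batch cost and gradient above can be sketched as follows (plain Python loops for clarity; a real implementation would vectorize them, e.g. with numpy):&lt;/p>

```python
# Mini-batch cross-entropy cost and its gradient, averaged over the batch.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def minibatch_cost_and_grad(w, b, batch):
    """batch: list of (x, y) pairs; returns cost and gradients averaged over the batch."""
    m, n = len(batch), len(w)
    cost, grad_w, grad_b = 0.0, [0.0] * n, 0.0
    for x, y in batch:
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
        cost += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for j in range(n):
            grad_w[j] += (p - y) * x[j]   # per-example gradient, accumulated
        grad_b += p - y
    return cost / m, [g / m for g in grad_w], grad_b / m
```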
&lt;h2 id="heading">&lt;/h2></description></item><item><title>Regularization</title><link>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/logistic-regression/regularization/</link><pubDate>Mon, 03 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/logistic-regression/regularization/</guid><description>&lt;h2 id="overfitting">Overfitting&lt;/h2>
&lt;p>🔴 Problem with learning weights that make the model perfectly match the training data:&lt;/p>
&lt;ul>
&lt;li>If a feature is perfectly predictive of the outcome because it happens to only occur in one class, it will be assigned a very high weight. The weights for features will attempt to perfectly fit details of the training set, &lt;em>in fact too perfectly&lt;/em>, modeling noisy factors that just accidentally correlate with the class. 🤪&lt;/li>
&lt;/ul>
&lt;p>This problem is called &lt;strong>overfitting&lt;/strong>.&lt;/p>
&lt;p>A good model should be able to &lt;strong>generalize well from the training data to the &lt;em>unseen&lt;/em> test set&lt;/strong>, but a model that overfits will have &lt;em>poor&lt;/em> generalization. &amp;#x1f622;&lt;/p>
&lt;h2 id="-solution-regularization">🔧 Solution: Regularization&lt;/h2>
&lt;p>Add a regularization term $R(\theta)$ to the objective function:
&lt;/p>
$$
\hat{\theta}=\underset{\theta}{\operatorname{argmax}} \sum_{i=1}^{m} \log P\left(y^{(i)} | x^{(i)}\right)-\alpha R(\theta)
$$
&lt;ul>
&lt;li>$R(\theta)$: penalize large weights
&lt;ul>
&lt;li>a setting of the weights that matches the training data perfectly—but uses many weights with high values to do so—will be penalized more than a setting that matches the data a little less well, but does so using smaller weights.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Two common regularization terms:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>L2 regularization&lt;/strong> (Ridge regression)
&lt;/p>
$$
R(\theta)=\|\theta\|_{2}^{2}=\sum_{j=1}^{n} \theta_{j}^{2}
$$
&lt;ul>
&lt;li>
&lt;p>quadratic function of the weight values&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$\|\theta\|_{2}^{2}$: L2 Norm, is the same as the Euclidean distance of the vector $\theta$ from the origin&lt;/p>
&lt;/li>
&lt;li>
&lt;p>L2 regularized objective function:
&lt;/p>
$$
\hat{\theta}=\underset{\theta}{\operatorname{argmax}}\left[\sum_{i=1}^{m} \log P\left(y^{(i)} | x^{(i)}\right)\right]-\alpha \sum_{j=1}^{n} \theta_{j}^{2}
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>L1 regularization&lt;/strong> (Lasso regression)
&lt;/p>
$$
R(\theta)=\|\theta\|_{1}=\sum_{j=1}^{n}\left|\theta_{j}\right|
$$
&lt;ul>
&lt;li>
&lt;p>linear function of the weight values&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$\|\theta\|_{1}$: L1 Norm, is the sum of the absolute values of the weights.&lt;/p>
&lt;ul>
&lt;li>Also called &lt;strong>Manhattan distance&lt;/strong> (the Manhattan distance is the distance you’d have to walk between two points in a city with a street grid like New York)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>L1 regularized objective function
&lt;/p>
$$
\hat{\theta}=\underset{\theta}{\operatorname{argmax}}\left[\sum_{i=1}^{m} \log P\left(y^{(i)} | x^{(i)}\right)\right]-\alpha \sum_{j=1}^{n}\left|\theta_{j}\right|
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="l1--vs-l2-regularization">L1- Vs. L2-Regularization&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>L2 regularization is easier to optimize because of its simple derivative (the derivative of $\theta^2$ is just $2\theta$), while L1 regularization is more complex (the derivative of $|\theta|$ is not differentiable at zero)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Where L2 prefers weight vectors with many small weights, L1 prefers sparse solutions with some larger weights but many more weights set to zero.&lt;/p>
&lt;ul>
&lt;li>Thus L1 regularization leads to much sparser weight vectors (far fewer features).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
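&lt;p>A sketch of how the L2 term changes the mini-batch cost and gradient: the penalty $\alpha \sum_j \theta_j^2$ adds $2\alpha w_j$ to each weight’s gradient (the value of $\alpha$ here is an arbitrary illustration):&lt;/p>

```python
# L2-regularized mini-batch cost and gradient for logistic regression.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def l2_cost_and_grad(w, b, batch, alpha=0.1):
    m = len(batch)
    cost, grad = 0.0, [0.0] * len(w)
    for x, y in batch:
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
        cost += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for j, xj in enumerate(x):
            grad[j] += (p - y) * xj
    cost = cost / m + alpha * sum(wj * wj for wj in w)          # add the L2 penalty
    grad = [g / m + 2 * alpha * wj for g, wj in zip(grad, w)]   # and its derivative
    return cost, grad
```

With `alpha=0.0` this reduces to the unregularized cost, which makes the effect of the penalty easy to isolate.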
&lt;p>Both L1 and L2 regularization have Bayesian interpretations as constraints on the prior of how weights should look.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>L1 regularization can be viewed as a &lt;strong>Laplace prior&lt;/strong> on the weights.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>L2 regularization corresponds to assuming that weights are distributed according to a gaussian distribution with mean $μ = 0$.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>In a gaussian or normal distribution, the further away a value is from the mean, the lower its probability (scaled by the variance $\sigma^2$)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>By using a gaussian prior on the weights, we are saying that weights prefer to have the value 0.&lt;/p>
&lt;p>A gaussian for a weight $\theta_j$ is:&lt;/p>
&lt;p>$\frac{1}{\sqrt{2 \pi \sigma_{j}^{2}}} \exp \left(-\frac{\left(\theta_{j}-\mu_{j}\right)^{2}}{2 \sigma_{j}^{2}}\right)$&lt;/p>
&lt;p>If we multiply each weight by a gaussian prior on the weight, we are thus maximizing the following constraint:
&lt;/p>
$$
\hat{\theta}=\underset{\theta}{\operatorname{argmax}} \prod_{i=1}^{m} P\left(y^{(i)} | x^{(i)}\right) \times \prod_{j=1}^{n} \frac{1}{\sqrt{2 \pi \sigma_{j}^{2}}} \exp \left(-\frac{\left(\theta_{j}-\mu_{j}\right)^{2}}{2 \sigma_{j}^{2}}\right)
$$
&lt;p>
In log space, with $\mu=0$, and assuming $2\sigma^2=1$, we get:
&lt;/p>
$$
\hat{\theta}=\underset{\theta}{\operatorname{argmax}} \sum_{i=1}^{m} \log P\left(y^{(i)} | x^{(i)}\right)-\alpha \sum_{j=1}^{n} \theta_{j}^{2}
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Multinomial Logistic Regression</title><link>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/logistic-regression/multinomial-log-reg/</link><pubDate>Mon, 03 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/logistic-regression/multinomial-log-reg/</guid><description>&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>More than two classes?&lt;/p>
&lt;p>Use &lt;strong>multinomial logistic regression&lt;/strong> (also called &lt;strong>softmax regression&lt;/strong>, or &lt;strong>maxent classifier&lt;/strong>). The target $y$ is a variable that ranges over &lt;strong>more than two&lt;/strong> classes; we want to know the probability of $y$ being in each potential class $c \in C$: $p(y=c|x)$.&lt;/p>
&lt;p>We use the &lt;strong>softmax&lt;/strong> function to compute $p(y=c|x)$:&lt;/p>
&lt;ul>
&lt;li>Takes a vector $z=[z_1, z_2,\dots, z_k]$ of $k$ arbitrary values&lt;/li>
&lt;li>Maps them to a probability distribution
&lt;ul>
&lt;li>Each value $\in (0, 1)$&lt;/li>
&lt;li>All the values summing to $1$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>For a vector $z$ of dimensionality $k$, the softmax is:
&lt;/p>
$$
\operatorname{softmax}\left(z_{i}\right)=\frac{e^{z_{i}}}{\sum_{j=1}^{k} e^{z_{j}}} \qquad 1 \leq i \leq k
$$
&lt;p>
The softmax of an input vector $z=[z_1, z_2,\dots, z_k]$ is thus:
&lt;/p>
$$
\operatorname{softmax}(z)=\left[\frac{e^{z_{1}}}{\sum_{i=1}^{k} e^{z_{i}}}, \frac{e^{z_{2}}}{\sum_{i=1}^{k} e^{z_{i}}}, \ldots, \frac{e^{z_{k}}}{\sum_{i=1}^{k} e^{z_{i}}}\right]
$$
&lt;ul>
&lt;li>The denominator $\sum_{j=1}^{k} e^{z_{j}}$ is used to normalize all the values into probabilities.&lt;/li>
&lt;/ul>
&lt;p>Like the sigmoid, the input to the softmax will be the dot product between a weight vector $w$ and an input vector $x$ (plus a bias). But now we’ll need separate weight vectors (and bias) for each of the $K$ classes.
&lt;/p>
$$
p(y=c | x)=\frac{e^{w_{c} \cdot x+b_{c}}}{\displaystyle\sum_{j=1}^{k} e^{w_{j} \cdot x+b_{j}}}
$$
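&lt;p>A direct implementation of the softmax (subtracting $\max(z)$ before exponentiating leaves the result unchanged, since the constant cancels in the ratio, but avoids numerical overflow — a standard implementation trick):&lt;/p>

```python
# Numerically stable softmax over a vector of k arbitrary scores.
import math

def softmax(z):
    zmax = max(z)
    exps = [math.exp(zi - zmax) for zi in z]   # shift by max(z) for stability
    total = sum(exps)
    return [e / total for e in exps]           # values in (0, 1), summing to 1
```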
&lt;h2 id="features-in-multinomial-logistic-regression">Features in Multinomial Logistic Regression&lt;/h2>
&lt;p>For multiclass classification, input features are:&lt;/p>
&lt;ul>
&lt;li>observation $x$&lt;/li>
&lt;li>candidate output class $c$&lt;/li>
&lt;/ul>
&lt;p>$\Rightarrow$ When we are discussing features we will use the notation $f_i(c, x)$: feature $i$ for a particular class $c$ for a given observation $x$&lt;/p>
&lt;h3 id="example">&lt;strong>Example&lt;/strong>&lt;/h3>
&lt;p>Suppose we are doing text classification, and instead of binary classification our task is to assign one of the 3 classes +, −, or 0 (neutral) to a document. Now a feature related to exclamation marks might have a negative weight for 0 documents, and a positive weight for + or − documents:&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-05-29%2015.59.37-20200803151242332.png" alt="截屏2020-05-29 15.59.37" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="learning-in-multinomial-logistic-regression">Learning in Multinomial Logistic Regression&lt;/h2>
&lt;p>The loss function for a single example $x$ sums, over the $K$ output classes, the log probability of each class weighted by the indicator of the true class:
&lt;/p>
$$
\begin{aligned}
L_{C E}(\hat{y}, y) &amp;=-\sum_{k=1}^{K} \mathbb{1}\\{y=k\\} \log p(y=k | x) \\\\
&amp;=-\sum_{k=1}^{K} \mathbb{1}\\{y=k\\} \log \frac{e^{w_{k} \cdot x+b_{k}}}{\sum_{j=1}^{K} e^{w_{j} \cdot x+b_{j}}}
\end{aligned}
$$
&lt;ul>
&lt;li>$1\{\}$: evaluates to $1$ if the condition in the brackets is true and to $0$ otherwise.&lt;/li>
&lt;/ul>
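&lt;p>Because of the indicator, only the term for the true class survives, so the loss reduces to $-\log p(y=c|x)$ for the correct class $c$. A small sketch (the weight layout — one weight vector and bias per class — follows the definitions above; the numbers are illustrative):&lt;/p>

```python
# Multinomial cross-entropy loss: -log of the softmax probability of the true class.
import math

def softmax(z):
    zmax = max(z)
    exps = [math.exp(zi - zmax) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def multinomial_ce_loss(W, b, x, y_true):
    """W: one weight vector per class; b: one bias per class; y_true: gold class index."""
    z = [sum(wj * xj for wj, xj in zip(w_k, x)) + b_k for w_k, b_k in zip(W, b)]
    probs = softmax(z)
    return -math.log(probs[y_true])   # only the true class's term survives
```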
&lt;p>Gradient:
&lt;/p>
$$
\begin{aligned}
\frac{\partial L_{C E}}{\partial w_{k}} &amp;=-(\mathbb{1}\\{y=k\\}-p(y=k | x)) x \\\\
&amp;=-\left(\mathbb{1}\\{y=k\\}-\frac{e^{w_{k} \cdot x+b_{k}}}{\sum_{j=1}^{K} e^{w_{j} \cdot x+b_{j}}}\right) x
\end{aligned}
$$</description></item><item><title>Logistic Regression: Summary</title><link>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/logistic-regression/logistic-regression_summary/</link><pubDate>Mon, 03 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/logistic-regression/logistic-regression_summary/</guid><description>&lt;ul>
&lt;li>
&lt;p>Supervised classification&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Input: $x = (x_1, x_2, \dots, x_n)^T$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Output: $y \in \{0, 1\}$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Parameters:&lt;/p>
&lt;ul>
&lt;li>Weight: $w = (w_1, w_2, \dots, w_n)^T$&lt;/li>
&lt;li>Bias $b$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Prediction&lt;/p>
$$
z = w \cdot x + b \\
P(y=1|x)=\sigma(z) = \frac{1}{1+e^{-z}}\\
y=\left\{\begin{array}{ll}
1 &amp; \text { if } P(y=1 | x)>0.5 \\
0 &amp; \text { otherwise }
\end{array}\right.
$$
&lt;/li>
&lt;li>
&lt;p>Training/Learning&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Loss function&lt;/p>
&lt;ul>
&lt;li>
&lt;p>For a single sample $x$
&lt;/p>
$$
\hat{y} = \sigma(w \cdot x + b)
$$
&lt;p>
And we define $\hat{y}:=P(y=1|x)$&lt;/p>
&lt;p>$y \in \{0, 1\} \Rightarrow$&lt;/p>
$$
P(y | x)=\left\{\begin{array}{lr}
\hat{y} &amp; y=1 \\
1-\hat{y} &amp; y=0
\end{array}\right.
$$
&lt;p>The probability of correct prediction can thus be expressed as:
$$
P(y|x)=\hat{y}^y (1-\hat{y})^{1-y}
$$
We want to maximize $P(y|x)$
$$
\begin{array}{ll}
&amp;\max \quad P(y|x) \\
\equiv &amp;\max \quad \log(P(y|x)) \\
= &amp;\max \quad \log(\hat{y}^y (1-\hat{y})^{1-y})\\
= &amp;\max \quad y\log(\hat{y}) + (1-y)\log(1-\hat{y}) \\
\equiv &amp;\min \quad -[y \log \hat{y}+(1-y) \log (1-\hat{y})] \\
= &amp;\min \quad \underbrace{-[y \log \sigma(w \cdot x + b) + (1-y) \log (1-\sigma(w \cdot x + b))]}_{=:L_{CE}(w, b)} \\
\end{array}
$$
$L_{CE}(w, b)$ is called the &lt;strong>Cross-Entropy loss&lt;/strong>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For a mini-batch of samples of size $m$&lt;/p>
&lt;ul>
&lt;li>
&lt;p>$(x^{(i)}, y^{(i)})$: $i$-th training sample&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Loss function is the &lt;strong>average loss&lt;/strong> for each example&lt;/p>
$$
L(w, b) = -\frac{1}{m} \sum_{i=1}^{m} y^{(i)} \log \sigma\left(w \cdot x^{(i)}+b\right)+\left(1-y^{(i)}\right) \log \left(1-\sigma\left(w \cdot x^{(i)}+b\right)\right)
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Algorithm: &lt;strong>Gradient descent&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Gradient for single sample:
$$
\frac{\partial L_{C E}(w, b)}{\partial w_{j}}=[\sigma(w \cdot x+b)-y] x_{j}
$$
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Gradient for mini-batch:
$$
\frac{\partial L(w, b)}{\partial w_{j}}=\frac{1}{m} \sum_{i=1}^{m}\left[\sigma\left(w \cdot x^{(i)}+b\right)-y^{(i)}\right] x_{j}^{(i)}
$$
&lt;/p>
&lt;ul>
&lt;li>$x_j^{(i)}$: $j$-th feature of the $i$-th sample&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
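&lt;p>The pieces of the summary fit together into a minimal full-batch trainer. This sketch uses assumed hyperparameters ($\eta = 0.5$, 500 iterations) and plain Python:&lt;/p>

```python
# End-to-end sketch: prediction, cross-entropy gradient, and gradient descent.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, n_features, eta=0.5, iters=500):
    w, b = [0.0] * n_features, 0.0
    m = len(data)
    for _ in range(iters):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in data:
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj    # [sigma(w.x + b) - y] * x_j, accumulated
            grad_b += err
        w = [wj - eta * g / m for wj, g in zip(w, grad_w)]   # average gradient step
        b = b - eta * grad_b / m
    return w, b

def predict(w, b, x):
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) > 0.5 else 0
```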
&lt;h2 id="multinomial-logistic-regression">Multinomial Logistic Regression&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Also called &lt;strong>Softmax regression&lt;/strong>, &lt;strong>MaxEnt classifier&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Softmax function
$$
\operatorname{softmax}\left(z_{i}\right)=\frac{e^{z_{i}}}{\sum_{j=1}^{k} e^{z_{j}}} \qquad 1 \leq i \leq k
$$
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Compute the probability of $y$ being in each potential class $c \in C$, $p(y=c|x)$, using softmax function:
$$
p(y=c | x)=\frac{e^{w_{c} \cdot x+b_{c}}}{\displaystyle\sum_{j=1}^{k} e^{w_{j} \cdot x+b_{j}}}
$$
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Prediction
$$
\begin{array}{ll}
\hat{c} &amp;= \underset{c}{\arg \max} \quad p(y=c | x) \\
&amp;= \underset{c}{\arg \max} \quad \frac{e^{w_{c} \cdot x+b_{c}}}{\displaystyle\sum_{j=1}^{k} e^{w_{j} \cdot x+b_{j}}}
\end{array}
$$
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Learning&lt;/p>
&lt;ul>
&lt;li>
&lt;p>For a single sample $x$, the loss function is
$$
\begin{aligned}
L_{C E}(w, b) &amp;=-\sum_{k=1}^{K} 1\{y=k\} \log p(y=k | x) \\
&amp;=-\sum_{k=1}^{K} 1\{y=k\} \log \frac{e^{w_{k} \cdot x+b_{k}}}{\sum_{j=1}^{K} e^{w_{j} \cdot x+b_{j}}}
\end{aligned}
$$
&lt;/p>
&lt;ul>
&lt;li>$1\{\}$: evaluates to $1$ if the condition in the brackets is true and to $0$ otherwise.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Gradient
$$
\begin{aligned}
\frac{\partial L_{C E}}{\partial w_{k}} &amp;=-(1\{y=k\}-p(y=k | x)) x \\
&amp;=-\left(1\{y=k\}-\frac{e^{w_{k} \cdot x+b_{k}}}{\sum_{j=1}^{K} e^{w_{j} \cdot x+b_{j}}}\right) x
\end{aligned}
$$
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Logistic Regression in NLP</title><link>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/logistic-regression/logistic_regression/</link><pubDate>Mon, 03 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/logistic-regression/logistic_regression/</guid><description>&lt;h1 id="logistic-regression-in-nlp">Logistic Regression (in NLP)&lt;/h1>
&lt;p>In natural language processing, logistic regression is the &lt;strong>baseline supervised machine learning algorithm for classification&lt;/strong>, and also has a very close relationship with neural networks.&lt;/p>
&lt;h2 id="generative-and-discriminative-classifier">Generative and Discriminative Classifier&lt;/h2>
&lt;p>The most important difference between naive Bayes and logistic regression is that&lt;/p>
&lt;ul>
&lt;li>logistic regression is a &lt;strong>discriminative&lt;/strong> classifier while&lt;/li>
&lt;li>naive Bayes is a &lt;strong>generative&lt;/strong> classifier.&lt;/li>
&lt;/ul>
&lt;p>Consider a visual metaphor: imagine we’re trying to distinguish dog images from cat images.&lt;/p>
&lt;ul>
&lt;li>Generative model
&lt;ul>
&lt;li>Try to understand what dogs look like and what cats look like&lt;/li>
&lt;li>You might literally ask such a model to ‘generate’, i.e. draw, a dog&lt;/li>
&lt;li>Given a test image, the system then asks whether it’s the cat model or the dog model that better fits (is less surprised by) the image, and chooses that as its label.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Discriminative model
&lt;ul>
&lt;li>only trying to learn to distinguish the classes&lt;/li>
&lt;li>So maybe all the dogs in the training data are wearing collars and the cats aren’t. If that one feature neatly separates the classes, the model is satisfied. If you ask such a model what it knows about cats all it can say is that they don’t wear collars.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>More formally, recall that a naive Bayes classifier assigns a class $c$ to a document $d$ NOT by directly computing $p(c|d)$ but by computing a likelihood and a prior
&lt;/p>
$$
\hat{c}=\underset{c \in C}{\operatorname{argmax}} \overbrace{P(d | c)}^{\text { likelihood }} \overbrace{P(c)}^{\text { prior }}
$$
&lt;ul>
&lt;li>
&lt;p>Generative model (like naive Bayes)&lt;/p>
&lt;ul>
&lt;li>Makes use of the likelihood term
&lt;ul>
&lt;li>Expresses how to generate the features of a document &lt;em>if we knew it was of class&lt;/em> $c$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Discriminative model&lt;/p>
&lt;ul>
&lt;li>attempts to directly compute $P(c|d)$&lt;/li>
&lt;li>It will learn to assign a high weight to document features that directly improve its ability to &lt;em>discriminate&lt;/em> between possible classes, even if it couldn’t generate an example of one of the classes.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="components-of-a-probabilistic-machine-learning-classifier">Components of a probabilistic machine learning classifier&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Training corpus of $M$ input/output pairs $(x^{(i)}, y^{(i)})$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>A &lt;strong>feature representation&lt;/strong> of the input&lt;/p>
&lt;ul>
&lt;li>For each input observation $x^{(i)}$, this will be a vector of features $[x_1, x_2, \dots, x_n]$
&lt;ul>
&lt;li>$x_{i}^{(j)}$: feature $i$ for input $x^{(j)}$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>A &lt;strong>classification function&lt;/strong> that computes $\hat{y}$, the estimated class, via $p(y|x)$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>An objective function for learning, usually involving minimizing error on training examples&lt;/p>
&lt;/li>
&lt;li>
&lt;p>An algorithm for optimizing the objective function.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Logistic regression has two phases:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>training&lt;/strong>: we train the system (specifically the weights $w$ and $b$) using stochastic gradient descent and the cross-entropy loss.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>test&lt;/strong>: Given a test example $x$ we compute $p(y|x)$ and return the higher probability label $y=1$ or $y=0$.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="classification-the-sigmoid">Classification: the sigmoid&lt;/h2>
&lt;p>Consider a single input observation $x = [x_1, x_2, \dots, x_n]$&lt;/p>
&lt;p>The classifier output $y$ can be&lt;/p>
&lt;ul>
&lt;li>$1$: the observation is a member of the class&lt;/li>
&lt;li>$0$: the observation is NOT a member of the class&lt;/li>
&lt;/ul>
&lt;p>We want to know the &lt;strong>probability&lt;/strong> $P(y=1|x)$ that this observation is a member of the class.&lt;/p>
&lt;p>&lt;em>E.g.:&lt;/em>&lt;/p>
&lt;ul>
&lt;li>&lt;em>The decision is “positive sentiment” versus “negative sentiment”&lt;/em>&lt;/li>
&lt;li>&lt;em>the features represent counts of words in a document&lt;/em>&lt;/li>
&lt;li>&lt;em>$P(y=1|x)$ is the probability that the document has positive sentiment, while $P(y=0|x)$ is the probability that the document has negative sentiment.&lt;/em>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Logistic regression solves this task by learning, from a training set, a vector of weights and a bias term.&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Each &lt;strong>weight&lt;/strong> $w_i$ is a real number, and is associated with one of the input features $x_i$. The weight represents how important that input feature is to the classification decision, can be&lt;/p>
&lt;ul>
&lt;li>positive (meaning the feature is associated with the class)&lt;/li>
&lt;li>negative (meaning the feature is NOT associated with the class).&lt;/li>
&lt;/ul>
&lt;p>&lt;em>E.g.: we might expect in a sentiment task the word &lt;u>awesome&lt;/u> to have a high positive weight, and &lt;u>abysmal&lt;/u> to have a very negative weight.&lt;/em>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Bias term $b$, also called the &lt;strong>intercept&lt;/strong>, is another real number that’s added to the weighted inputs.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>To make a decision on a test instance, the resulting single number $z$ expresses the weighted sum of the evidence for the class:
&lt;/p>
$$
\begin{array}{ll}
z &amp;=\left(\sum_{i=1}^{n} w_{i} x_{i}\right)+b \\
&amp; = w \cdot x + b \\
&amp; \in (-\infty, \infty)
\end{array}
$$
&lt;p>
(Note that $z$ is NOT a legal probability, since $z \notin [0, 1]$)&lt;/p>
&lt;p>To create a probability, we’ll pass $z$ through the &lt;strong>sigmoid&lt;/strong> function (also called &lt;strong>logistic function&lt;/strong>):
&lt;/p>
$$
y=\sigma(z)=\frac{1}{1+e^{-z}}
$$
&lt;p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-05-28%2010.12.24.png" alt="截屏2020-05-28 10.12.24">&lt;/p>
&lt;p>👍 Advantages of sigmoid:&lt;/p>
&lt;ul>
&lt;li>It takes a real-valued number and maps it into the range [0,1] (which is just what we want for a probability)&lt;/li>
&lt;li>It is nearly linear around 0 but flattens toward the ends, so it tends to squash outlier values toward 0 or 1.&lt;/li>
&lt;li>Differentiable $\Rightarrow$ handy for learning&lt;/li>
&lt;/ul>
&lt;p>To make it a probability, we just need to make sure that the two cases, $P(y=1)$ and $P(y=0)$, sum to 1:
&lt;/p>
$$
\begin{aligned}
P(y=1) &amp;=\sigma(w \cdot x+b) \\
&amp;=\frac{1}{1+e^{-(w \cdot x+b)}} \\
P(y=0) &amp;=1-\sigma(w \cdot x+b) \\
&amp;=1-\frac{1}{1+e^{-(w \cdot x+b)}} \\
&amp;=\frac{e^{-(w \cdot x+b)}}{1+e^{-(w \cdot x+b)}}
\end{aligned}
$$
&lt;p>
Now we have an algorithm that, given an instance $x$, computes the probability $P(y=1|x)$. For a test instance $x$, we say yes if the probability $P(y=1|x)$ is more than 0.5, and no otherwise. We call 0.5 the &lt;strong>decision boundary&lt;/strong>:
&lt;/p>
$$
\text{predict class}=\left\{\begin{array}{ll}
1 &amp; \text { if } P(y=1 | x)>0.5 \\
0 &amp; \text { otherwise }
\end{array}\right.
$$
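&lt;p>The decision rule can be sketched in a few lines (note that thresholding $P(y=1|x)$ at $0.5$ is the same as thresholding $z$ at $0$, since $\sigma(0)=0.5$):&lt;/p>

```python
# Sigmoid and the 0.5 decision boundary for a single input observation.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def decide(w, b, x):
    z = sum(wj * xj for wj, xj in zip(w, x)) + b   # weighted sum of evidence
    return 1 if sigmoid(z) > 0.5 else 0            # class 1 iff P(y=1|x) exceeds 0.5
```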
&lt;h3 id="example-sentiment-classification">Example: sentiment classification&lt;/h3>
&lt;p>Suppose we are doing binary sentiment classification on movie review text, and we would like to know whether to assign the sentiment class + or − to a review document $doc$.&lt;/p>
&lt;p>We’ll represent each input observation by the 6 features $x_1,...,x_6$ of the input shown in the following table&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-28%2010.32.22.png" alt="截屏2020-05-28 10.32.22" style="zoom:80%;" />
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-28%2010.32.56.png" alt="截屏2020-05-28 10.32.56" style="zoom:80%;" />
&lt;p>Assume that for the moment that we’ve already learned a real-valued weight for each of these features, and that the 6 weights corresponding to the 6 features are $w= [2.5,−5.0,−1.2,0.5,2.0,0.7]$, while $b = 0.1$.&lt;/p>
&lt;ul>
&lt;li>The weight $w_1$, for example, indicates how important a feature the number of positive lexicon words (&lt;em>great&lt;/em>, &lt;em>nice&lt;/em>, &lt;em>enjoyable&lt;/em>, etc.) is to a positive sentiment decision, while $w_2$ tells us the importance of negative lexicon words. Note that $w_1 = 2.5$ is positive, while $w_2 = −5.0$, meaning that negative words are negatively associated with a positive sentiment decision, and are &lt;strong>about twice as important as positive words&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;p>Given these 6 features and the input review $x$, $P(+|x)$ and $P(-|x)$ can be computed:
&lt;/p>
$$
\begin{aligned}
p(+| x)=P(Y=1 | x) &amp;=\sigma(w \cdot x+b) \\
&amp;=\sigma([2.5,-5.0,-1.2,0.5,2.0,0.7] \cdot[3,2,1,3,0,4.19]+0.1) \\
&amp;=\sigma(0.833) \\
&amp;=0.70 \\
p(-| x)=P(Y=0 | x) &amp;=1-\sigma(w \cdot x+b) \\
&amp;=0.30
\end{aligned}
$$
&lt;p>
$0.70 > 0.50 \Rightarrow$ This sentiment is positive ($+$).&lt;/p>
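&lt;p>These numbers can be reproduced directly (the weights, bias, and feature values are the assumed ones from the tables above):&lt;/p>

```python
# Reproducing the worked sentiment example.
import math

w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]   # assumed weights from the example
b = 0.1                                 # assumed bias
x = [3, 2, 1, 3, 0, 4.19]               # feature values for the review

z = sum(wj * xj for wj, xj in zip(w, x)) + b   # 0.833
p_pos = 1.0 / (1.0 + math.exp(-z))             # P(+|x), about 0.70
p_neg = 1.0 - p_pos                            # P(-|x), about 0.30
```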
&lt;h2 id="learning-in-logistic-regression">Learning in Logistic Regression&lt;/h2>
&lt;p>Logistic regression is an instance of supervised classification in which we know the correct label $y$ (either 0 or 1) for each observation $x$.&lt;/p>
&lt;p>The system produces/predicts $\hat{y}$, the estimate for the true $y$. We want to learn parameters ($w$ and $b$) that make $\hat{y}$ for each training observation &lt;strong>as close as possible&lt;/strong> to the true $y$. 💪&lt;/p>
&lt;p>This requires two components:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>loss function&lt;/strong>: also called &lt;strong>cost function&lt;/strong>, a metric measures the distance between the system output and the gold output
&lt;ul>
&lt;li>The loss function that is commonly used for logistic regression and also for neural networks is &lt;strong>cross-entropy loss&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Optimization algorithm&lt;/strong> for iteratively updating the weights so as to minimize this loss function
&lt;ul>
&lt;li>Standard algorithm: &lt;strong>gradient descent&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="the-cross-entropy-loss-function">The Cross-Entropy Loss Function&lt;/h2>
&lt;p>We need a loss function that expresses, for an observation $x$, how close the classifier output ($\hat{y}=\sigma(w \cdot x+b)$) is to the correct output ($y$, which is $0$ or $1$):
&lt;/p>
$$
L(\hat{y}, y)= \text{How much } \hat{y} \text{ differs from the true } y
$$
&lt;p>
This loss function should prefer the correct class labels of the training examples to be &lt;em>more likely&lt;/em>.&lt;/p>
&lt;p>👆 This is called &lt;strong>conditional maximum likelihood estimation&lt;/strong>: we choose the parameters $w, b$ that maximize the log probability of the true $y$ labels in the training data given the observations $x$. The resulting loss function is the &lt;em>negative&lt;/em> log likelihood loss, generally called the &lt;strong>cross-entropy loss&lt;/strong>.&lt;/p>
&lt;h3 id="derivation">Derivation&lt;/h3>
&lt;p>Task: for a single observation $x$, learn weights that maximize $p(y|x)$, the probability of the correct label&lt;/p>
&lt;p>There are only two discrete outcomes ($1$ or $0$)&lt;/p>
&lt;p>$\Rightarrow$ This is a &lt;strong>Bernoulli distribution&lt;/strong>. The probability $p(y|x)$ for one observation can be expressed as:
&lt;/p>
$$
p(y | x)=\hat{y}^{y}(1-\hat{y})^{1-y}
$$
&lt;ul>
&lt;li>$y=1, p(y|x)=\hat{y}$&lt;/li>
&lt;li>$y=0, p(y|x)=1-\hat{y}$&lt;/li>
&lt;/ul>
&lt;p>Now we take the log of both sides. This will turn out to be handy mathematically, and doesn’t hurt us (whatever values maximize a probability will also maximize the log of the probability):
&lt;/p>
$$
\begin{aligned}
\log p(y | x) &amp;=\log \left[\hat{y}^{y}(1-\hat{y})^{1-y}\right] \\
&amp;=y \log \hat{y}+(1-y) \log (1-\hat{y})
\end{aligned}
$$
&lt;p>
👆 This is the log likelihood that should be maximized.&lt;/p>
&lt;p>In order to turn this into loss function (something that we need to minimize), we’ll just flip the sign. The result is the &lt;strong>cross-entropy loss&lt;/strong>:
&lt;/p>
$$
L_{C E}(\hat{y}, y)=-\log p(y | x)=-[y \log \hat{y}+(1-y) \log (1-\hat{y})]
$$
&lt;p>
Recall that $\hat{y}=\sigma(w \cdot x+b)$:
&lt;/p>
$$
L_{C E}(w, b)=-[y \log \sigma(w \cdot x+b)+(1-y) \log (1-\sigma(w \cdot x+b))]
$$
&lt;h3 id="example">Example&lt;/h3>
&lt;p>Let’s see if this loss function does the right thing for example above.&lt;/p>
&lt;p>We want the loss to be&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>smaller&lt;/strong> if the model’s estimate is &lt;strong>close to correct&lt;/strong>, and&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>bigger&lt;/strong> if the model is &lt;strong>confused&lt;/strong>.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Let’s suppose the correct gold label for the sentiment example above is positive, i.e.: $y=1$.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>In this case our model is doing well 👏, since it gave the example a higher probability of being positive ($0.70$) than negative ($0.30$).&lt;/p>
&lt;p>If we plug $\sigma(w \cdot x+b)=0.70$ and $y=1$ into the cross-entropy loss, we get&lt;/p>
&lt;/li>
&lt;/ul>
$$
\begin{aligned}
L_{C E}(w, b) &amp;=-[y \log \sigma(w \cdot x+b)+(1-y) \log (1-\sigma(w \cdot x+b))] \\
&amp;=-[\log \sigma(w \cdot x+b)] \\
&amp;=-\log (0.70) \\
&amp;=0.36
\end{aligned}
$$
&lt;p>By contrast, let&amp;rsquo;s pretend instead that the example was negative, i.e.: $y=0$.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>In this case our model is confused 🤪, and we’d want the loss to be higher.&lt;/p>
&lt;p>If we plug $y=0$ and $1-\sigma(w \cdot x+b)=0.30$ into the cross-entropy loss, we get
&lt;/p>
$$
\begin{aligned}
L_{C E}(w, b) &amp;=-[y \log \sigma(w \cdot x+b)+(1-y) \log (1-\sigma(w \cdot x+b))] \\
&amp;= -[\log (1-\sigma(w \cdot x+b))] \\
&amp;=-\log (0.30) \\
&amp;= 1.20
\end{aligned}
$$
&lt;/li>
&lt;/ul>
&lt;p>It&amp;rsquo;s obvious that the loss for the first classifier ($0.36$) is less than the loss for the second classifier ($1.20$).&lt;/p>
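&lt;p>Both loss values are easy to verify (natural logarithm, as in the derivation):&lt;/p>

```python
# Checking the two cross-entropy loss values from the example above.
import math

loss_correct = -math.log(0.70)    # model right: about 0.36
loss_confused = -math.log(0.30)   # model confused: about 1.20
```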
&lt;h3 id="why-minimizing-this-negative-log-probability-works">Why minimizing this negative log probability works?&lt;/h3>
&lt;p>A perfect classifier would assign probability $1$ to the correct outcome and probability $0$ to the incorrect outcome. That means:&lt;/p>
&lt;ul>
&lt;li>the higher $\hat{y}$ (the closer it is to 1), the better the classifier;&lt;/li>
&lt;li>the lower $\hat{y}$ is (the closer it is to 0), the worse the classifier.&lt;/li>
&lt;/ul>
&lt;p>The negative log of this probability is a convenient loss metric since it goes from 0 (negative log of 1, no loss) to infinity (negative log of 0, infinite loss). This loss function also ensures that as the probability of the correct answer is maximized, the probability of the incorrect answer is minimized; since the two sum to one, any increase in the probability of the correct answer is coming at the expense of the incorrect answer.&lt;/p>
&lt;h2 id="gradient-descent">Gradient Descent&lt;/h2>
&lt;p>Goal with gradient descent: find the optimal weights that minimize the loss function we&amp;rsquo;ve defined for the model. From now on, we’ll explicitly represent the fact that the loss function $L$ is parameterized by the weights $\theta$ (in the case of logistic regression $\theta=(w, b)$):
&lt;/p>
$$
\hat{\theta}=\underset{\theta}{\operatorname{argmin}} \frac{1}{m} \sum_{i=1}^{m} L_{C E}\left(y^{(i)}, x^{(i)} ; \theta\right)
$$
&lt;p>
Gradient descent finds a minimum of a function by figuring out in which direction (in the space of the parameters $\theta$) the function’s slope is rising the most steeply, and moving in the &lt;em>&lt;strong>opposite&lt;/strong>&lt;/em> direction.&lt;/p>
&lt;blockquote>
&lt;p>💡 Intuition&lt;/p>
&lt;p>if you are hiking in a canyon and trying to descend most quickly down to the river at the bottom, you might look around yourself 360 degrees, find the direction where the ground is sloping the steepest, and walk downhill in that direction.&lt;/p>
&lt;/blockquote>
&lt;p>For logistic regression, this loss function is conveniently &lt;strong>convex&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Just one minimum&lt;/li>
&lt;li>No local minima to get stuck in&lt;/li>
&lt;/ul>
&lt;p>$\Rightarrow$ Gradient descent starting from any point is guaranteed to find the minimum.&lt;/p>
&lt;p>Visualization:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-28%2022.32.50.png" alt="截屏2020-05-28 22.32.50" style="zoom:70%;" />
&lt;p>The magnitude of the amount to move in gradient descent is the value of the slope $\frac{d}{d w} f(x ; w)$ weighted by a &lt;strong>learning rate&lt;/strong> $\eta$. A higher (faster) learning rate means that we should move &lt;em>w&lt;/em> more on each step.&lt;/p>
&lt;p>In the single-variable example above, the change we make in our parameter is
&lt;/p>
$$
w^{t+1}=w^{t}-\eta \frac{d}{d w} f(x ; w)
$$
&lt;p>
In $N$-dimensional space, the gradient is a vector that expresses the directional components of the sharpest slope along each of those $N$ dimensions.&lt;/p>
&lt;p>Visualization (E.g., $N=2$):&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-28%2022.41.23.png" alt="截屏2020-05-28 22.41.23" style="zoom:80%;" />
&lt;p>In each dimension $w_i$, we express the slope as a &lt;strong>partial derivative&lt;/strong> $\frac{\partial}{\partial w_i}$ of the loss function. The gradient is defined as a vector of these partials:
&lt;/p>
$$
\nabla_{\theta} L(f(x ; \theta), y)=\left[\begin{array}{c}
\frac{\partial}{\partial w_{1}} L(f(x ; \theta), y) \\
\frac{\partial}{\partial w_{2}} L(f(x ; \theta), y) \\
\vdots \\
\frac{\partial}{\partial w_{n}} L(f(x ; \theta), y)
\end{array}\right]
$$
&lt;p>
Thus, the change of $\theta$ is:
&lt;/p>
$$
\theta_{t+1}=\theta_{t}-\eta \nabla L(f(x ; \theta), y)
$$
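&lt;p>The update rule above can be sketched on a toy convex function $f(w)=(w-3)^2$ (an example made up here for illustration): repeatedly stepping against the derivative $f'(w)=2(w-3)$ walks $w$ to the minimum at $w=3$.&lt;/p>

```python
def gradient_step(w, grad, eta):
    """One gradient-descent update: move against the gradient, scaled by eta."""
    return w - eta * grad

# Toy convex function f(w) = (w - 3)**2 with derivative f'(w) = 2 * (w - 3).
w = 0.0
eta = 0.1
for _ in range(100):
    w = gradient_step(w, 2 * (w - 3), eta)
print(round(w, 4))  # very close to the minimum at w = 3
```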
&lt;h3 id="the-gradient-for-logistic-regression">The gradient for Logistic Regression&lt;/h3>
&lt;p>For logistic regression, the cross-entropy loss function is
&lt;/p>
$$
L_{C E}(w, b)=-[y \log \sigma(w \cdot x+b)+(1-y) \log (1-\sigma(w \cdot x+b))]
$$
&lt;p>
The derivative of this loss function is:
&lt;/p>
$$
\frac{\partial L_{C E}(w, b)}{\partial w_{j}}=[\sigma(w \cdot x+b)-y] x_{j}
$$
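&lt;p>A quick way to trust this formula is to compare it against a finite-difference estimate of the loss (a small sketch with made-up numbers):&lt;/p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Binary cross-entropy loss for one example (x, y)."""
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def grad_wj(w, b, x, y, j):
    """Analytic gradient: [sigma(w . x + b) - y] * x_j"""
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return (p - y) * x[j]

# Compare against a forward finite difference on a toy example.
w, b, x, y = [0.5, -0.2], 0.1, [1.0, 2.0], 1
eps = 1e-6
w_plus = [w[0] + eps, w[1]]
numeric = (loss(w_plus, b, x, y) - loss(w, b, x, y)) / eps
print(abs(grad_wj(w, b, x, y, 0) - numeric))  # tiny difference
```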
&lt;blockquote>
&lt;p>For derivation of the derivative above we need:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>derivative of $\ln(x)$:
&lt;/p>
$$
\frac{d}{d x} \ln (x)=\frac{1}{x}
$$
&lt;/li>
&lt;li>
&lt;p>derivative of the sigmoid:
&lt;/p>
$$
\frac{d \sigma(z)}{d z}=\sigma(z)(1-\sigma(z))
$$
&lt;/li>
&lt;li>
&lt;p>Chain rule of derivative: for $f(x)=u(v(x))$,
&lt;/p>
$$
\frac{d f}{d x}=\frac{d u}{d v} \cdot \frac{d v}{d x}
$$
&lt;/li>
&lt;/ul>
&lt;p>Now compute the derivative:
&lt;/p>
$$
\begin{aligned}
\frac{\partial L_{CE}(w, b)}{\partial w_{j}} &amp;=\frac{\partial}{\partial w_{j}}-[y \log \sigma(w \cdot x+b)+(1-y) \log (1-\sigma(w \cdot x+b))] \\
&amp;=-\frac{\partial}{\partial w_{j}} y \log \sigma(w \cdot x+b) - \frac{\partial}{\partial w_{j}}(1-y) \log [1-\sigma(w \cdot x+b)] \\
&amp;\overset{\text{chain rule}}{=} -\frac{y}{\sigma(w \cdot x+b)} \frac{\partial}{\partial w_{j}} \sigma(w \cdot x+b)-\frac{1-y}{1-\sigma(w \cdot x+b)} \frac{\partial}{\partial w_{j}}[1-\sigma(w \cdot x+b)]\\
&amp;= -\left[\frac{y}{\sigma(w \cdot x+b)}-\frac{1-y}{1-\sigma(w \cdot x+b)}\right] \frac{\partial}{\partial w_{j}} \sigma(w \cdot x+b) \\
\end{aligned}
$$
&lt;p>Now plug in the derivative of the sigmoid, and use the chain rule one more time:
&lt;/p>
$$
\begin{aligned}
\frac{\partial L_{CE}(w, b)}{\partial w_{j}} &amp;=-\left[\frac{y-\sigma(w \cdot x+b)}{\sigma(w \cdot x+b)[1-\sigma(w \cdot x+b)]}\right] \sigma(w \cdot x+b)[1-\sigma(w \cdot x+b)] \frac{\partial(w \cdot x+b)}{\partial w_{j}} \\
&amp;=-\left[\frac{y-\sigma(w \cdot x+b)}{\sigma(w \cdot x+b)[1-\sigma(w \cdot x+b)]}\right] \sigma(w \cdot x+b)[1-\sigma(w \cdot x+b)] x_{j} \\
&amp;=-[y-\sigma(w \cdot x+b)] x_{j} \\
&amp;=[\sigma(w \cdot x+b)-y] x_{j}
\end{aligned}
$$
&lt;/blockquote>
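&lt;p>The sigmoid derivative used in the derivation can be sanity-checked numerically (a minimal finite-difference sketch):&lt;/p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Finite-difference check of d(sigma)/dz = sigma(z) * (1 - sigma(z)).
z, eps = 0.7, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z)) / eps
analytic = sigmoid(z) * (1 - sigmoid(z))
print(abs(numeric - analytic))  # tiny difference
```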
&lt;h3 id="stochastic-gradient-descent">Stochastic Gradient descent&lt;/h3>
&lt;p>Stochastic gradient descent is an online algorithm that minimizes the loss function by&lt;/p>
&lt;ul>
&lt;li>computing its gradient after each training example, and&lt;/li>
&lt;li>nudging $\theta$ in the right direction (the opposite direction of the gradient).&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-28%2023.01.53.png" alt="截屏2020-05-28 23.01.53" style="zoom:80%;" />
&lt;p>The learning rate $\eta$ is a (hyper-)parameter that must be adjusted.&lt;/p>
&lt;ul>
&lt;li>If it’s too high, the learner will take steps that are too large, overshooting the minimum of the loss function.&lt;/li>
&lt;li>If it’s too low, the learner will take steps that are too small, and take too long to get to the minimum.&lt;/li>
&lt;/ul>
&lt;p>It is common to begin the learning rate at a higher value, and then slowly decrease it, so that it is a function of the iteration $k$ of training.&lt;/p>
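&lt;p>Stochastic gradient descent for logistic regression can be sketched as follows (a toy 1-D example with made-up data; the function names are ours):&lt;/p>

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd(data, eta=0.5, steps=100, seed=0):
    """Stochastic gradient descent for 1-D logistic regression (toy sketch)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        x, y = rng.choice(data)          # one random training example
        p = sigmoid(w * x + b)
        w -= eta * (p - y) * x           # nudge opposite the gradient
        b -= eta * (p - y)
    return w, b

# Made-up separable data: negative x is class 0, positive x is class 1.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b = sgd(data)
print(sigmoid(2.0 * w + b))    # high probability for a positive example
print(sigmoid(-2.0 * w + b))   # low probability for a negative example
```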
&lt;h3 id="mini-batch-training">Mini-batch training&lt;/h3>
&lt;p>&lt;strong>Stochastic&lt;/strong> gradient descent: chooses a &lt;strong>single&lt;/strong> random example at a time, moving the weights so as to improve performance on that single example.&lt;/p>
&lt;ul>
&lt;li>Can result in very choppy movements&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Batch&lt;/strong> gradient descent: compute the gradient over the &lt;strong>entire&lt;/strong> dataset.&lt;/p>
&lt;ul>
&lt;li>Offers a superb estimate of which direction to move the weights&lt;/li>
&lt;li>Spends a lot of time processing every single example in the training set to compute this perfect direction.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Mini-batch&lt;/strong> gradient descent&lt;/p>
&lt;ul>
&lt;li>we train on a group of $m$ examples (perhaps 512, or 1024) that is less than the whole dataset.&lt;/li>
&lt;li>Has the advantage of computational efficiency
&lt;ul>
&lt;li>The mini-batches can easily be vectorized, choosing the size of the mini-batch based on the computational resources.&lt;/li>
&lt;li>This allows us to process all the examples in one mini-batch in parallel and then accumulate the loss&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Define the mini-batch version of the cross-entropy loss function (assuming the training examples are independent):
&lt;/p>
$$
\begin{aligned}
\log p(\text {training labels}) &amp;=\log \prod_{i=1}^{m} p\left(y^{(i)} | x^{(i)}\right) \\
&amp;=\sum_{i=1}^{m} \log p\left(y^{(i)} | x^{(i)}\right) \\
&amp;=-\sum_{i=1}^{m} L_{C E}\left(\hat{y}^{(i)}, y^{(i)}\right)
\end{aligned}
$$
&lt;p>
The cost function for the mini-batch of $m$ examples is the &lt;strong>average loss&lt;/strong> for each example:
&lt;/p>
$$
\begin{aligned}
\operatorname{cost}(w, b) &amp;=\frac{1}{m} \sum_{i=1}^{m} L_{C E}\left(\hat{y}^{(i)}, y^{(i)}\right) \\
&amp;=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \log \sigma\left(w \cdot x^{(i)}+b\right)+\left(1-y^{(i)}\right) \log \left(1-\sigma\left(w \cdot x^{(i)}+b\right)\right)\right]
\end{aligned}
$$
&lt;p>
The mini-batch gradient is the average of the individual gradients:
&lt;/p>
$$
\frac{\partial \operatorname{cost}(w, b)}{\partial w_{j}}=\frac{1}{m} \sum_{i=1}^{m}\left[\sigma\left(w \cdot x^{(i)}+b\right)-y^{(i)}\right] x_{j}^{(i)}
$$
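&lt;p>The averaged mini-batch gradient vectorizes naturally (a sketch assuming NumPy; &lt;code>minibatch_grads&lt;/code> is our own helper name):&lt;/p>

```python
import numpy as np

def minibatch_grads(w, b, X, y):
    """Average cross-entropy gradients over a mini-batch.

    X: (m, n) feature matrix, y: (m,) labels in {0, 1}.
    Returns (grad_w, grad_b), each averaged over the m examples.
    """
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigma(w . x + b) per example
    err = p - y                              # [sigma(.) - y] per example
    grad_w = X.T @ err / len(y)              # average of err_i * x_i
    grad_b = err.mean()
    return grad_w, grad_b

# Made-up mini-batch of m = 3 examples with n = 2 features.
X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]])
y = np.array([1.0, 0.0, 1.0])
gw, gb = minibatch_grads(np.zeros(2), 0.0, X, y)
print(gw, gb)
```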
&lt;h2 id="regularization">Regularization&lt;/h2>
&lt;p>🔴 There is a problem with learning weights that make the model perfectly match the training data:&lt;/p>
&lt;ul>
&lt;li>If a feature is perfectly predictive of the outcome because it happens to only occur in one class, it will be assigned a very high weight. The weights for features will attempt to perfectly fit details of the training set, &lt;em>in fact too perfectly&lt;/em>, modeling noisy factors that just accidentally correlate with the class. 🤪&lt;/li>
&lt;/ul>
&lt;p>This problem is called &lt;strong>overfitting&lt;/strong>.&lt;/p>
&lt;p>A good model should be able to &lt;strong>generalize well from the training data to the &lt;em>unseen&lt;/em> test set&lt;/strong>, but a model that overfits will have poor generalization.&lt;/p>
&lt;p>🔧 Solution: Add a regularization term $R(\theta)$ to the objective function:
&lt;/p>
$$
\hat{\theta}=\underset{\theta}{\operatorname{argmax}} \sum_{i=1}^{m} \log P\left(y^{(i)} | x^{(i)}\right)-\alpha R(\theta)
$$
&lt;ul>
&lt;li>$R(\theta)$: penalize large weights
&lt;ul>
&lt;li>a setting of the weights that matches the training data perfectly, but uses many large weights to do so, will be penalized more than a setting that matches the data a little less well but does so using smaller weights.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Two common regularization terms:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>L2 regularization&lt;/strong> (Ridge regression)
&lt;/p>
$$
R(\theta)=\|\theta\|_{2}^{2}=\sum_{j=1}^{n} \theta_{j}^{2}
$$
&lt;ul>
&lt;li>
&lt;p>quadratic function of the weight values&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$\|\theta\|_{2}^{2}$: L2 Norm, is the same as the Euclidean distance of the vector $\theta$ from the origin&lt;/p>
&lt;/li>
&lt;li>
&lt;p>L2 regularized objective function:
&lt;/p>
$$
\hat{\theta}=\underset{\theta}{\operatorname{argmax}}\left[\sum_{i=1}^{m} \log P\left(y^{(i)} | x^{(i)}\right)\right]-\alpha \sum_{j=1}^{n} \theta_{j}^{2}
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>L1 regularization&lt;/strong> (Lasso regression)
&lt;/p>
$$
R(\theta)=\|\theta\|_{1}=\sum_{j=1}^{n}\left|\theta_{j}\right|
$$
&lt;ul>
&lt;li>
&lt;p>linear function of the weight values&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$\|\theta\|_{1}$: L1 Norm, is the sum of the absolute values of the weights.&lt;/p>
&lt;ul>
&lt;li>Also called &lt;strong>Manhattan distance&lt;/strong> (the Manhattan distance is the distance you’d have to walk between two points in a city with a street grid like New York)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>L1 regularized objective function
&lt;/p>
$$
\hat{\theta}=\underset{\theta}{\operatorname{argmax}}\left[\sum_{i=1}^{m} \log P\left(y^{(i)} | x^{(i)}\right)\right]-\alpha \sum_{j=1}^{n}\left|\theta_{j}\right|
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="l1-vs-l2">L1 Vs. L2&lt;/h3>
&lt;ul>
&lt;li>L2 regularization is easier to optimize because of its simple derivative (the derivative of $\theta^2$ is just $2\theta$), while L1 regularization is harder to optimize ($|\theta|$ is not differentiable at zero)&lt;/li>
&lt;li>Where L2 prefers weight vectors with many small weights, L1 prefers sparse solutions with some larger weights but many more weights set to zero.
&lt;ul>
&lt;li>Thus L1 regularization leads to much sparser weight vectors (far fewer features).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
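&lt;p>A tiny numeric example of the sparsity difference (a sketch with made-up weight vectors): both vectors below pay the same L2 penalty, but L1 charges the spread-out one twice as much, so L1 favors the sparse one.&lt;/p>

```python
def l1(theta):
    """L1 norm: sum of absolute weight values."""
    return sum(abs(t) for t in theta)

def l2(theta):
    """Squared L2 norm: sum of squared weight values."""
    return sum(t * t for t in theta)

spread = [0.5, 0.5, 0.5, 0.5]   # many small weights
sparse = [1.0, 0.0, 0.0, 0.0]   # one larger weight, the rest zero

print(l2(spread), l2(sparse))   # 1.0 1.0  (same L2 penalty)
print(l1(spread), l1(sparse))   # 2.0 1.0  (L1 prefers the sparse vector)
```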
&lt;p>Both L1 and L2 regularization have Bayesian interpretations as constraints on the prior of how weights should look.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>L1 regularization can be viewed as a &lt;strong>Laplace prior&lt;/strong> on the weights.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>L2 regularization corresponds to assuming that weights are distributed according to a gaussian distribution with mean $μ = 0$.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>In a gaussian or normal distribution, the further away a value is from the mean, the lower its probability (scaled by the variance $\sigma^2$)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>By using a gaussian prior on the weights, we are saying that weights prefer to have the value 0.&lt;/p>
&lt;p>A gaussian for a weight $\theta_j$ is:&lt;/p>
&lt;p>$\frac{1}{\sqrt{2 \pi \sigma_{j}^{2}}} \exp \left(-\frac{\left(\theta_{j}-\mu_{j}\right)^{2}}{2 \sigma_{j}^{2}}\right)$&lt;/p>
&lt;p>If we multiply each weight by a gaussian prior on the weight, we are thus maximizing the following constraint:
&lt;/p>
$$
\hat{\theta}=\underset{\theta}{\operatorname{argmax}} \prod_{i=1}^{m} P\left(y^{(i)} | x^{(i)}\right) \times \prod_{j=1}^{n} \frac{1}{\sqrt{2 \pi \sigma_{j}^{2}}} \exp \left(-\frac{\left(\theta_{j}-\mu_{j}\right)^{2}}{2 \sigma_{j}^{2}}\right)
$$
&lt;p>
In log space, with $\mu=0$, and assuming $2\sigma^2=1$, we get:
&lt;/p>
$$
\hat{\theta}=\underset{\theta}{\operatorname{argmax}} \sum_{i=1}^{m} \log P\left(y^{(i)} | x^{(i)}\right)-\alpha \sum_{j=1}^{n} \theta_{j}^{2}
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="multinomial-logistic-regression">Multinomial Logistic Regression&lt;/h2>
&lt;p>More than two classes?&lt;/p>
&lt;p>Use &lt;strong>multinomial logistic regression&lt;/strong> (also called &lt;strong>softmax regression&lt;/strong>, or &lt;strong>maxent classifier&lt;/strong>). The target $y$ is a variable that ranges over more than two classes; we want to know the probability of $y$ being in each potential class $c \in C$, i.e., $p(y=c | x)$.&lt;/p>
&lt;p>We use the &lt;strong>softmax&lt;/strong> function to compute $p(y=c|x)$:&lt;/p>
&lt;ul>
&lt;li>Takes a vector $z=[z_1, z_2,\dots, z_k]$ of $k$ arbitrary values&lt;/li>
&lt;li>Maps them to a probability distribution
&lt;ul>
&lt;li>Each value $\in (0, 1)$&lt;/li>
&lt;li>All the values summing to $1$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>For a vector $z$ of dimensionality $k$, the softmax is:
&lt;/p>
$$
\operatorname{softmax}\left(z_{i}\right)=\frac{e^{z_{i}}}{\sum_{j=1}^{k} e^{z_{j}}} \qquad 1 \leq i \leq k
$$
&lt;p>
The softmax of an input vector $z=[z_1, z_2,\dots, z_k]$ is thus:
&lt;/p>
$$
\operatorname{softmax}(z)=\left[\frac{e^{z_{1}}}{\sum_{i=1}^{k} e^{z_{i}}}, \frac{e^{z_{2}}}{\sum_{i=1}^{k} e^{z_{i}}}, \ldots, \frac{e^{z_{k}}}{\sum_{i=1}^{k} e^{z_{i}}}\right]
$$
&lt;ul>
&lt;li>The denominator $\sum_{j=1}^{k} e^{z_{j}}$ is used to normalize all the values into probabilities.&lt;/li>
&lt;/ul>
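&lt;p>A numerically stable softmax subtracts $\max(z)$ before exponentiating; this leaves the result unchanged because the shift cancels between numerator and denominator (a sketch assuming NumPy):&lt;/p>

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: shift by max(z) before exponentiating."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p.sum())      # the outputs form a probability distribution: 1.0
print(p.argmax())   # index of the largest score
```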
&lt;p>Like the sigmoid, the input to the softmax will be the dot product between a weight vector $w$ and an input vector $x$ (plus a bias). But now we’ll need separate weight vectors (and bias) for each of the $K$ classes.
&lt;/p>
$$
p(y=c | x)=\frac{e^{w_{c} \cdot x+b_{c}}}{\displaystyle\sum_{j=1}^{k} e^{w_{j} \cdot x+b_{j}}}
$$
&lt;h3 id="features-in-multinomial-logistic-regression">Features in Multinomial Logistic Regression&lt;/h3>
&lt;p>For multiclass classification, input features are:&lt;/p>
&lt;ul>
&lt;li>observation $x$&lt;/li>
&lt;li>candidate output class $c$&lt;/li>
&lt;/ul>
&lt;p>$\Rightarrow$ When we are discussing features we will use the notation $f_i(c, x)$: feature $i$ for a particular class $c$ for a given observation $x$&lt;/p>
&lt;p>&lt;strong>Example&lt;/strong>&lt;/p>
&lt;p>&lt;em>Suppose we are doing text classification, and instead of binary classification our task is to assign one of the 3 classes +, −, or 0 (neutral) to a document. Now a feature related to exclamation marks might have a negative weight for 0 documents, and a positive weight for + or − documents:&lt;/em>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-05-29%2015.59.37.png" alt="截屏2020-05-29 15.59.37">&lt;/p>
&lt;h3 id="learning-in-multinomial-logistic-regression">Learning in Multinomial Logistic Regression&lt;/h3>
&lt;p>The loss function for a single example $x$ is the negative log probability of the true class, written as a sum over the $K$ output classes:
&lt;/p>
$$
\begin{aligned}
L_{C E}(\hat{y}, y) &amp;=-\sum_{k=1}^{K} 1\{y=k\} \log p(y=k | x) \\
&amp;=-\sum_{k=1}^{K} 1\{y=k\} \log \frac{e^{w_{k} \cdot x+b_{k}}}{\sum_{j=1}^{K} e^{w_{j} \cdot x+b_{j}}}
\end{aligned}
$$
&lt;ul>
&lt;li>$1\{\}$: evaluates to $1$ if the condition in the brackets is true and to $0$ otherwise.&lt;/li>
&lt;/ul>
&lt;p>Gradient:
&lt;/p>
$$
\begin{aligned}
\frac{\partial L_{C E}}{\partial w_{k}} &amp;=-(1\{y=k\}-p(y=k | x)) x_{k} \\
&amp;=-\left(1\{y=k\}-\frac{e^{w_{k} \cdot x+b_{k}}}{\sum_{j=1}^{K} e^{w_{j} \cdot x+b_{j}}}\right) x_{k}
\end{aligned}
$$
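&lt;p>Loss and gradient for one example can be sketched together (assuming NumPy; &lt;code>multinomial_ce&lt;/code> is our own name). With zero weights all classes are equally likely, so the loss is $-\log(1/3) \approx 1.10$:&lt;/p>

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def multinomial_ce(W, b, x, y):
    """Cross-entropy loss and per-class weight gradients for one example.

    W: (K, n) weight matrix, b: (K,) biases, y: true class index.
    """
    p = softmax(W @ x + b)             # p(y=c | x) for each class c
    loss = -np.log(p[y])               # only the true class contributes
    one_hot = np.zeros_like(p)
    one_hot[y] = 1.0
    grad_W = np.outer(p - one_hot, x)  # -(1{y=k} - p(y=k|x)) x, row per class
    return loss, grad_W

# Toy setup: K = 3 classes, n = 2 features, all weights zero.
W = np.zeros((3, 2))
b = np.zeros(3)
loss, grad = multinomial_ce(W, b, np.array([1.0, 2.0]), y=0)
print(round(float(loss), 4))  # -log(1/3), since all classes start equally likely
```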
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://web.stanford.edu/~jurafsky/slp3/5.pdf">Logistic Regression&lt;/a>&lt;/li>
&lt;/ul></description></item></channel></rss>