Logistic Regression: Summary

  • Supervised classification

  • Input: $x = (x_1, x_2, \dots, x_n)^T$

  • Output: $y \in \{0, 1\}$

  • Parameters:

    • Weights: $w = (w_1, w_2, \dots, w_n)^T$
    • Bias: $b$
  • Prediction

    $$ z = w \cdot x + b \\ P(y=1|x)=\sigma(z) = \frac{1}{1+e^{-z}}\\ y=\left\{\begin{array}{ll} 1 & \text { if } P(y=1 | x)>0.5 \\ 0 & \text { otherwise } \end{array}\right. $$
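    A minimal NumPy sketch of this prediction rule; the example weights, bias, and input below are made up purely for illustration:

    ```python
    import numpy as np

    def sigmoid(z):
        """Logistic function: sigma(z) = 1 / (1 + exp(-z))."""
        return 1.0 / (1.0 + np.exp(-z))

    def predict(x, w, b):
        """Return (P(y=1|x), hard 0/1 prediction) for one input vector x."""
        z = np.dot(w, x) + b          # z = w . x + b
        p = sigmoid(z)                # P(y=1|x)
        return p, int(p > 0.5)        # threshold at 0.5

    # Illustrative values (not from the notes)
    w = np.array([0.5, -1.2, 0.3])
    b = 0.1
    x = np.array([1.0, 0.2, 2.0])
    print(predict(x, w, b))           # (approx. 0.723, 1)
    ```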
  • Training/Learning

    • Loss function

      • For a single sample $x$

        $$ \hat{y} = \sigma(w \cdot x + b) $$

        where we define $\hat{y} := P(y=1|x)$, the model's estimate of the probability that $y = 1$.

        $y \in \{0, 1\} \Rightarrow$

        $$ P(y | x)=\left\{\begin{array}{lr} \hat{y} & y=1 \\ 1-\hat{y} & y=0 \end{array}\right. $$

        The probability of correct prediction can thus be expressed as: $$ P(y|x)=\hat{y}^y (1-\hat{y})^{1-y} $$ We want to maximize $P(y|x)$ $$ \begin{array}{ll} &\max \quad P(y|x) \\ \equiv &\max \quad \log(P(y|x)) \\ = &\max \quad \log(\hat{y}^y (1-\hat{y})^{1-y})\\ = &\max \quad y\log(\hat{y}) + (1-y)\log(1-\hat{y}) \\ \equiv &\min \quad -[y \log \hat{y}+(1-y) \log (1-\hat{y})] \\ = &\min \quad \underbrace{-[y \log \sigma(w \cdot x + b) + (1-y) \log (1-\sigma(w \cdot x + b))]}_{=:L_{CE}(w, b)} \\ \end{array} $$ $L_{CE}(w, b)$ is called the Cross-Entropy loss.
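        A sketch of $L_{CE}$ computed directly from $\hat{y}$; the small epsilon clamp is an implementation detail added here to avoid $\log 0$, not part of the derivation:

        ```python
        import numpy as np

        def cross_entropy_single(y, y_hat, eps=1e-12):
            """L_CE = -[y log(y_hat) + (1 - y) log(1 - y_hat)], with y in {0, 1}."""
            y_hat = np.clip(y_hat, eps, 1.0 - eps)   # guard against log(0)
            return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

        # If y = 1 and the model outputs y_hat = 0.9, the loss is -log(0.9) ≈ 0.105
        print(cross_entropy_single(1, 0.9))
        ```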

      • For a mini-batch of samples of size $m$

        • $(x^{(i)}, y^{(i)})$: the $i$-th training sample

        • The loss function is the average of the per-sample losses:

          $$ L(w, b) = -\frac{1}{m} \sum_{i=1}^{m}\left[ y^{(i)} \log \sigma\left(w \cdot x^{(i)}+b\right)+\left(1-y^{(i)}\right) \log \left(1-\sigma\left(w \cdot x^{(i)}+b\right)\right)\right] $$
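        A vectorized sketch of this averaged loss, assuming the $m$ samples are stacked row-wise in a matrix `X` and the labels in a vector `Y` (names chosen here for illustration, not from the notes):

        ```python
        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        def batch_loss(X, Y, w, b, eps=1e-12):
            """Average cross-entropy over a mini-batch.
            X: (m, n) matrix of inputs, Y: (m,) vector of 0/1 labels."""
            y_hat = sigmoid(X @ w + b)               # sigma(w . x^(i) + b) for every i
            y_hat = np.clip(y_hat, eps, 1.0 - eps)   # avoid log(0)
            return -np.mean(Y * np.log(y_hat) + (1 - Y) * np.log(1 - y_hat))
        ```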
    • Algorithm: Gradient descent

      • Gradient for single sample: $$ \frac{\partial L_{C E}(w, b)}{\partial w_{j}}=[\sigma(w \cdot x+b)-y] x_{j} $$

      • Gradient for mini-batch: $$ \frac{\partial L(w, b)}{\partial w_{j}}=\frac{1}{m} \sum_{i=1}^{m}\left[\sigma\left(w \cdot x^{(i)}+b\right)-y^{(i)}\right] x_{j}^{(i)} $$

        • $x_j^{(i)}$: $j$-th feature of the $i$-th sample
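      Plugging the mini-batch gradient above into a plain gradient-descent loop. This is only a sketch: the learning rate, iteration count, and the bias gradient $\frac{1}{m}\sum_{i}\left[\sigma\left(w \cdot x^{(i)}+b\right)-y^{(i)}\right]$ are additions not spelled out in the notes.

      ```python
      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def train(X, Y, lr=0.1, n_iters=1000):
          """Plain gradient descent on the averaged cross-entropy loss.
          X: (m, n) inputs, Y: (m,) labels in {0, 1}."""
          m, n = X.shape
          w = np.zeros(n)
          b = 0.0
          for _ in range(n_iters):
              error = sigmoid(X @ w + b) - Y    # sigma(w . x^(i) + b) - y^(i), shape (m,)
              grad_w = X.T @ error / m          # dL/dw_j = mean_i [ error_i * x_j^(i) ]
              grad_b = error.mean()             # analogous bias gradient (not derived above)
              w -= lr * grad_w
              b -= lr * grad_b
          return w, b
      ```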

Multinomial Logistic Regression

  • Also called softmax regression or the MaxEnt classifier

  • Softmax function $$ \operatorname{softmax}\left(z_{i}\right)=\frac{e^{z_{i}}}{\sum_{j=1}^{k} e^{z_{j}}} \qquad 1 \leq i \leq k $$

  • Compute the probability of $y$ being in each potential class $c \in C$, $p(y=c|x)$, using softmax function: $$ p(y=c | x)=\frac{e^{w_{c} \cdot x+b_{c}}}{\displaystyle\sum_{j=1}^{k} e^{w_{j} \cdot x+b_{j}}} $$

  • Prediction $$ \begin{array}{ll} \hat{c} &= \underset{c}{\arg \max} \quad p(y=c | x) \\ &= \underset{c}{\arg \max} \quad \frac{e^{w_{c} \cdot x+b_{c}}}{\displaystyle\sum_{j=1}^{k} e^{w_{j} \cdot x+b_{j}}} \end{array} $$
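    A sketch of the softmax and the arg-max prediction above, assuming the per-class weight vectors $w_c$ are stored as rows of a matrix `W` (shape $k \times n$) and the biases in a vector `B`; the max-shift inside `softmax` is a standard numerical-stability trick, not something the notes require.

    ```python
    import numpy as np

    def softmax(z):
        """softmax(z_i) = exp(z_i) / sum_j exp(z_j), shifted by max(z) for stability."""
        e = np.exp(z - np.max(z))
        return e / e.sum()

    def predict_class(x, W, B):
        """Return (p(y=c|x) for every class c, arg-max class index)."""
        probs = softmax(W @ x + B)     # z_c = w_c . x + b_c, with row c of W holding w_c
        return probs, int(np.argmax(probs))
    ```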

  • Learning

    • For a single sample $x$, the loss function is $$ \begin{aligned} L_{C E}(w, b) &=-\sum_{k=1}^{K} 1\{y=k\} \log p(y=k | x) \\ &=-\sum_{k=1}^{K} 1\{y=k\} \log \frac{e^{w_{k} \cdot x+b_{k}}}{\sum_{j=1}^{K} e^{w_{j} \cdot x+b_{j}}} \end{aligned} $$

      • $1\{\cdot\}$: the indicator function, which evaluates to $1$ if the condition inside the braces is true and to $0$ otherwise.
    • Gradient with respect to the class-$k$ weight vector $w_k$ $$ \begin{aligned} \frac{\partial L_{C E}}{\partial w_{k}} &=-(1\{y=k\}-p(y=k | x))\, x \\ &=-\left(1\{y=k\}-\frac{e^{w_{k} \cdot x+b_{k}}}{\sum_{j=1}^{K} e^{w_{j} \cdot x+b_{j}}}\right) x \end{aligned} $$
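      A sketch of this loss and gradient using the same `W`, `B` layout as above; the one-hot vector plays the role of $1\{y=k\}$, and the bias gradient is an added detail not derived in the notes.

      ```python
      import numpy as np

      def softmax(z):
          e = np.exp(z - np.max(z))
          return e / e.sum()

      def ce_loss_and_grad(x, y, W, B):
          """Single-sample cross-entropy loss and its gradients.
          x: (n,) input, y: true class index, W: (K, n) weights, B: (K,) biases."""
          probs = softmax(W @ x + B)              # p(y=c|x) for every class c
          loss = -np.log(probs[y])                # -sum_k 1{y=k} log p(y=k|x)
          one_hot = np.zeros_like(probs)
          one_hot[y] = 1.0
          grad_W = np.outer(probs - one_hot, x)   # row k: -(1{y=k} - p(y=k|x)) x
          grad_B = probs - one_hot                # analogous bias gradient
          return loss, grad_W, grad_B
      ```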