Logistic Regression: Summary

  • Supervised classification

  • Input: $x = (x_1, x_2, \dots, x_n)^T$

  • Output: $y \in \{0, 1\}$

  • Parameters:

    • Weight: $w = (w_1, w_2, \dots, w_n)^T$
    • Bias: $b$
  • Prediction

    $$
    \begin{array}{l}
    z = w \cdot x + b \\
    P(y=1|x) = \sigma(z) = \dfrac{1}{1+e^{-z}} \\
    y = \left\{\begin{array}{ll} 1 & \text{if } P(y=1|x) > 0.5 \\ 0 & \text{otherwise} \end{array}\right.
    \end{array}
    $$
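
    A minimal sketch of the prediction rule above in NumPy; the weight, input, and bias values are made-up examples, not taken from the notes:

    ```python
    import numpy as np

    def sigmoid(z):
        # sigma(z) = 1 / (1 + e^(-z))
        return 1.0 / (1.0 + np.exp(-z))

    def predict(w, x, b, threshold=0.5):
        # z = w . x + b, P(y=1|x) = sigma(z); predict 1 if the probability exceeds the threshold
        p = sigmoid(np.dot(w, x) + b)
        return int(p > threshold), p

    w = np.array([0.5, -1.2, 0.3])  # example weights (assumed)
    x = np.array([1.0, 0.4, 2.0])   # example input (assumed)
    b = 0.1                         # example bias (assumed)
    print(predict(w, x, b))         # -> (predicted label, P(y=1|x))
    ```
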
  • Training/Learning

    • Loss function

      • For a single sample $x$:

        $$\hat{y} = \sigma(w \cdot x + b)$$

        and we define $\hat{y} := P(y=1|x)$. Since $y \in \{0, 1\}$, this gives

        $$
        P(y | x)=\left\{\begin{array}{lr} \hat{y} & y=1 \\ 1-\hat{y} & y=0 \end{array}\right.
        $$

        The probability of a correct prediction can thus be expressed as

        $$P(y|x)=\hat{y}^y (1-\hat{y})^{1-y}$$

        We want to maximize $P(y|x)$:

        $$
        \begin{array}{ll}
        &\max \quad P(y|x) \\
        \equiv &\max \quad \log(P(y|x)) \\
        = &\max \quad \log(\hat{y}^y (1-\hat{y})^{1-y})\\
        = &\max \quad y\log(\hat{y}) + (1-y)\log(1-\hat{y}) \\
        \equiv &\min \quad -[y \log \hat{y}+(1-y) \log (1-\hat{y})] \\
        = &\min \quad \underbrace{-[y \log \sigma(w \cdot x + b) + (1-y) \log (1-\sigma(w \cdot x + b))]}_{=:L_{CE}(w, b)}
        \end{array}
        $$

        $L_{CE}(w, b)$ is called the cross-entropy loss.
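
        A minimal sketch of this single-sample cross-entropy loss, assuming NumPy; the `eps` clipping is an added numerical-safety detail, not part of the derivation:

        ```python
        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        def cross_entropy_single(w, x, b, y, eps=1e-12):
            # L_CE(w, b) = -[ y log(y_hat) + (1 - y) log(1 - y_hat) ],  y_hat = sigma(w . x + b)
            y_hat = sigmoid(np.dot(w, x) + b)
            y_hat = np.clip(y_hat, eps, 1.0 - eps)  # guard against log(0)
            return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
        ```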

      • For a mini-batch of samples of size $m$

        • $(x^{(i)}, y^{(i)})$: $i$-th training sample

        • The loss function is the average of the per-example losses:

          $$
          L(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \sigma\left(w \cdot x^{(i)}+b\right)+\left(1-y^{(i)}\right) \log \left(1-\sigma\left(w \cdot x^{(i)}+b\right)\right) \right]
          $$
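
          A vectorized sketch of the mini-batch loss, assuming `X` is an (m, n) NumPy array with one sample per row and `y` a length-m vector of 0/1 labels (these names are illustrative assumptions):

          ```python
          import numpy as np

          def sigmoid(z):
              return 1.0 / (1.0 + np.exp(-z))

          def batch_loss(w, b, X, y, eps=1e-12):
              # L(w, b) = -(1/m) sum_i [ y_i log(y_hat_i) + (1 - y_i) log(1 - y_hat_i) ]
              y_hat = sigmoid(X @ w + b)
              y_hat = np.clip(y_hat, eps, 1.0 - eps)  # guard against log(0)
              return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
          ```
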
    • Algorithm: Gradient descent

      • Gradient for a single sample:

        $$
        \frac{\partial L_{CE}(w, b)}{\partial w_{j}}=[\sigma(w \cdot x+b)-y]\, x_{j}
        $$

      • Gradient for a mini-batch:

        $$
        \frac{\partial L(w, b)}{\partial w_{j}}=\frac{1}{m} \sum_{i=1}^{m}\left[\sigma\left(w \cdot x^{(i)}+b\right)-y^{(i)}\right] x_{j}^{(i)}
        $$

        • $x_j^{(i)}$: $j$-th feature of the $i$-th sample
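
      A minimal gradient-descent sketch combining the gradients above with the corresponding bias update (the bias gradient has the same form with the feature term dropped); the learning rate and epoch count are assumed example values:

      ```python
      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def train_logistic_regression(X, y, lr=0.1, epochs=100):
          # Gradient descent on L(w, b) over the given batch; pass mini-batches of (X, y) for mini-batch SGD.
          # X: (m, n) samples, y: (m,) labels in {0, 1}
          m, n = X.shape
          w = np.zeros(n)
          b = 0.0
          for _ in range(epochs):
              error = sigmoid(X @ w + b) - y   # sigma(w . x_i + b) - y_i for every sample
              grad_w = (X.T @ error) / m       # dL/dw_j = (1/m) sum_i error_i * x_j^(i)
              grad_b = np.mean(error)          # dL/db   = (1/m) sum_i error_i
              w -= lr * grad_w                 # gradient-descent update
              b -= lr * grad_b
          return w, b
      ```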

Multinomial Logistic Regression

  • Also called softmax regression or the MaxEnt (maximum entropy) classifier

  • Softmax function:

    $$
    \operatorname{softmax}\left(z_{i}\right)=\frac{e^{z_{i}}}{\sum_{j=1}^{k} e^{z_{j}}} \qquad 1 \leq i \leq k
    $$
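
    A small NumPy sketch of the softmax function; subtracting max(z) before exponentiating is an added numerical-stability step that leaves the result unchanged:

    ```python
    import numpy as np

    def softmax(z):
        # softmax(z_i) = e^(z_i) / sum_j e^(z_j)
        z = np.asarray(z, dtype=float)
        e = np.exp(z - np.max(z))  # shift by max(z) for stability; ratios are unchanged
        return e / e.sum()

    print(softmax([1.0, 2.0, 3.0]))  # probabilities summing to 1
    ```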

  • Compute the probability of $y$ belonging to each potential class $c \in C$, $p(y=c|x)$, using the softmax function:

    $$
    p(y=c | x)=\frac{e^{w_{c} \cdot x+b_{c}}}{\displaystyle\sum_{j=1}^{k} e^{w_{j} \cdot x+b_{j}}}
    $$

  • Prediction:

    $$
    \begin{array}{ll}
    \hat{c} &= \underset{c}{\arg \max} \quad p(y=c | x) \\
    &= \underset{c}{\arg \max} \quad \dfrac{e^{w_{c} \cdot x+b_{c}}}{\displaystyle\sum_{j=1}^{k} e^{w_{j} \cdot x+b_{j}}}
    \end{array}
    $$
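
    A sketch of multinomial prediction, assuming `W` is a (k, n) matrix whose rows are the per-class weight vectors $w_c$ and `b` is a length-k bias vector (these names and shapes are assumptions):

    ```python
    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    def predict_class(W, b, x):
        # p(y=c|x) = softmax(w_c . x + b_c); predict the argmax class
        probs = softmax(W @ x + b)
        return int(np.argmax(probs)), probs
    ```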

  • Learning

    • For a single sample $x$, the loss function is

      $$
      \begin{aligned}
      L_{CE}(w, b) &=-\sum_{k=1}^{K} 1\{y=k\} \log p(y=k | x) \\
      &=-\sum_{k=1}^{K} 1\{y=k\} \log \frac{e^{w_{k} \cdot x+b_{k}}}{\sum_{j=1}^{K} e^{w_{j} \cdot x+b_{j}}}
      \end{aligned}
      $$

      • $1\{\cdot\}$: evaluates to $1$ if the condition in the brackets is true and to $0$ otherwise.
    • Gradient (with respect to the class-$k$ weight vector $w_k$):

      $$
      \begin{aligned}
      \frac{\partial L_{CE}}{\partial w_{k}} &=-(1\{y=k\}-p(y=k | x))\, x \\
      &=-\left(1\{y=k\}-\frac{e^{w_{k} \cdot x+b_{k}}}{\sum_{j=1}^{K} e^{w_{j} \cdot x+b_{j}}}\right) x
      \end{aligned}
      $$
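
    A sketch of the single-sample multinomial cross-entropy loss and its gradients, assuming the same `W`, `b`, `x` shapes as in the prediction sketch and an integer class label `y` in {0, ..., K-1}; the bias gradient, not stated in the notes, follows the same form without the feature vector:

    ```python
    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    def multinomial_ce_and_grads(W, b, x, y):
        # W: (K, n), b: (K,), x: (n,), y: index of the true class
        probs = softmax(W @ x + b)             # p(y=k|x) for every class k
        loss = -np.log(probs[y])               # only the 1{y=k} term survives the sum
        one_hot = np.zeros_like(probs)
        one_hot[y] = 1.0
        grad_W = np.outer(probs - one_hot, x)  # dL/dw_k = -(1{y=k} - p(y=k|x)) x
        grad_b = probs - one_hot               # dL/db_k = -(1{y=k} - p(y=k|x))
        return loss, grad_W, grad_b
    ```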