Logistic Regression: Summary

  • Supervised classification

  • Input: $x = (x_1, x_2, \dots, x_n)^T$

  • Output: $y \in \{0, 1\}$

  • Parameters:

    • Weight: $w = (w_1, w_2, \dots, w_n)^T$
    • Bias: $b$
  • Prediction

    $$
    \begin{array}{l}
    z = w \cdot x + b \\
    P(y=1|x) = \sigma(z) = \dfrac{1}{1+e^{-z}} \\
    y = \left\{\begin{array}{ll} 1 & \text{if } P(y=1|x) > 0.5 \\ 0 & \text{otherwise} \end{array}\right.
    \end{array}
    $$
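
    A minimal sketch of the prediction rule above in NumPy; the weight, input, and bias values are made-up examples, not taken from the notes:

    ```python
    import numpy as np

    def sigmoid(z):
        # sigma(z) = 1 / (1 + e^(-z))
        return 1.0 / (1.0 + np.exp(-z))

    def predict(w, x, b, threshold=0.5):
        # z = w . x + b, P(y=1|x) = sigma(z); predict 1 if the probability exceeds the threshold
        p = sigmoid(np.dot(w, x) + b)
        return int(p > threshold), p

    w = np.array([0.5, -1.2, 0.3])  # example weights (assumed)
    x = np.array([1.0, 0.4, 2.0])   # example input (assumed)
    b = 0.1                         # example bias (assumed)
    print(predict(w, x, b))         # -> (predicted label, P(y=1|x))
    ```
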
  • Training/Learning

    • Loss function

      • For a single sample $x$:

        $$\hat{y} = \sigma(w \cdot x + b)$$

        and we define $\hat{y} := P(y=1|x)$. Since $y \in \{0, 1\}$, this gives

        $$
        P(y | x)=\left\{\begin{array}{lr} \hat{y} & y=1 \\ 1-\hat{y} & y=0 \end{array}\right.
        $$

        The probability of a correct prediction can thus be expressed as

        $$P(y|x)=\hat{y}^y (1-\hat{y})^{1-y}$$

        We want to maximize $P(y|x)$:

        $$
        \begin{array}{ll}
        &\max \quad P(y|x) \\
        \equiv &\max \quad \log(P(y|x)) \\
        = &\max \quad \log(\hat{y}^y (1-\hat{y})^{1-y})\\
        = &\max \quad y\log(\hat{y}) + (1-y)\log(1-\hat{y}) \\
        \equiv &\min \quad -[y \log \hat{y}+(1-y) \log (1-\hat{y})] \\
        = &\min \quad \underbrace{-[y \log \sigma(w \cdot x + b) + (1-y) \log (1-\sigma(w \cdot x + b))]}_{=:L_{CE}(w, b)}
        \end{array}
        $$

        $L_{CE}(w, b)$ is called the cross-entropy loss.
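
        A minimal sketch of this single-sample cross-entropy loss, assuming NumPy; the `eps` clipping is an added numerical-safety detail, not part of the derivation:

        ```python
        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        def cross_entropy_single(w, x, b, y, eps=1e-12):
            # L_CE(w, b) = -[ y log(y_hat) + (1 - y) log(1 - y_hat) ],  y_hat = sigma(w . x + b)
            y_hat = sigmoid(np.dot(w, x) + b)
            y_hat = np.clip(y_hat, eps, 1.0 - eps)  # guard against log(0)
            return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
        ```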

      • For a mini-batch of samples of size $m$

        • $(x^{(i)}, y^{(i)})$: $i$-th training sample

        • The loss function is the average of the per-example losses:

          $$
          L(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \sigma\left(w \cdot x^{(i)}+b\right)+\left(1-y^{(i)}\right) \log \left(1-\sigma\left(w \cdot x^{(i)}+b\right)\right) \right]
          $$
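
          A vectorized sketch of the mini-batch loss, assuming `X` is an (m, n) NumPy array with one sample per row and `y` a length-m vector of 0/1 labels (these names are illustrative assumptions):

          ```python
          import numpy as np

          def sigmoid(z):
              return 1.0 / (1.0 + np.exp(-z))

          def batch_loss(w, b, X, y, eps=1e-12):
              # L(w, b) = -(1/m) sum_i [ y_i log(y_hat_i) + (1 - y_i) log(1 - y_hat_i) ]
              y_hat = sigmoid(X @ w + b)
              y_hat = np.clip(y_hat, eps, 1.0 - eps)  # guard against log(0)
              return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
          ```
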
    • Algorithm: Gradient descent

      • Gradient for a single sample:

        $$
        \frac{\partial L_{CE}(w, b)}{\partial w_{j}}=[\sigma(w \cdot x+b)-y]\, x_{j}
        $$

      • Gradient for a mini-batch:

        $$
        \frac{\partial L(w, b)}{\partial w_{j}}=\frac{1}{m} \sum_{i=1}^{m}\left[\sigma\left(w \cdot x^{(i)}+b\right)-y^{(i)}\right] x_{j}^{(i)}
        $$

        • $x_j^{(i)}$: $j$-th feature of the $i$-th sample
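
      A minimal gradient-descent sketch combining the gradients above with the corresponding bias update (the bias gradient has the same form with the feature term dropped); the learning rate and epoch count are assumed example values:

      ```python
      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def train_logistic_regression(X, y, lr=0.1, epochs=100):
          # Gradient descent on L(w, b) over the given batch; pass mini-batches of (X, y) for mini-batch SGD.
          # X: (m, n) samples, y: (m,) labels in {0, 1}
          m, n = X.shape
          w = np.zeros(n)
          b = 0.0
          for _ in range(epochs):
              error = sigmoid(X @ w + b) - y   # sigma(w . x_i + b) - y_i for every sample
              grad_w = (X.T @ error) / m       # dL/dw_j = (1/m) sum_i error_i * x_j^(i)
              grad_b = np.mean(error)          # dL/db   = (1/m) sum_i error_i
              w -= lr * grad_w                 # gradient-descent update
              b -= lr * grad_b
          return w, b
      ```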

Multinomial Logistic Regression

  • Also called softmax regression or the MaxEnt (maximum entropy) classifier

  • Softmax function:

    $$
    \operatorname{softmax}\left(z_{i}\right)=\frac{e^{z_{i}}}{\sum_{j=1}^{k} e^{z_{j}}} \qquad 1 \leq i \leq k
    $$
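
    A small NumPy sketch of the softmax function; subtracting max(z) before exponentiating is an added numerical-stability step that leaves the result unchanged:

    ```python
    import numpy as np

    def softmax(z):
        # softmax(z_i) = e^(z_i) / sum_j e^(z_j)
        z = np.asarray(z, dtype=float)
        e = np.exp(z - np.max(z))  # shift by max(z) for stability; ratios are unchanged
        return e / e.sum()

    print(softmax([1.0, 2.0, 3.0]))  # probabilities summing to 1
    ```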

  • Compute the probability of $y$ belonging to each potential class $c \in C$, $p(y=c|x)$, using the softmax function:

    $$
    p(y=c | x)=\frac{e^{w_{c} \cdot x+b_{c}}}{\displaystyle\sum_{j=1}^{k} e^{w_{j} \cdot x+b_{j}}}
    $$

  • Prediction:

    $$
    \begin{array}{ll}
    \hat{c} &= \underset{c}{\arg \max} \quad p(y=c | x) \\
    &= \underset{c}{\arg \max} \quad \dfrac{e^{w_{c} \cdot x+b_{c}}}{\displaystyle\sum_{j=1}^{k} e^{w_{j} \cdot x+b_{j}}}
    \end{array}
    $$
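
    A sketch of multinomial prediction, assuming `W` is a (k, n) matrix whose rows are the per-class weight vectors $w_c$ and `b` is a length-k bias vector (these names and shapes are assumptions):

    ```python
    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    def predict_class(W, b, x):
        # p(y=c|x) = softmax(w_c . x + b_c); predict the argmax class
        probs = softmax(W @ x + b)
        return int(np.argmax(probs)), probs
    ```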

  • Learning

    • For a single sample $x$, the loss function is

      $$
      \begin{aligned}
      L_{CE}(w, b) &=-\sum_{k=1}^{K} 1\{y=k\} \log p(y=k | x) \\
      &=-\sum_{k=1}^{K} 1\{y=k\} \log \frac{e^{w_{k} \cdot x+b_{k}}}{\sum_{j=1}^{K} e^{w_{j} \cdot x+b_{j}}}
      \end{aligned}
      $$

      • $1\{\cdot\}$: evaluates to $1$ if the condition in the brackets is true and to $0$ otherwise.
    • Gradient (with respect to the class-$k$ weight vector $w_k$):

      $$
      \begin{aligned}
      \frac{\partial L_{CE}}{\partial w_{k}} &=-(1\{y=k\}-p(y=k | x))\, x \\
      &=-\left(1\{y=k\}-\frac{e^{w_{k} \cdot x+b_{k}}}{\sum_{j=1}^{K} e^{w_{j} \cdot x+b_{j}}}\right) x
      \end{aligned}
      $$
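
    A sketch of the single-sample multinomial cross-entropy loss and its gradients, assuming the same `W`, `b`, `x` shapes as in the prediction sketch and an integer class label `y` in {0, ..., K-1}; the bias gradient, not stated in the notes, follows the same form without the feature vector:

    ```python
    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    def multinomial_ce_and_grads(W, b, x, y):
        # W: (K, n), b: (K,), x: (n,), y: index of the true class
        probs = softmax(W @ x + b)             # p(y=k|x) for every class k
        loss = -np.log(probs[y])               # only the 1{y=k} term survives the sum
        one_hot = np.zeros_like(probs)
        one_hot[y] = 1.0
        grad_W = np.outer(probs - one_hot, x)  # dL/dw_k = -(1{y=k} - p(y=k|x)) x
        grad_b = probs - one_hot               # dL/db_k = -(1{y=k} - p(y=k|x))
        return loss, grad_W, grad_b
    ```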