Logistic Regression: Basics
💡 Use a regression-style algorithm for classification
Logistic regression: estimate the probability that an instance belongs to a particular class
- If the estimated probability is greater than 50%, then the model predicts that the instance belongs to that class (called the positive class, labeled “1”),
- or else it predicts that it does not (i.e., it belongs to the negative class, labeled “0”).
This makes it a binary classifier.
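As a minimal sketch of this binary setup, we can use scikit-learn's `LogisticRegression`; the "Iris virginica vs. the rest" task below is our own illustrative choice, not part of the notes:

```python
# Binary classification with logistic regression (illustrative example:
# predict whether an iris is of class 2, Iris virginica).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
y = (y == 2).astype(int)  # 1 = positive class, 0 = negative class

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict(X[:3]))        # predicted class labels (0 or 1)
print(clf.predict_proba(X[:3]))  # estimated probabilities per class
```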
Logistic / Sigmoid function
$\sigma(t)=\frac{1}{1+\exp (-t)}$
Bounded: $\sigma(t) \in (0, 1)$
Symmetric: $1 - \sigma(t) = \sigma(-t)$
Derivative: $\sigma^{\prime}(t)=\sigma(t)(1-\sigma(t))$
Estimating probabilities and making predictions
Computes a weighted sum of the input features (plus a bias term)
Outputs the logistic of this result
$\hat{p}=h_{\theta}(\mathbf{x})=\sigma\left(\mathbf{x}^{\mathrm{T}} \boldsymbol{\theta}\right)$
Prediction:
$$ \hat{y} = \begin{cases} 0 & \text{ if } \hat{p}<0.5\left(\Leftrightarrow \mathbf{x}^{\mathrm{T}} \boldsymbol{\theta}<0\right) \\\\ 1 & \text{ if }\hat{p} \geq 0.5\left(\Leftrightarrow \mathbf{x}^{\mathrm{T}} \boldsymbol{\theta} \geq 0\right)\end{cases} $$
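The probability estimate and decision rule above can be sketched directly in NumPy; `theta` and `x` are made-up values, with the bias term folded in as a constant first input $x_0 = 1$:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def predict_proba(x, theta):
    return sigmoid(x @ theta)  # p_hat = sigma(x^T theta)

def predict(x, theta):
    # p_hat >= 0.5 exactly when the score x^T theta >= 0
    return int(predict_proba(x, theta) >= 0.5)

theta = np.array([-1.0, 2.0])  # bias term, one feature weight
x = np.array([1.0, 0.8])       # x[0] = 1 is the bias input
print(predict_proba(x, theta))  # score = 0.6 > 0, so p_hat > 0.5
print(predict(x, theta))        # -> 1
```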
Training and the cost function
Objective of training: to set the parameter vector $\boldsymbol{\theta}$ so that the model estimates:
- high probabilities ($\geq 0.5$) for positive instances ($y=1$)
- low probabilities ($< 0.5$) for negative instances ($y=0$)
Cost function of a single training instance:
$$ c(\boldsymbol{\theta}) = \begin{cases} -\log (\hat{p}) & \text{ if } y=1 \\\\ -\log (1-\hat{p}) & \text{ if } y=0\end{cases} $$
- Actual label: $y=1$, misclassification: $\hat{y} = 0 \Leftrightarrow$ $\hat{p} = \sigma(\mathbf{x}^{\mathrm{T}} \boldsymbol{\theta})$ close to 0 $\Leftrightarrow c(\boldsymbol{\theta})$ large
- Actual label: $y=0$, misclassification: $\hat{y} = 1 \Leftrightarrow$ $\hat{p} = \sigma(\mathbf{x}^{\mathrm{T}} \boldsymbol{\theta})$ close to 1 $\Leftrightarrow c(\boldsymbol{\theta})$ large
The cost function over the whole training set
Simply the average cost over all training instances (combining the two cases above into a single expression):
$\begin{aligned} J(\boldsymbol{\theta}) &=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \log \left(\hat{p}^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-\hat{p}^{(i)}\right)\right] \\\\ &=\frac{1}{m} \sum_{i=1}^{m}\left[-y^{(i)} \log \left(\hat{p}^{(i)}\right)-\left(1-y^{(i)}\right) \log \left(1-\hat{p}^{(i)}\right)\right] \end{aligned}$
- $y^{(i)} =1:-y^{(i)} \log \left(\hat{p}^{(i)}\right)-\left(1-y^{(i)}\right) \log \left(1-\hat{p}^{(i)}\right)=-\log \left(\hat{p}^{(i)}\right)$
- $y^{(i)} =0:-y^{(i)} \log \left(\hat{p}^{(i)}\right)-\left(1-y^{(i)}\right) \log \left(1-\hat{p}^{(i)}\right)=-\log \left(1-\hat{p}^{(i)}\right)$ (Exactly the same as $c(\boldsymbol{\theta})$ for a single instance above 👏)
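The averaged cost $J(\boldsymbol{\theta})$ (the log loss) can be sketched in NumPy as below; the function name `log_loss`, the small `eps` clipping to avoid $\log 0$, and the tiny dataset are our own illustrative choices:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_loss(theta, X, y, eps=1e-12):
    # Average of -y*log(p_hat) - (1-y)*log(1-p_hat) over all m instances
    p_hat = sigmoid(X @ theta)
    p_hat = np.clip(p_hat, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

theta = np.array([0.0, 1.0])
X = np.c_[np.ones(2), np.array([3.0, -3.0])]  # bias column + one feature
y = np.array([1, 0])
print(log_loss(theta, X, y))  # confident correct predictions -> small cost
```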
Training
No known closed-form equation to compute the $\boldsymbol{\theta}$ that minimizes this cost function 🤪
But $J(\boldsymbol{\theta})$ is convex, so Gradient Descent (or any other optimization algorithm) is guaranteed to find the global minimum
Partial derivative of the cost function with respect to the $j$-th model parameter $\theta_j$:
$$ \frac{\partial}{\partial \theta_{j}} J(\boldsymbol{\theta})=\frac{1}{m} \displaystyle \sum_{i=1}^{m}\left(\sigma\left(\boldsymbol{\theta}^{T} \mathbf{x}^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)} $$
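Putting the pieces together, batch Gradient Descent on the log loss uses exactly this partial-derivative formula (vectorized over all $j$ at once); the learning rate, iteration count, and toy dataset below are arbitrary illustrative choices:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, eta=0.1, n_iters=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        # Gradient vector: (1/m) * X^T (sigma(X theta) - y)
        gradient = X.T @ (sigmoid(X @ theta) - y) / m
        theta -= eta * gradient
    return theta

# Tiny linearly separable example: bias column of 1s plus one feature.
X = np.c_[np.ones(4), np.array([-2.0, -1.0, 1.0, 2.0])]
y = np.array([0, 0, 1, 1])
theta = fit_logistic(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(int)
print(preds)  # recovers the training labels on this separable data
```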
