2020-08-03
Sequence model/classifier: assigns a label or class to each unit in a sequence, mapping a sequence of observations to a sequence of labels. The Hidden Markov Model (HMM) is a probabilistic sequence model.
Parts of speech (a.k.a. POS, word classes, or syntactic categories) are useful because they reveal a lot about a word and its neighbors. E.g., knowing whether a word is a noun or a verb tells us about likely neighboring words.
Logistic Regression (in NLP) In natural language processing, logistic regression is the baseline supervised machine learning algorithm for classification, and it also has a very close relationship with neural networks. Generative and Discriminative Classifier The most important difference between naive Bayes and logistic regression is that naive Bayes is a generative classifier, while logistic regression is a discriminative classifier.
Supervised classification Input: $x = (x_1, x_2, \dots, x_n)^T$ Output: $y \in \{0, 1\}$ Parameters: Weights: $w = (w_1, w_2, \dots, w_n)^T$ Bias: $b$ Prediction $$ z = w \cdot x + b \\ P(y=1|x)=\sigma(z) = \frac{1}{1+e^{-z}}\\ y=\left\{\begin{array}{ll} 1 & \text { if } P(y=1 | x)>0.5 \\ 0 & \text { otherwise } \end{array}\right. $$
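The decision rule above can be sketched in a few lines of plain Python (a minimal illustration; the function names `sigmoid` and `predict` are my own, not from the original notes):

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps a real-valued score z to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, b):
    """Binary logistic-regression decision: 1 if P(y=1|x) > 0.5, else 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b   # z = w . x + b
    return 1 if sigmoid(z) > 0.5 else 0
```

Note that $\sigma(z) > 0.5$ exactly when $z > 0$, so the decision boundary is the hyperplane $w \cdot x + b = 0$.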
Motivation More than two classes? Use multinomial logistic regression (also called softmax regression, or the maxent classifier). The target $y$ is a variable that ranges over more than two classes; we want to know the probability of $y$ being in each potential class $c \in C$, i.e. $p(y=c|x)$.
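The softmax function that gives multinomial logistic regression its name turns a vector of per-class scores into a probability distribution. A minimal sketch (the max-subtraction trick is a standard numerical-stability device, not something stated in the notes):

```python
import math

def softmax(scores):
    """Map a list of class scores z_c to probabilities p(y=c|x)."""
    m = max(scores)                         # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The outputs are all positive and sum to 1, and the highest-scoring class receives the highest probability.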
Overfitting 🔴 Problem with learning weights that make the model perfectly match the training data: If a feature is perfectly predictive of the outcome because it happens to only occur in one class, it will be assigned a very high weight.
Overview 🎯 Goal with gradient descent: find the optimal weights that minimize the loss function we’ve defined for the model. From now on, we’ll explicitly represent the fact that the loss function $L$ is parameterized by the weights $\theta$ (in the case of logistic regression $\theta=(w, b)$): $$ \hat{\theta}=\underset{\theta}{\operatorname{argmin}} \frac{1}{m} \sum_{i=1}^{m} L_{C E}\left(y^{(i)}, x^{(i)} ; \theta\right) $$ Gradient descent finds a minimum of a function by figuring out in which direction (in the space of the parameters $\theta$) the function’s slope is rising the most steeply, and moving in the opposite direction.
Logistic regression is an instance of supervised classification in which we know the correct label $y$ (either 0 or 1) for each observation $x$. The system produces/predicts $\hat{y}$, the estimate for the true $y$.