2020-08-03
Sequence model/classifier: assigns a label or class to each unit in a sequence, mapping a sequence of observations to a sequence of labels. The Hidden Markov Model (HMM) is a probabilistic sequence model.
Parts of speech (a.k.a. POS, word classes, or syntactic categories) are useful because they reveal a lot about a word and its neighbors. E.g., knowing whether a word is a noun or a verb tells us about likely neighboring words.
Logistic Regression (in NLP) In natural language processing, logistic regression is the baseline supervised machine learning algorithm for classification, and it also has a very close relationship with neural networks. Generative and Discriminative Classifier The most important difference between naive Bayes and logistic regression is that naive Bayes is a generative classifier, while logistic regression is a discriminative classifier.
Supervised classification Input: $x = (x_1, x_2, \dots, x_n)^T$ Output: $y \in \{0, 1\}$ Parameters: Weights: $w = (w_1, w_2, \dots, w_n)^T$ Bias: $b$ Prediction $$ z = w \cdot x + b \\ P(y=1|x)=\sigma(z) = \frac{1}{1+e^{-z}}\\ y=\left\{\begin{array}{ll} 1 & \text { if } P(y=1 | x)>0.5 \\ 0 & \text { otherwise } \end{array}\right. $$
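The decision rule above can be sketched in a few lines of plain Python (a minimal illustration; the function names `sigmoid` and `predict` are my own, not from the original notes):

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps a real-valued score z to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, b):
    """Binary logistic-regression decision: 1 if P(y=1|x) > 0.5, else 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b   # z = w . x + b
    return 1 if sigmoid(z) > 0.5 else 0
```

Note that $\sigma(z) > 0.5$ exactly when $z > 0$, so the decision boundary is the hyperplane $w \cdot x + b = 0$.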
Motivation More than two classes? Use multinomial logistic regression (also called softmax regression, or the maxent classifier). The target $y$ is a variable that ranges over more than two classes; we want to know the probability of $y$ being in each potential class $c \in C$, i.e. $p(y=c|x)$.
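The softmax function that gives multinomial logistic regression its name turns a vector of per-class scores into a probability distribution. A minimal sketch (the max-subtraction trick is a standard numerical-stability device, not something stated in the notes):

```python
import math

def softmax(scores):
    """Map a list of class scores z_c to probabilities p(y=c|x)."""
    m = max(scores)                         # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The outputs are all positive and sum to 1, and the highest-scoring class receives the highest probability.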
Overfitting 🔴 Problem with learning weights that make the model perfectly match the training data: If a feature is perfectly predictive of the outcome because it happens to only occur in one class, it will be assigned a very high weight.
Overview 🎯 Goal with gradient descent: find the optimal weights that minimize the loss function we’ve defined for the model. From now on, we’ll explicitly represent the fact that the loss function $L$ is parameterized by the weights $\theta$ (in the case of logistic regression $\theta=(w, b)$): $$ \hat{\theta}=\underset{\theta}{\operatorname{argmin}} \frac{1}{m} \sum_{i=1}^{m} L_{C E}\left(y^{(i)}, x^{(i)} ; \theta\right) $$ Gradient descent finds a minimum of a function by figuring out in which direction (in the space of the parameters $\theta$) the function’s slope is rising the most steeply, and moving in the opposite direction.
Logistic regression is an instance of supervised classification in which we know the correct label $y$ (either 0 or 1) for each observation $x$. The system produces/predicts $\hat{y}$, the estimate for the true $y$.