NLP

Perplexity’s Relation to Entropy

Recall: a better n-gram model is one that assigns a higher probability to the test data, and perplexity is a normalized version of the inverse probability of the test set.
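
The relation can be sketched with a small helper (a hypothetical function, not from the post): perplexity is $PP(W) = P(w_1 \dots w_N)^{-1/N}$, computed here from per-word base-2 log probabilities.

```python
import math

def perplexity(log_probs):
    """Perplexity from per-word log2 probabilities (hypothetical helper):
    PP(W) = 2 ** (-(1/N) * sum(log2 P(w_i))).  Lower is better, since a
    higher test-set probability yields a lower perplexity."""
    n = len(log_probs)
    return 2 ** (-sum(log_probs) / n)

# Sanity check: a uniform model over 8 words assigns log2(1/8) = -3 to
# every word, so its perplexity is exactly the branching factor, 8.
uniform = [math.log2(1 / 8)] * 10
print(perplexity(uniform))  # → 8.0
```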

2020-08-03

Smoothing

To keep a language model from assigning zero probability to these unseen events, we’ll have to shave off a bit of probability mass from some more frequent events and give it to the events we’ve never seen.
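
As a minimal illustration of shaving off probability mass, here is add-one (Laplace) smoothing for unigrams — a sketch, with an assumed vocabulary size:

```python
from collections import Counter

def laplace_unigram_prob(word, counts, vocab_size):
    """Add-one (Laplace) smoothed unigram probability:
    P(w) = (count(w) + 1) / (N + V).  Seen words give up a little mass
    so that unseen words get a nonzero probability."""
    total = sum(counts.values())
    return (counts[word] + 1) / (total + vocab_size)

counts = Counter("the cat sat on the mat".split())
V = 10_000  # assumed vocabulary size for the sketch

p_unseen = laplace_unigram_prob("dog", counts, V)  # nonzero despite count 0
p_seen = laplace_unigram_prob("the", counts, V)
print(p_unseen > 0, p_seen > p_unseen)  # True True
```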

2020-08-03

Generalization and Zeros

Like many statistical models, the n-gram model depends on the training corpus. One implication is that the probabilities often encode specific facts about that training corpus; another is that n-grams do a better and better job of modeling the training corpus as we increase the value of $N$.

2020-08-03

Evaluating Language Models

Extrinsic evaluation is the best way to evaluate the performance of a language model: embed the LM in an application and measure how much the application improves. For speech recognition, for example, we can compare two language models by running the speech recognizer twice, once with each language model, and seeing which gives the more accurate transcription.

2020-08-03

Natural Language Processing (NLP)

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.

2020-08-03

N Gram

Language model (LM): a model that assigns probabilities to sequences of words. N-gram: a sequence of N words. E.g., from “Please turn your homework …”: a bigram (2-gram) is a two-word sequence such as “please turn”, “turn your”, or “your homework”; a trigram (3-gram) is a three-word sequence such as “please turn your” or “turn your homework”. Motivation: estimate $P(w|h)$, the probability of a word $w$ given some history $h$.
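
The bigram/trigram examples above can be reproduced with a short sliding-window function (a generic sketch, not code from the post):

```python
def ngrams(words, n):
    """All n-grams (length-n word sequences) from a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "please turn your homework".split()
print(ngrams(words, 2))  # bigrams:  ('please','turn'), ('turn','your'), ('your','homework')
print(ngrams(words, 3))  # trigrams: ('please','turn','your'), ('turn','your','homework')
```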

2020-08-02

Languages Modeling (N-Gram)

2020-08-02

Words and Text Normalization

TL;DR: Two ways of counting words: the number of wordform types (the relationship between #types and #tokens is given by Heaps’ Law) and the number of lemmas. Text normalization covers tokenizing (segmenting) words, including Byte-Pair Encoding (BPE) and WordPiece; normalizing word formats (case folding, lemmatization, stemming, e.g. the Porter stemmer); and segmenting sentences. Definition: corpus (pl.
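
The type/token distinction can be shown in two lines (a trivial sketch using whitespace tokenization):

```python
# Tokens are running words; types are distinct wordforms.
tokens = "the cat sat on the mat".split()
types = set(tokens)
print(len(tokens), len(types))  # 6 tokens, 5 types ("the" repeats)
```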

2020-08-02

Minimum Edit Distance

Definition Minimum edit distance between two strings $:=$ the minimum number of editing operations (operations like insertion, deletion, substitution) needed to transform one string into another. Example The gap between intention and execution, for example, is 5 (delete an i, substitute e for n, substitute x for t, insert c, substitute u for n).
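
The standard dynamic-programming solution can be sketched as follows (unit costs for insertion, deletion, and substitution, as in the example above):

```python
def min_edit_distance(source, target):
    """Minimum edit distance (Levenshtein) via dynamic programming.
    d[i][j] = distance between source[:i] and target[:j]."""
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete all of source[:i]
    for j in range(m + 1):
        d[0][j] = j  # insert all of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution (or copy)
    return d[n][m]

print(min_edit_distance("intention", "execution"))  # → 5
```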

2020-08-02

Regular Expressions

Regular expressions (REs) are particularly useful for searching in texts, when we have a pattern to search for and a corpus of texts to search through. Basic RE patterns are case sensitive: /s/ is distinct from /S/.
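
The case-sensitivity point can be checked directly with Python’s `re` module (an illustrative sketch; the sentence is made up):

```python
import re

text = "Sam saw seven seals"
# /s/ is case sensitive: it matches only the lowercase occurrences.
print(len(re.findall(r"s", text)))     # 4: in "saw", "seven", "seals" (x2)
# A character class /[sS]/ matches both cases, picking up the "S" in "Sam".
print(len(re.findall(r"[sS]", text)))  # 5
```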

2020-08-02