Recall: A better n-gram model is one that assigns a higher probability to the test data, and perplexity is the inverse probability of the test set, normalized by the number of words.
2020-08-03
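As a concrete check of this relationship, here is a minimal sketch (the function name `perplexity` is my own) that computes perplexity as the inverse probability of the test set, normalized by the number of words: $PP(W) = P(w_1 \dots w_N)^{-1/N}$, evaluated in log space for numerical stability.

```python
import math

def perplexity(log_probs):
    """Perplexity of a test set given per-word log probabilities.

    PP(W) = P(w_1..w_N)^(-1/N); computed in log space to avoid underflow.
    log_probs: list of natural-log probabilities, one per word.
    """
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

# Sanity check: a model assigning each of 4 words probability 1/4
# has perplexity 4 (it is "as confused" as a uniform 4-way choice).
print(perplexity([math.log(0.25)] * 4))  # → 4.0
```

A model that assigns the test data higher probability yields lower perplexity, which is why the two evaluation views agree.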
To keep a language model from assigning zero probability to these unseen events, we’ll have to shave off a bit of probability mass from some more frequent events and give it to the events we’ve never seen.
2020-08-03
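The note does not name a particular method, but one classic way to shave off probability mass for unseen events is Laplace (add-one) smoothing: pretend every event was seen once more than it actually was. A minimal sketch with hypothetical toy counts:

```python
from collections import Counter

def laplace_unigram(counts, vocab_size):
    """Add-one smoothed unigram probability: (c(w) + 1) / (N + V)."""
    total = sum(counts.values())
    def prob(word):
        return (counts.get(word, 0) + 1) / (total + vocab_size)
    return prob

counts = Counter({"the": 3, "cat": 1})   # toy counts, purely illustrative
p = laplace_unigram(counts, vocab_size=4)
print(p("cat"))   # (1 + 1) / (4 + 4) = 0.25
print(p("dog"))   # unseen word still gets (0 + 1) / (4 + 4) = 0.125
```

Note how "cat" drops from its unsmoothed 1/4-of-the-discounted-mass share while the unseen "dog" gains nonzero probability, which is exactly the redistribution described above.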
The n-gram model depends on the training corpus (like many statistical models). Implication: the probabilities often encode specific facts about a given training corpus. N-grams do a better and better job of modeling the training corpus as we increase the value of $N$.
2020-08-03
Extrinsic evaluation: the best way to evaluate the performance of a language model. Embed the LM in an application and measure how much the application improves. For speech recognition, we can compare the performance of two language models by running the speech recognizer twice, once with each language model, and seeing which gives the more accurate transcription.
2020-08-03

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.
2020-08-03
Language models (LMs): models that assign probabilities to sequences of words. N-gram: a sequence of N words. E.g., for "Please turn your homework …": a bigram (2-gram) is a two-word sequence like "please turn", "turn your", or "your homework"; a trigram (3-gram) is a three-word sequence like "please turn your" or "turn your homework". Motivation: $P(w|h)$ is the probability of a word $w$ given some history $h$.
2020-08-02
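The n-gram probabilities above are typically estimated by maximum likelihood from counts, e.g. for bigrams $P(w_n \mid w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})$. A minimal sketch (the helper name `bigram_mle` and the toy sentence are my own):

```python
from collections import Counter

def bigram_mle(tokens):
    """MLE bigram probabilities: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})."""
    unigrams = Counter(tokens[:-1])            # counts of history words
    bigrams = Counter(zip(tokens, tokens[1:]))  # counts of adjacent pairs
    return {(h, w): c / unigrams[h] for (h, w), c in bigrams.items()}

tokens = "please turn your homework in please turn in".split()
probs = bigram_mle(tokens)
print(probs[("please", "turn")])  # both occurrences of "please" precede "turn" → 1.0
print(probs[("turn", "your")])    # "turn" is followed by "your" 1 time out of 2 → 0.5
```

This unsmoothed estimator assigns zero probability to any bigram absent from the training data, which is what motivates the smoothing discussed in the companion note.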
TL;DR
- Two ways of counting words:
  - Number of wordform types (relationship between #Types and #Tokens: Heaps' law)
  - Number of lemmas
- Text normalization:
  - Tokenizing (segmenting) words: Byte-Pair Encoding (BPE), WordPiece
  - Normalizing word formats: case folding, lemmatization, stemming (the Porter stemmer)
  - Segmenting sentences

Definition: Corpus (pl. corpora)
2020-08-02
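Heaps' law, mentioned above, predicts vocabulary size (number of types) as a function of corpus length (number of tokens): $|V| = kN^\beta$, where $k$ and $\beta$ are corpus-dependent. A toy sketch (the parameter values here are illustrative, not from any particular corpus):

```python
def heaps_vocab(n_tokens, k=30, beta=0.5):
    """Heaps' law estimate of vocabulary size: |V| = k * N**beta.

    k and beta depend on the corpus; typical reported ranges are roughly
    30 <= k <= 100 and 0.4 <= beta <= 0.6. Defaults are illustrative only.
    """
    return k * n_tokens ** beta

print(heaps_vocab(1_000_000))  # 30 * sqrt(1e6) = 30000.0 types for 1M tokens
```

Because $\beta < 1$, vocabulary keeps growing with corpus size but ever more slowly, which is the qualitative point of the law.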
Definition: Minimum edit distance between two strings $:=$ the minimum number of editing operations (operations like insertion, deletion, substitution) needed to transform one string into the other. Example: the distance between intention and execution is 5 (delete an i, substitute e for n, substitute x for t, insert c, substitute u for n).
2020-08-02
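The distance defined above is computed with the standard dynamic-programming (Wagner–Fischer) table; a sketch with all three operations costing 1, which reproduces the distance of 5 for intention → execution:

```python
def min_edit_distance(source, target):
    """Minimum edit distance where insert, delete, and substitute all cost 1."""
    n, m = len(source), len(target)
    # dp[i][j] = distance between source[:i] and target[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                         # delete all i characters
    for j in range(m + 1):
        dp[0][j] = j                         # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution (or copy)
    return dp[n][m]

print(min_edit_distance("intention", "execution"))  # → 5
```

With a substitution cost of 2 instead (counting it as a delete plus an insert), the same pair would come out to 8; the example in the note assumes unit costs throughout.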
Regular Expressions: Regular expressions (REs) are particularly useful for searching in texts, when we have a pattern to search for and a corpus of texts to search through. Basic RE patterns: REs are case sensitive; /s/ is distinct from /S/.
2020-08-02
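A quick illustration of the case-sensitivity point using Python's `re` module (the sample sentence is made up):

```python
import re

text = "Some students study Statistics"

# REs are case sensitive: /s/ matches only lowercase s, /S/ only uppercase S.
print(re.findall(r"s", text))  # 5 lowercase matches
print(re.findall(r"S", text))  # → ['S', 'S']

# A character class [sS] (or the re.IGNORECASE flag) matches both cases.
print(len(re.findall(r"[sS]", text)))  # → 7
```

The character-class and flag-based variants are the usual ways to opt out of case sensitivity when a pattern should match both forms.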