TL;DR | Two ways for counting words | Number of wordform types | Relationship between #Types and #Tokens: Heaps' law | Number of lemmas | Text Normalization | Tokenizing (segmenting) words | Byte-pair Encoding (BPE) | Wordpiece | Normalizing word formats | Word normalisation | Case folding | Lemmatization | Stemming | Porter Stemmer | Segmenting sentences | Definition | Corpus (pl.
2020-08-02
Definition: The minimum edit distance between two strings $:=$ the minimum number of editing operations (insertion, deletion, substitution) needed to transform one string into the other. Example: The distance between intention and execution is 5 when every operation costs 1 (delete an i, substitute e for n, substitute x for t, insert c, substitute u for n).
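The definition above can be sketched with the standard dynamic-programming table. This is a minimal illustration, not code from the post: the function name `min_edit_distance` and the `sub_cost` parameter are assumptions made here, and insertions and deletions are assumed to cost 1 each.

```python
def min_edit_distance(source: str, target: str, sub_cost: int = 1) -> int:
    """Return the minimum edit distance between source and target.

    Insertion and deletion cost 1 each; substitution costs sub_cost
    (an illustrative parameter, 1 by default).
    """
    n, m = len(source), len(target)
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # delete all i characters of the source prefix
    for j in range(1, m + 1):
        dist[0][j] = j  # insert all j characters of the target prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                            # deletion
                dist[i][j - 1] + 1,                            # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # substitution or match
            )
    return dist[n][m]


print(min_edit_distance("intention", "execution"))  # 5
```

Calling it with `sub_cost=2` gives the variant in which a substitution counts as a deletion plus an insertion; under that weighting the same pair has distance 8.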
2020-08-02
Regular Expressions: Regular expressions (REs) are particularly useful for searching in texts, when we have a pattern to search for and a corpus of texts to search through. Basic RE Patterns: REs are case sensitive, so /s/ is distinct from /S/.
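To make the case-sensitivity point concrete, here is a small sketch using Python's built-in `re` module. The choice of Python and the sample string are assumptions for illustration; the post itself writes patterns in the /.../ notation.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Sentences and segments"

lower = re.findall(r"s", text)      # matches only lowercase s
upper = re.findall(r"S", text)      # matches only uppercase S
either = re.findall(r"[sS]", text)  # character class: matches both cases
ignore = re.findall(r"s", text, flags=re.IGNORECASE)  # same effect via a flag

print(len(lower), len(upper), len(either), len(ignore))  # 3 1 4 4
```

The character class `[sS]` is the pattern-level way to match both cases, while `re.IGNORECASE` switches off case sensitivity for the whole pattern.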
2020-08-02