<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Regular Expressions | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/regular-expressions/</link><atom:link href="https://haobin-tan.netlify.app/tags/regular-expressions/index.xml" rel="self" type="application/rss+xml"/><description>Regular Expressions</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sun, 02 Aug 2020 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>Regular Expressions</title><link>https://haobin-tan.netlify.app/tags/regular-expressions/</link></image><item><title>Regular Expressions</title><link>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/re-text_normalization-edit_distance/02_1-regular_expressions/</link><pubDate>Sun, 02 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/re-text_normalization-edit_distance/02_1-regular_expressions/</guid><description>&lt;h2 id="regular-expressions">Regular Expressions&lt;/h2>
&lt;p>&lt;strong>Regular expressions (REs)&lt;/strong> are particularly useful for searching in texts: we have a pattern to search for and a corpus of texts to search through.&lt;/p>
&lt;h2 id="basic-re-patterns">Basic RE Patterns&lt;/h2>
&lt;h3 id="case-sensitive">&lt;strong>Case sensitive&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;code>/s/&lt;/code> is distinct from &lt;code>/S/&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>/woodchucks/&lt;/code> will NOT match the string &lt;code>/Woodchucks/&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Disjunction&lt;/strong> of characters: &lt;code>[]&lt;/code>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-31%2015.08.12.png" alt="截屏2020-05-31 15.08.12" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
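&lt;p>These behaviors are easy to check in Python’s &lt;code>re&lt;/code> module:&lt;/p>

```python
import re

# Case sensitive: /woodchucks/ does NOT match "Woodchucks"
assert re.search(r"woodchucks", "Woodchucks") is None

# Disjunction with a character class: [wW] matches either case
assert re.search(r"[wW]oodchucks", "Woodchucks") is not None
assert re.search(r"[wW]oodchucks", "woodchucks") is not None

# [1234567890] matches any single digit
m = re.search(r"[1234567890]", "plenty of 7 to 5")
assert m.group() == "7"
```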
&lt;h3 id="specify-range--">Specify &lt;strong>range&lt;/strong>: &lt;code>-&lt;/code>&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;code>/[2-5]/&lt;/code>: any one of the characters &lt;em>2, 3, 4, or 5&lt;/em>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>/[b-g]/&lt;/code>: one of the characters &lt;em>b&lt;/em>, &lt;em>c&lt;/em>, &lt;em>d&lt;/em>, &lt;em>e&lt;/em>, &lt;em>f&lt;/em>, or &lt;em>g&lt;/em>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-31%2015.10.03.png" alt="截屏2020-05-31 15.10.03" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
&lt;h3 id="not-be-">&lt;strong>Not be&lt;/strong>: &lt;code>^&lt;/code>&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>If the caret &lt;code>^&lt;/code> is the first symbol after the open square bracket &lt;code>[&lt;/code>, the resulting pattern is negated.&lt;/p>
&lt;ul>
&lt;li>&lt;code>/[^a]/&lt;/code> matches any single character (including special characters) except &lt;em>a&lt;/em>.&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-31%2015.13.08.png" alt="截屏2020-05-31 15.13.08" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
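&lt;p>Ranges and negation, checked in Python’s &lt;code>re&lt;/code>:&lt;/p>

```python
import re

# Range: [2-5] is any one of 2, 3, 4, or 5
assert re.findall(r"[2-5]", "1 2 3 6") == ["2", "3"]

# Negation: [^a] matches any single character except "a"
assert re.search(r"[^a]", "aab").group() == "b"

# A caret NOT in first position has no special meaning:
# [a^b] matches "a", "^", or "b"
assert re.search(r"[a^b]", "look up ^ now").group() == "^"
```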
&lt;h3 id="optionality-of-the-previous-char-">&lt;strong>Optionality&lt;/strong> of the previous char: &lt;code>?&lt;/code>&lt;/h3>
&lt;ul>
&lt;li>“the preceding character or nothing” or &amp;ldquo;zero or one instances of the previous character&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-31%2015.15.33.png" alt="截屏2020-05-31 15.15.33" style="zoom:80%;" />
&lt;h3 id="zero-or-more--the-kleene-">&lt;strong>Zero or more&lt;/strong>: &lt;code>*&lt;/code> (the Kleene *)&lt;/h3>
&lt;ul>
&lt;li>“zero or more occurrences of the immediately previous character or regular expression”
&lt;ul>
&lt;li>&lt;code>/a*/&lt;/code> means “any string of zero or more &lt;em>a&lt;/em>s”
&lt;ul>
&lt;li>Will match &lt;em>a&lt;/em> or &lt;em>aaaaaa&lt;/em>&lt;/li>
&lt;li>Also match &lt;em>Off Minor&lt;/em> (since the string &lt;em>Off Minor&lt;/em> has zero &lt;em>a&lt;/em>’s)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="one-or-more--the-kleene-">&lt;strong>One or more&lt;/strong>: &lt;code>+&lt;/code> (the Kleene +)&lt;/h3>
&lt;ul>
&lt;li>&amp;ldquo;at least one&amp;rdquo; of some character (“one or more occurrences of the immediately preceding character or regular expression”)&lt;/li>
&lt;li>&lt;code>/[0-9]+/&lt;/code> &lt;em>is the normal way to specify “a sequence of digits”&lt;/em>&lt;/li>
&lt;/ul>
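&lt;p>The three quantifiers side by side in Python’s &lt;code>re&lt;/code>:&lt;/p>

```python
import re

# ? : zero or one of the preceding character
assert re.search(r"woodchucks?", "woodchuck") is not None
assert re.search(r"colou?r", "color") is not None

# * : zero or more (Kleene star); a* also matches the empty string
assert re.search(r"a*", "Off Minor").group() == ""  # zero a's, empty match

# + : one or more (Kleene +); [0-9]+ is a sequence of digits
assert re.search(r"[0-9]+", "room 237").group() == "237"
```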
&lt;h3 id="wildcard-expression-">&lt;strong>Wildcard&lt;/strong> expression: &lt;code>.&lt;/code>&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>matches any single character (&lt;em>except&lt;/em> a carriage return)&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-31%2015.21.38.png" alt="截屏2020-05-31 15.21.38" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>Often used together with the Kleene star &lt;code>*&lt;/code> to mean “any string of characters”&lt;/p>
&lt;ul>
&lt;li>E.g. suppose we want to find any line in which a particular word, for example, &lt;em>aardvark&lt;/em>, appears twice. We can specify this with &lt;code>/aardvark.*aardvark/&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
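&lt;p>The wildcard, and the common &lt;code>.*&lt;/code> idiom, in Python:&lt;/p>

```python
import re

# . matches any single character (except a newline)
assert re.search(r"beg.n", "begin") is not None
assert re.search(r"beg.n", "began") is not None
assert re.search(r"beg.n", "begun") is not None

# .* : "any string of characters"; find a line where aardvark appears twice
line = "the aardvark saw another aardvark"
assert re.search(r"aardvark.*aardvark", line) is not None
```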
&lt;h3 id="anchors">&lt;strong>Anchors&lt;/strong>&lt;/h3>
&lt;p>&lt;strong>special characters that anchor regular expressions to particular places in a string&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;code>^&lt;/code>: start of a line&lt;/p>
&lt;ul>
&lt;li>&lt;code>/^The/&lt;/code> matches the word &lt;em>The&lt;/em> only at the start of a line.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;code>$&lt;/code>: end of the line&lt;/p>
&lt;ul>
&lt;li>&lt;code>/^The dog\.$/&lt;/code> matches a line that contains only the phrase &lt;em>The dog&lt;/em>.
&lt;ul>
&lt;li>(We have to use the backslash here since we want the . to mean “period” and not the wildcard)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;code>\b&lt;/code>: word boundary&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;code>/\bthe\b/&lt;/code> matches the word &lt;em>the&lt;/em> but not the word &lt;em>other&lt;/em>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>A “word” for the purposes of a regular expression is defined as any sequence of digits, underscores, or letters (based on the definition of “words” in programming languages)&lt;/p>
&lt;p>E.g., &lt;code>/\b99\b/&lt;/code> will&lt;/p>
&lt;ul>
&lt;li>match the string &lt;em>99&lt;/em> in &lt;em>There are 99 bottles of beer on the wall&lt;/em> (because 99 follows a space) ✅&lt;/li>
&lt;li>but NOT &lt;em>99&lt;/em> in &lt;em>There are 299 bottles of beer on the wall&lt;/em> (since 99 follows a number) ❌&lt;/li>
&lt;li>match &lt;em>99&lt;/em> in &lt;em>$99&lt;/em> (since 99 follows a dollar sign ($), which is not a digit, underscore, or letter) ✅&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;code>\B&lt;/code>: non-boundary&lt;/p>
&lt;/li>
&lt;/ul>
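&lt;p>Anchors and word boundaries, checked in Python (covering the ✅/❌ cases above):&lt;/p>

```python
import re

# ^ and $ anchor the match to the start and end of the string (or line)
assert re.search(r"^The", "The dog barked") is not None
assert re.search(r"^The", "See The dog") is None
assert re.search(r"^The dog\.$", "The dog.") is not None

# \b : word boundary ("word" = letters, digits, underscore)
assert re.search(r"\bthe\b", "other words") is None
assert re.search(r"\b99\b", "There are 99 bottles") is not None
assert re.search(r"\b99\b", "There are 299 bottles") is None
assert re.search(r"\b99\b", "price: $99") is not None  # $ is not a word char
```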
&lt;h2 id="disjunction-grouping-and-precedence">Disjunction, Grouping, and Precedence&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Disjunction&lt;/strong> operator/&lt;strong>pipe&lt;/strong> symbol: &lt;code>|&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>/cat|dog/&lt;/code> matches either the string &lt;em>cat&lt;/em> or the string &lt;em>dog&lt;/em>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Parenthesis operator: &lt;code>(&lt;/code> and &lt;code>)&lt;/code>&lt;/p>
&lt;ul>
&lt;li>Make the disjunction operator apply only to a specific pattern
&lt;ul>
&lt;li>&lt;code>/gupp(y|ies)/&lt;/code> matches either &lt;em>guppy&lt;/em> or &lt;em>guppies&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Groups the whole pattern
&lt;ul>
&lt;li>Suppose we have a line with column labels of the form &lt;em>Column 1 Column 2 Column 3&lt;/em>. With the parentheses, we can write &lt;code>/(Column␣[0-9]+␣*)*/&lt;/code>: the Kleene star now applies to the whole parenthesized pattern, matching the word &lt;em>Column&lt;/em>, a number, and optional spaces, repeated zero or more times&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Operator precedence hierarchy&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The following table gives the order of RE operator precedence, from highest precedence to lowest precedence&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-31%2023.09.49.png" alt="截屏2020-05-31 23.09.49" style="zoom: 67%;" />
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Greedy and non-greedy matching&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Greedy&lt;/strong>: expanding to cover as much of a string as they can (always match the &lt;em>largest&lt;/em> string they can)&lt;/li>
&lt;li>&lt;strong>Non-greedy&lt;/strong>: matches as little text as possible
&lt;ul>
&lt;li>Append the &lt;code>?&lt;/code> qualifier to a quantifier to enforce non-greedy matching&lt;/li>
&lt;li>&lt;code>*?&lt;/code>: zero or more, matching as little as possible&lt;/li>
&lt;li>&lt;code>+?&lt;/code>: one or more, matching as little as possible&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
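&lt;p>Disjunction scope and greedy vs. non-greedy matching in Python:&lt;/p>

```python
import re

# | : disjunction; () limits its scope
assert re.search(r"gupp(y|ies)", "guppies").group() == "guppies"
assert re.search(r"gupp(y|ies)", "guppy").group() == "guppy"

# Greedy vs. non-greedy: * grabs as much as it can, *? as little
text = "once upon a time"
assert re.search(r"o.*n", text).group() == "once upon"
assert re.search(r"o.*?n", text).group() == "on"
```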
&lt;h2 id="example">Example&lt;/h2>
&lt;p>Suppose we wanted to write a RE to find cases of the English article &lt;em>the&lt;/em>.&lt;/p>
&lt;p>A simple (but incorrect) pattern might be: &lt;code>/the/&lt;/code>&lt;/p>
&lt;ul>
&lt;li>&lt;span style="color:red">Problem: this pattern will miss the word when it begins a sentence and hence is capitalized (i.e., &lt;em>The&lt;/em>)&lt;/span>&lt;/li>
&lt;/ul>
&lt;p>This might lead us to the following pattern: &lt;code>/[tT]he/&lt;/code>&lt;/p>
&lt;ul>
&lt;li>&lt;span style="color:red">Problem: still incorrectly return texts with the embedded in other words (e.g., &lt;em>other&lt;/em> or &lt;em>theology&lt;/em>).&lt;/span>&lt;/li>
&lt;/ul>
&lt;p>We need to specify that we want instances with a word boundary on both sides: &lt;code>/\b[tT]he\b/&lt;/code>&lt;/p>
&lt;p>Suppose we wanted to do this without the use of &lt;code>/\b/&lt;/code>, since &lt;code>/\b/&lt;/code> won’t treat underscores and numbers as word boundaries; but we might want to find &lt;em>the&lt;/em> in some context where it might also have underscores or numbers nearby (&lt;em>the_&lt;/em> or &lt;em>the25&lt;/em>). We need to specify that there are no alphabetic letters on either side of the &lt;em>the&lt;/em>: &lt;code>/[^a-zA-Z][tT]he[^a-zA-Z]/&lt;/code>&lt;/p>
&lt;ul>
&lt;li>&lt;span style="color:red">Problem: it won’t find the word &lt;em>the&lt;/em> when it begins a line.&lt;/span>&lt;/li>
&lt;/ul>
&lt;p>We can avoid this by specifying that before the &lt;em>the&lt;/em> we require &lt;em>either&lt;/em> the beginning-of-line or a non-alphabetic character, and the same at the end of the line:&lt;/p>
&lt;p>&lt;code>/(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/&lt;/code>&lt;/p>
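&lt;p>Checking this final pattern in Python:&lt;/p>

```python
import re

# "the" or "The" with a non-letter (or line start/end) on both sides
the_re = re.compile(r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)")

assert the_re.search("The dog chased the cat.") is not None
assert the_re.search("other theology") is None
assert the_re.search("the") is not None              # begins (and ends) the line
assert the_re.search("the_25 things") is not None    # underscore does not block it
```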
&lt;blockquote>
&lt;p>The process we just went through was based on fixing two kinds of errors:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>false positives&lt;/strong>, strings that we incorrectly matched like &lt;em>other&lt;/em> or &lt;em>there&lt;/em>,&lt;/li>
&lt;li>&lt;strong>false negatives&lt;/strong>, strings that we incorrectly missed, like &lt;em>The&lt;/em>.&lt;/li>
&lt;/ul>
&lt;/blockquote>
&lt;h2 id="more-operators">More operators&lt;/h2>
&lt;h3 id="common-sets-of-characters">Common sets of characters&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-31%2023.28.13.png" alt="截屏2020-05-31 23.28.13" style="zoom:80%;" />
&lt;h3 id="counting">Counting&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-31%2023.28.40.png" alt="截屏2020-05-31 23.28.40" style="zoom:80%;" />
&lt;h3 id="special-characters-based-on-the-backslash-">Special characters based on the backslash (&lt;code>\&lt;/code>)&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-31%2023.29.30.png" alt="截屏2020-05-31 23.29.30" style="zoom:80%;" />
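&lt;p>A few of these operators in Python: the aliases &lt;code>\d&lt;/code>, &lt;code>\w&lt;/code>, &lt;code>\s&lt;/code> and the counting braces:&lt;/p>

```python
import re

# Aliases: \d = [0-9], \w = [a-zA-Z0-9_], \s = whitespace
assert re.findall(r"\d", "Dec 25") == ["2", "5"]
assert re.search(r"\w+", "part_2!").group() == "part_2"

# Counting: {n}, {n,m}, {n,}
assert re.search(r"a{2}", "aaa").group() == "aa"
assert re.search(r"[0-9]{3,4}", "year 1999!").group() == "1999"
assert re.search(r"x{2,}", "xxxxx").group() == "xxxxx"
```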
&lt;h2 id="substitution-capture-groups">Substitution, Capture Groups&lt;/h2>
&lt;p>&lt;strong>Substitution&lt;/strong> operator: &lt;code>s/regexp/pattern/&lt;/code>&lt;/p>
&lt;ul>
&lt;li>Allows a string characterized by a regular expression to be replaced by another string&lt;/li>
&lt;/ul>
&lt;p>To refer to a particular subpart of the string matched by the first pattern:&lt;/p>
&lt;ul>
&lt;li>we put parentheses ( and ) around the first pattern and use the number operator &lt;code>\1&lt;/code> in the second pattern to refer back&lt;/li>
&lt;li>Example
&lt;ul>
&lt;li>suppose we wanted to put angle brackets around all integers in a text, for example, changing &lt;em>the 35 boxes&lt;/em> to &lt;em>the&lt;/em> &amp;lt;&lt;em>35&lt;/em>&amp;gt; &lt;em>boxes&lt;/em>.&lt;/li>
&lt;li>We can implement like this: &lt;code>s/([0-9]+)/&amp;lt;\1&amp;gt;/&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
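&lt;p>The same substitution in Python’s &lt;code>re.sub&lt;/code> (square brackets stand in here for the angle brackets of the example):&lt;/p>

```python
import re

# Put brackets around all integers; \1 refers back to the captured digits
# (square brackets stand in for the angle brackets used in the text)
result = re.sub(r"([0-9]+)", r"[\1]", "the 35 boxes")
assert result == "the [35] boxes"
```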
&lt;p>The parenthesis and number operators can also specify that a certain string or expression must occur twice in the text.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>E.g.: suppose we are looking for the pattern “the Xer they were, the Xer they will be”, where we want to constrain the two X’s to be the same string&lt;/p>
&lt;/li>
&lt;li>
&lt;p>We do this by surrounding the first X with the parenthesis operator, and replacing the second X with the number operator &lt;code>\1&lt;/code>&lt;/p>
&lt;p>&lt;code>/the (.*)er they were, the \1er they will be/&lt;/code>&lt;/p>
&lt;ul>
&lt;li>Here the &lt;code>\1&lt;/code> will be replaced by whatever string matched the first item in parentheses.&lt;/li>
&lt;li>So this will match &lt;em>the bigger they were, the bigger they will be&lt;/em> but not &lt;em>the bigger they were, the faster they will be&lt;/em>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
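&lt;p>The back-reference example, checked in Python:&lt;/p>

```python
import re

pattern = r"the (.*)er they were, the \1er they will be"

# \1 must repeat whatever the first group matched
assert re.search(pattern, "the bigger they were, the bigger they will be") is not None
assert re.search(pattern, "the bigger they were, the faster they will be") is None
```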
&lt;p>This use of parentheses to store a pattern in memory is called a &lt;strong>capture group&lt;/strong>. Every time a capture group is used (i.e., parentheses surround a pattern), the resulting match is stored in a &lt;em>numbered&lt;/em> &lt;strong>register&lt;/strong>: the first capture group is stored in &lt;code>\1&lt;/code>, the second in &lt;code>\2&lt;/code>, the third in &lt;code>\3&lt;/code>, the fourth in &lt;code>\4&lt;/code>, and so on.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>E.g.: &lt;code>/the (.*)er they (.*), the \1er we \2/&lt;/code>&lt;/p>
&lt;p>will match &lt;em>the faster they ran, the faster we ran&lt;/em> but not &lt;em>the faster they ran, the faster we ate&lt;/em>.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Parentheses thus have a double function in regular expressions&lt;/p>
&lt;ul>
&lt;li>they are used to group terms for specifying the order in which operators should apply&lt;/li>
&lt;li>they are used to capture something in a register&lt;/li>
&lt;/ul>
&lt;p>Sometimes we might want to use parentheses for grouping but do NOT want to capture the resulting pattern in a register. In that case we use a &lt;strong>non-capturing group&lt;/strong>, specified by putting &lt;code>?:&lt;/code> after the open paren, in the form &lt;code>(?:pattern)&lt;/code>.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>E.g.:&lt;/p>
&lt;p>&lt;code>/(?:some|a few) (people|cats) like some \1/&lt;/code>&lt;/p>
&lt;p>will match &lt;em>some cats like some cats&lt;/em> but not &lt;em>some cats like some a few&lt;/em>.&lt;/p>
&lt;/li>
&lt;/ul></description></item><item><title>Minimum Edit Distance</title><link>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/re-text_normalization-edit_distance/02_3-min_edit_distance/</link><pubDate>Sun, 02 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/re-text_normalization-edit_distance/02_3-min_edit_distance/</guid><description>&lt;h2 id="definition">Definition&lt;/h2>
&lt;p>&lt;strong>Minimum edit distance&lt;/strong> between two strings: the minimum number of editing operations (insertion, deletion, substitution) needed to transform one string into the other.&lt;/p>
&lt;p>Example&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The gap between &lt;em>intention&lt;/em> and &lt;em>execution&lt;/em>, for example, is 5 (delete an &lt;code>i&lt;/code>, substitute &lt;code>e&lt;/code> for &lt;code>n&lt;/code>, substitute &lt;code>x&lt;/code> for &lt;code>t&lt;/code>, insert &lt;code>c&lt;/code>, substitute &lt;code>u&lt;/code> for &lt;code>n&lt;/code>).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Visualization&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-06-01%2013.03.34.png" alt="截屏2020-06-01 13.03.34" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
&lt;h2 id="levenshtein-distance">Levenshtein distance&lt;/h2>
&lt;p>Original version:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Each of the three operations (insertion, deletion, substitution) has a cost of 1&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The substitution of a letter for itself (E.g., &lt;code>t&lt;/code> for &lt;code>t&lt;/code>), has zero cost.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The Levenshtein distance between &lt;em>intention&lt;/em> and &lt;em>execution&lt;/em> is 5&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Alternative version:&lt;/p>
&lt;ul>
&lt;li>Insertion or deletion has a cost of 1&lt;/li>
&lt;li>Substitution has a cost of 2 (since any substitution can be represented by one insertion and one deletion)&lt;/li>
&lt;li>Using this version, the Levenshtein distance between &lt;em>intention&lt;/em> and &lt;em>execution&lt;/em> is 8.&lt;/li>
&lt;/ul>
&lt;h2 id="the-minimum-edit-distance-algorithm">The Minimum Edit Distance Algorithm&lt;/h2>
&lt;p>How do we find the minimum edit distance?&lt;/p>
&lt;p>💡 Think of this as a search task, in which we are searching for the &lt;strong>shortest path&lt;/strong>—a sequence of edits—from one string to another.&lt;/p>
&lt;ul>
&lt;li>Just remember the shortest path to a state each time we see it.
&lt;ul>
&lt;li>We can do this by using &lt;strong>dynamic programming&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="dynamic-programming">&lt;strong>Dynamic programming&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>💡 Intuition: a large problem can be solved by properly combining the solutions to various sub-problems&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Apply a table-driven method to solve problems by combining solutions to sub-problems&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Example: Consider the shortest path of transformed words that represents the minimum edit distance between the strings &lt;em>intention&lt;/em> and &lt;em>execution&lt;/em>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-06-02%2009.34.39.png" alt="截屏2020-06-02 09.34.39" style="zoom:70%;" />
&lt;blockquote>
&lt;p>Imagine some string (perhaps it is &lt;em>exention&lt;/em>) that is in this optimal path (whatever it is). The intuition of dynamic programming is that if &lt;em>exention&lt;/em> is in the optimal operation list, then the optimal sequence must also include the optimal path from &lt;em>intention&lt;/em> to &lt;em>exention&lt;/em>. Why? If there were a shorter path from &lt;em>intention&lt;/em> to &lt;em>exention&lt;/em>, then we could use it instead, resulting in a shorter overall path, and the optimal sequence wouldn’t be optimal, thus leading to a contradiction.&lt;/p>
&lt;/blockquote>
&lt;/li>
&lt;/ul>
&lt;h3 id="minimum-edit-distance-algorithm">Minimum Edit Distance Algorithm&lt;/h3>
&lt;p>Define the minimum edit distance between two strings:&lt;/p>
&lt;ul>
&lt;li>Given:
&lt;ul>
&lt;li>Source string $X$ of length $n$&lt;/li>
&lt;li>Target string $Y$ of length $m$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>$D[i, j]:=$ edit distance between $X[1..i]$ and $Y[1..j]$ (the first $i$ characters of $X$ and the first $j$ characters of $Y$)&lt;/li>
&lt;li>Thus, the edit distance between $X$ and $Y$ is $D[n, m]$&lt;/li>
&lt;/ul>
&lt;p>We’ll use dynamic programming to compute $D[n, m]$ &lt;strong>bottom up&lt;/strong>, combining solutions to subproblems.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Base case:&lt;/p>
&lt;ul>
&lt;li>A source substring of length $i$ and an empty target: going from $i$ characters to 0 requires $i$ deletions, so $D[i, 0] = i$&lt;/li>
&lt;li>An empty source and a target substring of length $j$: going from 0 characters to $j$ characters requires $j$ insertions, so $D[0, j] = j$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Having computed $D[i,j]$ for small $i, j$, we then compute larger $D[i,j]$ based on previously computed smaller values. The value of $D[i,j]$ is computed by taking the minimum of the three possible paths through the matrix which arrive there:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200802235633719.png" alt="image-20200802235633719" style="zoom:15%;" />
&lt;p>If we assume the version of Levenshtein distance in which the insertions and deletions each have a cost of 1 ($\operatorname{ins-cost}(\cdot)=\operatorname{del-cost}(\cdot)=1$), and substitutions have a cost of 2 (except that substitution of identical letters has zero cost), the computation for $D[i,j]$ becomes:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200802235915637.png" alt="image-20200802235915637" style="zoom:15%;" />
&lt;/li>
&lt;/ul>
&lt;h3 id="pseudocode">Pseudocode&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-06-02%2010.28.09.png" alt="截屏2020-06-02 10.28.09" style="zoom:80%;" />
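&lt;p>A direct Python translation of the pseudocode (a sketch; &lt;code>sub_cost&lt;/code> defaults to the 2-cost version used above):&lt;/p>

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    """Compute D[n, m] bottom-up, as in the pseudocode."""
    n, m = len(source), len(target)
    # D[i][j] = edit distance between source[:i] and target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # base case: i deletions
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, m + 1):          # base case: j insertions
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + del_cost,      # deletion
                          D[i][j - 1] + ins_cost,      # insertion
                          D[i - 1][j - 1] + sub)       # substitution
    return D[n][m]

# 2-cost substitutions: distance 8; 1-cost (original Levenshtein): distance 5
assert min_edit_distance("intention", "execution") == 8
assert min_edit_distance("intention", "execution", sub_cost=1) == 5
```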
&lt;h3 id="example">Example&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-06-02%2010.30.04.png" alt="截屏2020-06-02 10.30.04" style="zoom:70%;" />
&lt;h2 id="minimum-cost-alignment">Minimum Cost Alignment&lt;/h2>
&lt;p>With a small change, the edit distance algorithm can also provide the minimum cost &lt;strong>alignment&lt;/strong> between two strings.&lt;/p>
&lt;p>To extend the edit distance algorithm to produce an alignment, we can start by visualizing an alignment as a path through the edit distance matrix.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-06-02%2010.43.24.png" alt="截屏2020-06-02 10.43.24" style="zoom:75%;" />
&lt;ul>
&lt;li>Boldfaced cell: represents an alignment of a pair of letters in the two strings.
&lt;ul>
&lt;li>If two boldfaced cells occur in the same row, there will be an insertion in going from the source to the target&lt;/li>
&lt;li>If two boldfaced cells occur in the same column, there will be a deletion.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Computation:&lt;/p>
&lt;ol>
&lt;li>we augment the minimum edit distance algorithm to store backpointers in each cell.
&lt;ul>
&lt;li>The backpointer from a cell points to the previous cell (or cells) that we came from in entering the current cell.&lt;/li>
&lt;li>Some cells have multiple backpointers because the minimum extension could have come from multiple previous cells.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>we perform a &lt;strong>backtrace&lt;/strong>.
&lt;ul>
&lt;li>we start from the last cell (at the final row and column), and follow the pointers back through the dynamic programming matrix. Each complete path between the final cell and the initial cell is a minimum distance alignment.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol></description></item><item><title>Words and Text Normalization</title><link>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/re-text_normalization-edit_distance/02_2-words_and_text_normalization/</link><pubDate>Sun, 02 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/re-text_normalization-edit_distance/02_2-words_and_text_normalization/</guid><description>&lt;h2 id="tldr">TL;DR&lt;/h2>
&lt;ul>
&lt;li>Two ways for counting words
&lt;ul>
&lt;li>Number of wordform types
&lt;ul>
&lt;li>Relationship between #Types and #Tokens: &lt;strong>Heaps&amp;rsquo; Law&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Number of lemmas&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Text Normalization
&lt;ol>
&lt;li>Tokenizing (segmenting) words
&lt;ul>
&lt;li>Byte-Pair Encoding (BPE)&lt;/li>
&lt;li>Wordpiece&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Normalizing word formats
&lt;ul>
&lt;li>Word normalization
&lt;ul>
&lt;li>case folding&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Lemmatization&lt;/li>
&lt;li>Stemming
&lt;ul>
&lt;li>Porter Stemmer&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Segmenting sentences&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
&lt;h2 id="definition">Definition&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Corpus&lt;/strong> (pl. &lt;strong>corpora&lt;/strong>): a computer-readable collection of text or speech.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Lemma&lt;/strong>: a set of lexical forms having the same stem, the same major part-of-speech, and the same word sense.&lt;/p>
&lt;ul>
&lt;li>E.g.: &lt;code>cats&lt;/code> and &lt;code>cat&lt;/code> have the same lemma &lt;code>cat&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Wordform&lt;/strong>: full inflected or derived form of the word&lt;/p>
&lt;ul>
&lt;li>E.g.: &lt;code>cats&lt;/code> and &lt;code>cat&lt;/code> have the same lemma &lt;code>cat&lt;/code> but are different wordforms&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>How many words are there in English?&lt;/p>
&lt;p>To answer this question we need to distinguish two ways of talking about words.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>One way: number of wordform types&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Type&lt;/strong>: number of distinct words in a corpus&lt;/p>
&lt;ul>
&lt;li>
&lt;p>if the set of words in the vocabulary is $V$ , the number of types is the vocabulary size $|V|$.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>When we speak about the number of words in the language, we are generally referring to word types.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The larger the corpora we look at, the more word types we find&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Tokens&lt;/strong>: total number $N$ of running words&lt;/p>
&lt;ul>
&lt;li>
&lt;p>E.g.: If we ignore punctuation, the following Brown sentence has 16 tokens and 14 types:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-tex" data-lang="tex">&lt;span class="line">&lt;span class="cl">They picnicked by the pool, then lay back on the grass and looked at the stars.
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>
&lt;p>Relationship between the number of types $|V|$ and the number of tokens $N$: &lt;strong>Herdan&amp;rsquo;s Law&lt;/strong> or &lt;strong>Heaps&amp;rsquo; Law&lt;/strong>
&lt;/p>
$$
|V|=k N^{\beta}
$$
&lt;ul>
&lt;li>$k$: positive constant&lt;/li>
&lt;li>$\beta \in (0, 1)$
&lt;ul>
&lt;li>depends on the corpus size and the genre&lt;/li>
&lt;li>for large corpora, ranges from 0.67 to 0.75&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Another way: number of lemmas&lt;/p>
&lt;ul>
&lt;li>Dictionary &lt;strong>entries&lt;/strong> or &lt;strong>boldface&lt;/strong> forms are a very rough upper bound on the number of lemmas&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
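&lt;p>Counting the types and tokens of the Brown sentence above:&lt;/p>

```python
import string

sentence = "They picnicked by the pool, then lay back on the grass and looked at the stars."

# Strip punctuation, then split on whitespace
tokens = sentence.translate(str.maketrans("", "", string.punctuation)).split()
types = set(tokens)

assert len(tokens) == 16   # N: running words
assert len(types) == 14    # |V|: distinct wordforms ("the" appears three times)
```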
&lt;h2 id="text-normalization">Text Normalization&lt;/h2>
&lt;p>Three tasks are commonly applied as part of any normalization process:&lt;/p>
&lt;ol>
&lt;li>&lt;a href="#word-tokenization">Tokenizing (segmenting) words&lt;/a>&lt;/li>
&lt;li>&lt;a href="#word-nomalization-lemmatization-and-stemming">Normalizing word formats&lt;/a>&lt;/li>
&lt;li>&lt;a href="#sentence-segmentation">Segmenting sentences&lt;/a>&lt;/li>
&lt;/ol>
&lt;h3 id="word-tokenization">Word Tokenization&lt;/h3>
&lt;p>&lt;strong>Tokenization&lt;/strong>: the task of segmenting running text into words.&lt;/p>
&lt;p>For most NLP applications we’ll need to keep numbers and punctuation in our tokenization&lt;/p>
&lt;ul>
&lt;li>punctuation
&lt;ul>
&lt;li>as a separate token
&lt;ul>
&lt;li>commas &lt;code>,&lt;/code>: useful piece of information for parsers&lt;/li>
&lt;li>periods &lt;code>.&lt;/code>: help indicate sentence boundaries&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>we also want to keep the punctuation that occurs word internally
&lt;ul>
&lt;li>E.g.: &lt;em>m.p.h,&lt;/em>, &lt;em>Ph.D.&lt;/em>, &lt;em>AT&amp;amp;T&lt;/em>, &lt;em>cap’n&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Special characters and numbers also need to be kept
&lt;ul>
&lt;li>prices &lt;em>($45.55)&lt;/em>&lt;/li>
&lt;li>dates &lt;em>(01/02/06)&lt;/em>&lt;/li>
&lt;li>URLs &lt;em>(&lt;a href="http://www.stanford.edu">http://www.stanford.edu&lt;/a>)&lt;/em>&lt;/li>
&lt;li>Twitter hashtags &lt;em>(#nlproc)&lt;/em>&lt;/li>
&lt;li>email address &lt;em>(&lt;a href="mailto:someone@cs.colorado.edu">someone@cs.colorado.edu&lt;/a>)&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>A tokenizer can be used to&lt;/p>
&lt;ul>
&lt;li>
&lt;p>expand &lt;strong>clitic&lt;/strong> contractions that are marked by apostrophes&lt;/p>
&lt;ul>
&lt;li>&lt;code>what're&lt;/code> -&amp;gt; &lt;code>what are&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>named entity detection&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>tokenize multiword expressions like &lt;code>New York&lt;/code> or &lt;code>rock ’n’ roll&lt;/code> as a single token, which requires a multiword expression dictionary of some sort.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Commonly used tokenization standard: &lt;strong>Penn Treebank tokenization standard&lt;/strong>&lt;/p>
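&lt;p>A crude regex sketch of a few of these decisions (a real tokenizer, such as the Penn Treebank one, handles many more cases):&lt;/p>

```python
import re

# Order matters: try prices/numbers and word-internal abbreviations before bare words
token_re = re.compile(r"""
      \$?\d+(?:\.\d+)?      # prices and numbers, e.g. $45.55
    | (?:[A-Za-z]\.)+       # abbreviations, e.g. m.p.h.
    | \w+(?:'\w+)?          # words, keeping internal apostrophes (cap'n)
    | [.,!?;]               # punctuation as separate tokens
""", re.VERBOSE)

tokens = token_re.findall("The cap'n paid $45.55 at 60 m.p.h., no less!")
assert tokens == ["The", "cap'n", "paid", "$45.55", "at", "60",
                  "m.p.h.", ",", "no", "less", "!"]
```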
&lt;h3 id="byte-pair-encoding-for-tokenization">Byte-Pair Encoding for Tokenization&lt;/h3>
&lt;p>💡 Instead of defining tokens as words (defined by spaces in orthographies that have spaces, or more complex algorithms), or as characters (as in Chinese), &lt;strong>we can use our data to automatically tell us what size tokens should be.&lt;/strong>&lt;/p>
&lt;p>&lt;strong>Morpheme&lt;/strong>: smallest meaning-bearing unit of a language&lt;/p>
&lt;ul>
&lt;li>E.g.: the word &lt;code>unlikeliest&lt;/code> has the morphemes &lt;code>un-&lt;/code>, &lt;code>likely&lt;/code>, and &lt;code>-est&lt;/code>&lt;/li>
&lt;/ul>
&lt;p>One reason it’s helpful to have &lt;strong>subword&lt;/strong> tokens is to deal with unknown words.&lt;/p>
&lt;blockquote>
&lt;p>Unknown words are particularly relevant for machine learning systems. Machine learning systems often learn some facts about words in one corpus (a training corpus) and then use these facts to make decisions about a separate test corpus and its words. Thus if our training corpus contains, say, the words &lt;code>low&lt;/code> and &lt;code>lowest&lt;/code>, but not &lt;code>lower&lt;/code>, and then the word &lt;em>lower&lt;/em> appears in our test corpus, our system will not know what to do with it. 🤪&lt;/p>
&lt;/blockquote>
&lt;p>🔧 Solution: use a kind of tokenization in which most tokens are words, but some tokens are frequent morphemes or other subwords like &lt;code>-er&lt;/code>, so that an unseen word can be represented by combining the parts.&lt;/p>
&lt;p>Simplest algorithm: &lt;strong>byte-pair encoding (BPE)&lt;/strong>&lt;/p>
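&lt;p>A minimal sketch of the BPE learner on a toy dictionary of words as space-separated symbols with counts (the &lt;em>low&lt;/em>/&lt;em>lowest&lt;/em> counts are reconstructed to match the pair totals quoted in this section’s worked example):&lt;/p>

```python
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(vocab, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    a, b = pair
    return {word.replace(f"{a} {b}", a + b): freq for word, freq in vocab.items()}

# Toy dictionary, with the end-of-word symbol "_"
vocab = {"l o w _": 5, "l o w e s t _": 2,
         "n e w e r _": 6, "w i d e r _": 3, "n e w _": 2}

counts = pair_counts(vocab)
assert counts[("r", "_")] == 9          # newer (6) + wider (3)
vocab = merge_pair(vocab, ("r", "_"))   # first merge: r_
vocab = merge_pair(vocab, ("e", "r_"))  # second merge: er_
assert "n e w er_" in vocab             # word-final "er" is now one token
```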
&lt;ul>
&lt;li>
&lt;p>💡 Intuition: iteratively merge frequent pairs of characters&lt;/p>
&lt;/li>
&lt;li>
&lt;p>How it works?&lt;/p>
&lt;ul>
&lt;li>Begins with the set of symbols equal to the set of characters.
&lt;ul>
&lt;li>Each word is represented as a sequence of characters plus a special end-of-word symbol &lt;code>_&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>At each step of the algorithm, we count the number of symbol pairs, find the most frequent pair (‘A’, ‘B’), and replace it with the new merged symbol (‘AB’)&lt;/li>
&lt;li>We continue to count and merge, creating new longer and longer character strings, until we’ve done $k$ merges ($k$ is a parameter of the algorithm)&lt;/li>
&lt;li>The resulting symbol set will consist of the original set of characters plus $k$ new symbols.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>The algorithm is run &lt;em>inside&lt;/em> words (we don’t merge across word boundaries). For this reason, the algorithm can take as input a dictionary of words together with counts.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;p>Consider the following tiny input dictionary with counts for each word, which would have a starting vocabulary of 11 letters&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-06-01%2011.50.28.png" alt="截屏2020-06-01 11.50.28" style="zoom:80%;" />
&lt;ul>
&lt;li>
&lt;p>We first count all pairs of symbols: the most frequent is the pair (&lt;code>r&lt;/code>, &lt;code>_&lt;/code>) because it occurs in &lt;em>newer&lt;/em> (frequency of 6) and &lt;em>wider&lt;/em> (frequency of 3) for a total of 9 occurrences. We then merge these symbols, treating &lt;code>r_&lt;/code> as one symbol, and count again.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-06-01%2011.52.05.png" alt="截屏2020-06-01 11.52.05" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>Now the most frequent pair is (&lt;code>e&lt;/code>, &lt;code>r_&lt;/code>) , which we merge; our system has learned that there should be a token for word-final &lt;code>er&lt;/code>, represented as &lt;code>er_&lt;/code>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-06-01%2011.53.38.png" alt="截屏2020-06-01 11.53.38" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>Next, (&lt;code>e&lt;/code>, &lt;code>w&lt;/code>) (total count of 8) gets merged to &lt;code>ew&lt;/code>:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-06-01%2011.54.53.png" alt="截屏2020-06-01 11.54.53" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>If we continue, the next merges are:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-06-01%2011.55.22.png" alt="截屏2020-06-01 11.55.22" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Test&lt;/p>
&lt;ul>
&lt;li>When we need to tokenize a test sentence, we just run the merges we have learned, greedily, in the order we learned them, on the test data. (Thus the frequencies in the test data don’t play a role, just the frequencies in the training data).
&lt;ul>
&lt;li>First, we segment each word of the test sentence into characters.&lt;/li>
&lt;li>Then we apply the first rule: replace every instance of &lt;code>r&lt;/code> &lt;code>_&lt;/code> in the test corpus with &lt;code>r_&lt;/code> ; and then the second rule: replace every instance of &lt;code>e&lt;/code> &lt;code>r_&lt;/code> in the test corpus with &lt;code>er_&lt;/code>, and so on.&lt;/li>
&lt;li>By the end, if the test corpus contained the word &lt;code>n e w e r _ &lt;/code>, it would be tokenized as a full word. But a new (unknown) word like &lt;code>l o w e r _&lt;/code> would be merged into the two tokens &lt;code>low&lt;/code> &lt;code>er_&lt;/code> .&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>In real applications, BPE is&lt;/p>
&lt;ul>
&lt;li>run with many thousands of merges on a very large input dictionary&lt;/li>
&lt;li>Result: most words will be represented as full symbols, and only the very rare words (and unknown words) will have to be represented by their parts.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
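&lt;p>The learn-and-apply loop above can be sketched in a few lines of Python. This is a minimal sketch, not a production tokenizer: the counts for &lt;em>newer&lt;/em> (6) and &lt;em>wider&lt;/em> (3) come from the text, while the remaining word counts are assumed from the chapter's figure, and ties between equally frequent pairs are broken arbitrarily, so the intermediate merge order may differ from the walkthrough even though the final segmentations agree.&lt;/p>

```python
from collections import Counter

def get_pair_counts(vocab):
    # vocab maps a space-separated symbol sequence to its corpus count
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, vocab):
    # rewrite every adjacent occurrence of `pair` as one merged symbol
    a, b = pair
    new_vocab = {}
    for word, freq in vocab.items():
        symbols, out, i = word.split(), [], 0
        n = len(symbols)
        while i != n:
            if i + 1 != n and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_vocab[" ".join(out)] = freq
    return new_vocab

def learn_bpe(word_counts, k):
    # start with characters plus the end-of-word symbol "_", then do k merges
    vocab = {" ".join(list(w)) + " _": c for w, c in word_counts.items()}
    merges = []
    for _ in range(k):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; ties broken arbitrarily
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges

def apply_merges(word, merges):
    # test-time tokenization: replay the learned merges greedily, in training order
    vocab = {" ".join(list(word)) + " _": 1}
    for pair in merges:
        vocab = merge_pair(pair, vocab)
    (segmented,) = vocab
    return segmented.split()

# "newer"=6 and "wider"=3 are stated in the text; the other counts are
# assumed from the figure of the tiny example dictionary
corpus = {"low": 5, "lowest": 2, "newer": 6, "wider": 3, "new": 2}
merges = learn_bpe(corpus, k=8)
print(apply_merges("newer", merges))  # a known word: tokenized as one full symbol
print(apply_merges("lower", merges))  # an unseen word: split into subword parts
```

&lt;p>Note that &lt;code>apply_merges&lt;/code> never consults test-corpus frequencies: it only replays the training-time merge list, which is exactly the test procedure described above.&lt;/p>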
&lt;h4 id="wordpiece-and-greedy-tokenization">Wordpiece and Greedy Tokenization&lt;/h4>
&lt;p>The &lt;strong>wordpiece&lt;/strong> algorithm starts with some simple tokenization (such as by whitespace) into rough words, and then breaks those rough word tokens into subword tokens.&lt;/p>
&lt;p>&lt;strong>Difference from BPE&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The special word-boundary token &lt;code>_&lt;/code> appears at the &lt;strong>beginning&lt;/strong> of words (rather than at the end)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Rather than merging the pairs that are most &lt;em>frequent&lt;/em>, wordpiece merges the pair that &lt;em>maximizes the language model likelihood&lt;/em> of the training data.&lt;/p>
&lt;p>(the wordpiece model chooses the two tokens to combine that would give the training corpus the &lt;strong>highest&lt;/strong> probability )&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>How does it work?&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>An input sentence or string is first split by some simple basic tokenizer (like whitespace) into a series of rough word tokens.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Then, instead of using a word-boundary token, word-initial subwords are distinguished from those that do not start words by marking internal subwords with the special prefix &lt;code>##&lt;/code>&lt;/p>
&lt;ul>
&lt;li>We might split &lt;code>unaffable&lt;/code> into [&lt;code>un&lt;/code>, &lt;code>##aff&lt;/code>, &lt;code>##able&lt;/code>]&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Then each word token string is tokenized using a &lt;strong>greedy longest-match-first&lt;/strong> algorithm.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Also called &lt;strong>maximum matching&lt;/strong> or &lt;strong>MaxMatch&lt;/strong>.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-06-01%2012.28.54.png" alt="截屏2020-06-01 12.28.54" style="zoom:80%;" />
&lt;ul>
&lt;li>Given a vocabulary (a learned list of wordpiece tokens) and a string&lt;/li>
&lt;li>Starts by pointing at the beginning of a string&lt;/li>
&lt;li>It chooses the longest token in the wordpiece vocabulary that matches the input at the current position, and moves the pointer past that word in the string.&lt;/li>
&lt;li>The algorithm is then applied again starting from the new pointer position.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example&lt;/strong>:&lt;/p>
&lt;p>Given the token &lt;code>intention&lt;/code> and the dictionary:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">[&amp;#34;in&amp;#34;, &amp;#34;tent&amp;#34;,&amp;#34;intent&amp;#34;,&amp;#34;##tent&amp;#34;, &amp;#34;##tention&amp;#34;, &amp;#34;##tion&amp;#34;, &amp;#34;#ion&amp;#34;]
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The tokenizer would choose &lt;code>intent&lt;/code> (because it is longer than &lt;code>in&lt;/code>, and then &lt;code>##ion&lt;/code> to complete the string, resulting in the tokenization &lt;code>[&amp;quot;intent&amp;quot; &amp;quot;##ion&amp;quot;]&lt;/code>.&lt;/p>
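&lt;p>The greedy longest-match-first procedure can be sketched as follows. This is a minimal illustration, assuming the &lt;code>##&lt;/code> internal-subword convention described above and a small dictionary containing &lt;code>##ion&lt;/code>; the &lt;code>[UNK]&lt;/code> fallback for positions no vocabulary entry covers is an assumed convention, not part of the algorithm as stated in the text.&lt;/p>

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first (MaxMatch) tokenization of one rough word."""
    tokens = []
    start = 0
    n = len(word)
    while start != n:
        # scan from the longest remaining substring down to length 1
        end = n
        match = None
        while end != start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # internal subwords carry the ## marker
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # assumed fallback when nothing matches
        tokens.append(match)
        start = end  # move the pointer past the matched piece
    return tokens

vocab = {"in", "tent", "intent", "##tent", "##tention", "##tion", "##ion"}
print(wordpiece_tokenize("intention", vocab))  # → ['intent', '##ion']
```

&lt;p>The same function reproduces the earlier &lt;code>unaffable&lt;/code> example: with the vocabulary &lt;code>{"un", "##aff", "##able"}&lt;/code> it yields &lt;code>["un", "##aff", "##able"]&lt;/code>.&lt;/p>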
&lt;h3 id="word-normalization-lemmatization-and-stemming">Word Normalization, Lemmatization and Stemming&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Word normalization&lt;/strong>: task of putting words/tokens in a standard format, choosing a single normal form for words with multiple forms&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Case folding&lt;/strong>: Mapping everything to lower case
&lt;ul>
&lt;li>&lt;code>Woodchuck&lt;/code> and &lt;code>woodchuck&lt;/code> are represented identically&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>For many natural language processing situations we also want two morphologically different forms of a word to behave similarly.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Lemmatization&lt;/strong>: task of determining that two words have the same root, despite their surface differences.&lt;/p>
&lt;ul>
&lt;li>E.g.
&lt;ul>
&lt;li>&lt;code>am&lt;/code>, &lt;code>are&lt;/code>, and &lt;code>is&lt;/code> have the shared lemma &lt;code>be&lt;/code>&lt;/li>
&lt;li>&lt;code>dinner&lt;/code> and &lt;code>dinners&lt;/code> both have the lemma &lt;em>dinner&lt;/em>&lt;/li>
&lt;li>The lemmatized form of a sentence like &lt;code>He is reading detective stories&lt;/code> would thus be &lt;code>He be read detective story&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Method: complete &lt;strong>morphological parsing&lt;/strong> of the word.
&lt;ul>
&lt;li>&lt;strong>Morphology&lt;/strong>: study of the way words are built up from smaller meaning-bearing units called &lt;strong>morphemes&lt;/strong>.&lt;/li>
&lt;li>Two broad classes of morphemes
&lt;ul>
&lt;li>&lt;strong>Stems&lt;/strong>: the central morpheme of the word, supplying the main meaning&lt;/li>
&lt;li>&lt;strong>Affixes&lt;/strong>: adding &amp;ldquo;additional&amp;rdquo; meanings of various kinds&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>E.g.:
&lt;ul>
&lt;li>the word &lt;code>fox&lt;/code> consists of one morpheme (the morpheme &lt;code>fox&lt;/code>)&lt;/li>
&lt;li>the word &lt;code>cats&lt;/code> consists of two: the morpheme &lt;code>cat&lt;/code> and the morpheme &lt;code>-s&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Stemming&lt;/strong>: naive version of morphological analysis&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Most widely used stemming algorithms: the &lt;strong>Porter Stemmer&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Based on a series of rewrite rules applied in sequence, as a &lt;strong>cascade&lt;/strong>, in which the output of each pass is fed as input to the next pass&lt;/p>
&lt;p>Sampling of rules:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-06-01%2012.49.59.png" alt="截屏2020-06-01 12.49.59" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>Simple stemmers can be useful in cases where we need to collapse across different variants of the same lemma&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Nonetheless, they do tend to commit errors of both over- and under-generalizing 🤪&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
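&lt;p>To make the cascade idea concrete, here is a toy stemmer implementing three rules of the kind sampled above (sses → ss, vowel-stem + ing → ε, ational → ate). This is only an illustrative sketch of the cascade mechanism, not the real Porter stemmer, which has many more rules and measure-based conditions.&lt;/p>

```python
import re

def toy_stem(word):
    """Toy three-rule cascade: each pass rewrites the previous pass's output."""
    # Pass 1: sses -> ss   (grasses -> grass)
    word = re.sub(r"sses$", "ss", word)
    # Pass 2: (stem containing a vowel) + ing -> stem   (motoring -> motor,
    # but "sing" is untouched because the would-be stem "s" has no vowel)
    m = re.match(r"(.*?[aeiou].*)ing$", word)
    if m:
        word = m.group(1)
    # Pass 3: ational -> ate   (relational -> relate)
    word = re.sub(r"ational$", "ate", word)
    return word

for w in ["grasses", "motoring", "relational", "sing"]:
    print(w, "->", toy_stem(w))
```

&lt;p>Even this tiny cascade shows why stemmers over- and under-generalize: the rules match surface patterns only, with no access to the word's actual morphology.&lt;/p>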
&lt;h3 id="sentence-segmentation">Sentence Segmentation&lt;/h3>
&lt;p>The most useful cues for segmenting a text into sentences are &lt;strong>punctuation&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Question marks and exclamation points are relatively unambiguous markers of sentence boundaries 👏&lt;/li>
&lt;li>Periods are more ambiguous 🤪
&lt;ul>
&lt;li>The period character “.” is ambiguous between a sentence boundary marker and a marker of abbreviations like &lt;code>Mr.&lt;/code> or &lt;code>Inc.&lt;/code> (the final period of &lt;em>Inc.&lt;/em> can mark both an abbreviation and the sentence boundary 🤪)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Sentence tokenization methods work by first deciding (based on rules or machine learning) whether a period is part of the word or is a sentence-boundary marker.&lt;/p>
&lt;ul>
&lt;li>An abbreviation dictionary can help determine whether the period is part of a commonly used abbreviation&lt;/li>
&lt;/ul>
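&lt;p>A minimal rule-based splitter along these lines might look like this. The abbreviation dictionary here is a tiny hypothetical example (real systems use much larger lists), and the whitespace tokenization is a simplifying assumption.&lt;/p>

```python
# A tiny, hypothetical abbreviation dictionary; real systems use far larger lists.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "inc.", "e.g.", "etc."}

def split_sentences(text):
    """Rule-based splitter: '?' and '!' always end a sentence; a period ends
    one only if the token carrying it is not a known abbreviation."""
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        if tok.endswith(("?", "!")):
            sentences.append(" ".join(current))
            current = []
        elif tok.endswith(".") and tok.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing material with no final punctuation
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Mr. Smith works at Acme Inc. in Boston. Really!"))
```

&lt;p>Note the limitation the text points out: if &lt;em>Inc.&lt;/em> ends the sentence, its period marks both the abbreviation and the boundary, and this simple dictionary lookup will miss the split; that case needs rules or machine learning beyond a lookup.&lt;/p>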
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://web.stanford.edu/~jurafsky/slp3/2.pdf">Regular Expressions, Text Normalization, and Edit Distance&lt;/a>&lt;/li>
&lt;/ul></description></item></channel></rss>