<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Regular Expressions | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/regular-expressions/</link><atom:link href="https://haobin-tan.netlify.app/tags/regular-expressions/index.xml" rel="self" type="application/rss+xml"/><description>Regular Expressions</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sun, 02 Aug 2020 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>Regular Expressions</title><link>https://haobin-tan.netlify.app/tags/regular-expressions/</link></image><item><title>Regular Expressions</title><link>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/re-text_normalization-edit_distance/02_1-regular_expressions/</link><pubDate>Sun, 02 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/re-text_normalization-edit_distance/02_1-regular_expressions/</guid><description>&lt;h2 id="regular-expressions">Regular Expressions&lt;/h2>
&lt;p>&lt;strong>Regular expressions (REs)&lt;/strong> are particularly useful for searching in texts: we have a pattern to search for and a corpus of texts to search through.&lt;/p>
&lt;h2 id="basic-re-patterns">Basic RE Patterns&lt;/h2>
&lt;h3 id="case-sensitive">&lt;strong>Case sensitive&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;code>/s/&lt;/code> is distinct from &lt;code>/S/&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>/woodchucks/&lt;/code> will NOT match the string &lt;code>/Woodchucks/&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Disjunction&lt;/strong> of characters: &lt;code>[]&lt;/code>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-31%2015.08.12.png" alt="截屏2020-05-31 15.08.12" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
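&lt;p>These behaviors are easy to check in Python’s &lt;code>re&lt;/code> module:&lt;/p>

```python
import re

# Case sensitive: /woodchucks/ does NOT match "Woodchucks"
assert re.search(r"woodchucks", "Woodchucks") is None

# Disjunction with a character class: [wW] matches either case
assert re.search(r"[wW]oodchucks", "Woodchucks") is not None
assert re.search(r"[wW]oodchucks", "woodchucks") is not None

# [1234567890] matches any single digit
m = re.search(r"[1234567890]", "plenty of 7 to 5")
assert m.group() == "7"
```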
&lt;h3 id="specify-range--">Specify &lt;strong>range&lt;/strong>: &lt;code>-&lt;/code>&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;code>/[2-5]/&lt;/code>: any one of the characters &lt;em>2, 3, 4, or 5&lt;/em>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>/[b-g]/&lt;/code>: one of the characters &lt;em>b&lt;/em>, &lt;em>c&lt;/em>, &lt;em>d&lt;/em>, &lt;em>e&lt;/em>, &lt;em>f&lt;/em>, or &lt;em>g&lt;/em>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-31%2015.10.03.png" alt="截屏2020-05-31 15.10.03" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
&lt;h3 id="not-be-">&lt;strong>Not be&lt;/strong>: &lt;code>^&lt;/code>&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>If the caret &lt;code>^&lt;/code> is the first symbol after the open square bracket &lt;code>[&lt;/code>, the resulting pattern is negated.&lt;/p>
&lt;ul>
&lt;li>&lt;code>/[^a]/&lt;/code> matches any single character (including special characters) except &lt;em>a&lt;/em>.&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-31%2015.13.08.png" alt="截屏2020-05-31 15.13.08" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
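&lt;p>Ranges and negation, checked in Python’s &lt;code>re&lt;/code>:&lt;/p>

```python
import re

# Range: [2-5] is any one of 2, 3, 4, or 5
assert re.findall(r"[2-5]", "1 2 3 6") == ["2", "3"]

# Negation: [^a] matches any single character except "a"
assert re.search(r"[^a]", "aab").group() == "b"

# A caret NOT in first position has no special meaning:
# [a^b] matches "a", "^", or "b"
assert re.search(r"[a^b]", "look up ^ now").group() == "^"
```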
&lt;h3 id="optionality-of-the-previous-char-">&lt;strong>Optionality&lt;/strong> of the previous char: &lt;code>?&lt;/code>&lt;/h3>
&lt;ul>
&lt;li>“the preceding character or nothing” or &amp;ldquo;zero or one instances of the previous character&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-31%2015.15.33.png" alt="截屏2020-05-31 15.15.33" style="zoom:80%;" />
&lt;h3 id="zero-or-more--the-kleene-">&lt;strong>Zero or more&lt;/strong>: &lt;code>*&lt;/code> (the Kleene *)&lt;/h3>
&lt;ul>
&lt;li>“zero or more occurrences of the immediately previous character or regular expression”
&lt;ul>
&lt;li>&lt;code>/a*/&lt;/code> means “any string of zero or more &lt;em>a&lt;/em>s”
&lt;ul>
&lt;li>Will match &lt;em>a&lt;/em> or &lt;em>aaaaaa&lt;/em>&lt;/li>
&lt;li>Also match &lt;em>Off Minor&lt;/em> (since the string &lt;em>Off Minor&lt;/em> has zero &lt;em>a&lt;/em>’s)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="one-or-more--the-kleene-">&lt;strong>One or more&lt;/strong>: &lt;code>+&lt;/code> (the Kleene +)&lt;/h3>
&lt;ul>
&lt;li>&amp;ldquo;at least one&amp;rdquo; of some character (“one or more occurrences of the immediately preceding character or regular expression”)&lt;/li>
&lt;li>&lt;code>/[0-9]+/&lt;/code> &lt;em>is the normal way to specify “a sequence of digits”&lt;/em>&lt;/li>
&lt;/ul>
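&lt;p>The three quantifiers side by side in Python’s &lt;code>re&lt;/code>:&lt;/p>

```python
import re

# ? : zero or one of the preceding character
assert re.search(r"woodchucks?", "woodchuck") is not None
assert re.search(r"colou?r", "color") is not None

# * : zero or more (Kleene star); a* also matches the empty string
assert re.search(r"a*", "Off Minor").group() == ""  # zero a's, empty match

# + : one or more (Kleene +); [0-9]+ is a sequence of digits
assert re.search(r"[0-9]+", "room 237").group() == "237"
```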
&lt;h3 id="wildcard-expression-">&lt;strong>Wildcard&lt;/strong> expression: &lt;code>.&lt;/code>&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>matches any single character (&lt;em>except&lt;/em> a carriage return)&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-31%2015.21.38.png" alt="截屏2020-05-31 15.21.38" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>Often used together with the Kleene star &lt;code>*&lt;/code> to mean “any string of characters”&lt;/p>
&lt;ul>
&lt;li>E.g. suppose we want to find any line in which a particular word, for example, &lt;em>aardvark&lt;/em>, appears twice. We can specify this with &lt;code>/aardvark.*aardvark/&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
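&lt;p>The wildcard, and the common &lt;code>.*&lt;/code> idiom, in Python:&lt;/p>

```python
import re

# . matches any single character (except a newline)
assert re.search(r"beg.n", "begin") is not None
assert re.search(r"beg.n", "began") is not None
assert re.search(r"beg.n", "begun") is not None

# .* : "any string of characters"; find a line where aardvark appears twice
line = "the aardvark saw another aardvark"
assert re.search(r"aardvark.*aardvark", line) is not None
```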
&lt;h3 id="anchors">&lt;strong>Anchors&lt;/strong>&lt;/h3>
&lt;p>&lt;strong>special characters that anchor regular expressions to particular places in a string&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;code>^&lt;/code>: start of a line&lt;/p>
&lt;ul>
&lt;li>&lt;code>/^The/&lt;/code> matches the word &lt;em>The&lt;/em> only at the start of a line.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;code>$&lt;/code>: end of the line&lt;/p>
&lt;ul>
&lt;li>&lt;code>/^The dog\.$/&lt;/code> matches a line that contains only the phrase &lt;em>The dog&lt;/em>.
&lt;ul>
&lt;li>(We have to use the backslash here since we want the . to mean “period” and not the wildcard)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;code>\b&lt;/code>: word boundary&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;code>/\bthe\b/&lt;/code> matches the word &lt;em>the&lt;/em> but not the word &lt;em>other&lt;/em>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>A “word” for the purposes of a regular expression is defined as any sequence of digits, underscores, or letters (based on the definition of “words” in programming languages)&lt;/p>
&lt;p>E.g., &lt;code>/\b99\b/&lt;/code> will&lt;/p>
&lt;ul>
&lt;li>match the string &lt;em>99&lt;/em> in &lt;em>There are 99 bottles of beer on the wall&lt;/em> (because 99 follows a space) ✅&lt;/li>
&lt;li>but NOT &lt;em>99&lt;/em> in &lt;em>There are 299 bottles of beer on the wall&lt;/em> (since 99 follows a number) ❌&lt;/li>
&lt;li>match &lt;em>99&lt;/em> in &lt;em>$99&lt;/em> (since 99 follows a dollar sign ($), which is not a digit, underscore, or letter) ✅&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;code>\B&lt;/code>: non-boundary&lt;/p>
&lt;/li>
&lt;/ul>
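&lt;p>Anchors and word boundaries, checked in Python (covering the ✅/❌ cases above):&lt;/p>

```python
import re

# ^ and $ anchor the match to the start and end of the string (or line)
assert re.search(r"^The", "The dog barked") is not None
assert re.search(r"^The", "See The dog") is None
assert re.search(r"^The dog\.$", "The dog.") is not None

# \b : word boundary ("word" = letters, digits, underscore)
assert re.search(r"\bthe\b", "other words") is None
assert re.search(r"\b99\b", "There are 99 bottles") is not None
assert re.search(r"\b99\b", "There are 299 bottles") is None
assert re.search(r"\b99\b", "price: $99") is not None  # $ is not a word char
```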
&lt;h2 id="disjunction-grouping-and-precedence">Disjunction, Grouping, and Precedence&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Disjunction&lt;/strong> operator/&lt;strong>pipe&lt;/strong> symbol: &lt;code>|&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;code>/cat|dog/&lt;/code> matches either the string &lt;em>cat&lt;/em> or the string &lt;em>dog&lt;/em>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Parenthesis operator: &lt;code>(&lt;/code> and &lt;code>)&lt;/code>&lt;/p>
&lt;ul>
&lt;li>Make the disjunction operator apply only to a specific pattern
&lt;ul>
&lt;li>&lt;code>/gupp(y|ies)/&lt;/code> matches either &lt;em>guppy&lt;/em> or &lt;em>guppies&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Groups the whole pattern
&lt;ul>
&lt;li>Suppose we have a line with column labels of the form &lt;em>Column 1 Column 2 Column 3&lt;/em>. With the parentheses, we can write &lt;code>/(Column␣[0-9]+␣*)*/&lt;/code>: the Kleene star now applies to the whole parenthesized pattern, matching the word &lt;em>Column&lt;/em>, a number, and optional spaces, repeated zero or more times&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Operator precedence hierarchy&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The following table gives the order of RE operator precedence, from highest precedence to lowest precedence&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-31%2023.09.49.png" alt="截屏2020-05-31 23.09.49" style="zoom: 67%;" />
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Greedy and non-greedy matching&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Greedy&lt;/strong>: expanding to cover as much of a string as they can (always match the &lt;em>largest&lt;/em> string they can)&lt;/li>
&lt;li>&lt;strong>Non-greedy&lt;/strong>: matches as little text as possible
&lt;ul>
&lt;li>Append the &lt;code>?&lt;/code> qualifier to a quantifier to enforce non-greedy matching&lt;/li>
&lt;li>&lt;code>*?&lt;/code>: zero or more, matching as little as possible&lt;/li>
&lt;li>&lt;code>+?&lt;/code>: one or more, matching as little as possible&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
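&lt;p>Disjunction scope and greedy vs. non-greedy matching in Python:&lt;/p>

```python
import re

# | : disjunction; () limits its scope
assert re.search(r"gupp(y|ies)", "guppies").group() == "guppies"
assert re.search(r"gupp(y|ies)", "guppy").group() == "guppy"

# Greedy vs. non-greedy: * grabs as much as it can, *? as little
text = "once upon a time"
assert re.search(r"o.*n", text).group() == "once upon"
assert re.search(r"o.*?n", text).group() == "on"
```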
&lt;h2 id="example">Example&lt;/h2>
&lt;p>Suppose we wanted to write a RE to find cases of the English article &lt;em>the&lt;/em>.&lt;/p>
&lt;p>A simple (but incorrect) pattern might be: &lt;code>/the/&lt;/code>&lt;/p>
&lt;ul>
&lt;li>&lt;span style="color:red">Problem: this pattern will miss the word when it begins a sentence and hence is capitalized (i.e., &lt;em>The&lt;/em>)&lt;/span>&lt;/li>
&lt;/ul>
&lt;p>This might lead us to the following pattern: &lt;code>/[tT]he/&lt;/code>&lt;/p>
&lt;ul>
&lt;li>&lt;span style="color:red">Problem: still incorrectly return texts with the embedded in other words (e.g., &lt;em>other&lt;/em> or &lt;em>theology&lt;/em>).&lt;/span>&lt;/li>
&lt;/ul>
&lt;p>We need to specify that we want instances with a word boundary on both sides: &lt;code>/\b[tT]he\b/&lt;/code>&lt;/p>
&lt;p>Suppose we wanted to do this without the use of &lt;code>/\b/&lt;/code>, since &lt;code>/\b/&lt;/code> won’t treat underscores and numbers as word boundaries; but we might want to find &lt;em>the&lt;/em> in some context where it might also have underscores or numbers nearby (&lt;em>the_&lt;/em> or &lt;em>the25&lt;/em>). We need to specify that there are no alphabetic letters on either side of the &lt;em>the&lt;/em>: &lt;code>/[^a-zA-Z][tT]he[^a-zA-Z]/&lt;/code>&lt;/p>
&lt;ul>
&lt;li>&lt;span style="color:red">Problem: it won’t find the word &lt;em>the&lt;/em> when it begins a line.&lt;/span>&lt;/li>
&lt;/ul>
&lt;p>We can avoid this by specifying that before the &lt;em>the&lt;/em> we require &lt;em>either&lt;/em> the beginning-of-line or a non-alphabetic character, and the same at the end of the line:&lt;/p>
&lt;p>&lt;code>/(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/&lt;/code>&lt;/p>
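&lt;p>Checking this final pattern in Python:&lt;/p>

```python
import re

# "the" or "The" with a non-letter (or line start/end) on both sides
the_re = re.compile(r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)")

assert the_re.search("The dog chased the cat.") is not None
assert the_re.search("other theology") is None
assert the_re.search("the") is not None              # begins (and ends) the line
assert the_re.search("the_25 things") is not None    # underscore does not block it
```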
&lt;blockquote>
&lt;p>The process we just went through was based on fixing two kinds of errors:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>false positives&lt;/strong>, strings that we incorrectly matched like &lt;em>other&lt;/em> or &lt;em>there&lt;/em>,&lt;/li>
&lt;li>&lt;strong>false negatives&lt;/strong>, strings that we incorrectly missed, like &lt;em>The&lt;/em>.&lt;/li>
&lt;/ul>
&lt;/blockquote>
&lt;h2 id="more-operators">More operators&lt;/h2>
&lt;h3 id="common-sets-of-characters">Common sets of characters&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-31%2023.28.13.png" alt="截屏2020-05-31 23.28.13" style="zoom:80%;" />
&lt;h3 id="counting">Counting&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-31%2023.28.40.png" alt="截屏2020-05-31 23.28.40" style="zoom:80%;" />
&lt;h3 id="special-characters-based-on-the-backslash-">Special characters based on the backslash (&lt;code>\&lt;/code>)&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-31%2023.29.30.png" alt="截屏2020-05-31 23.29.30" style="zoom:80%;" />
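&lt;p>A few of these operators in Python: the aliases &lt;code>\d&lt;/code>, &lt;code>\w&lt;/code>, &lt;code>\s&lt;/code> and the counting braces:&lt;/p>

```python
import re

# Aliases: \d = [0-9], \w = [a-zA-Z0-9_], \s = whitespace
assert re.findall(r"\d", "Dec 25") == ["2", "5"]
assert re.search(r"\w+", "part_2!").group() == "part_2"

# Counting: {n}, {n,m}, {n,}
assert re.search(r"a{2}", "aaa").group() == "aa"
assert re.search(r"[0-9]{3,4}", "year 1999!").group() == "1999"
assert re.search(r"x{2,}", "xxxxx").group() == "xxxxx"
```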
&lt;h2 id="substitution-capture-groups">Substitution, Capture Groups&lt;/h2>
&lt;p>&lt;strong>Substitution&lt;/strong> operator: &lt;code>s/regexp/pattern/&lt;/code>&lt;/p>
&lt;ul>
&lt;li>Allows a string characterized by a regular expression to be replaced by another string&lt;/li>
&lt;/ul>
&lt;p>To refer to a particular subpart of the string matched by the first pattern:&lt;/p>
&lt;ul>
&lt;li>we put parentheses ( and ) around the first pattern and use the number operator &lt;code>\1&lt;/code> in the second pattern to refer back&lt;/li>
&lt;li>Example
&lt;ul>
&lt;li>suppose we wanted to put angle brackets around all integers in a text, for example, changing &lt;em>the 35 boxes&lt;/em> to &lt;em>the&lt;/em> &amp;lt;&lt;em>35&lt;/em>&amp;gt; &lt;em>boxes&lt;/em>.&lt;/li>
&lt;li>We can implement like this: &lt;code>s/([0-9]+)/&amp;lt;\1&amp;gt;/&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
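&lt;p>The same substitution in Python’s &lt;code>re.sub&lt;/code> (square brackets stand in here for the angle brackets of the example):&lt;/p>

```python
import re

# Put brackets around all integers; \1 refers back to the captured digits
# (square brackets stand in for the angle brackets used in the text)
result = re.sub(r"([0-9]+)", r"[\1]", "the 35 boxes")
assert result == "the [35] boxes"
```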
&lt;p>The parenthesis and number operators can also specify that a certain string or expression must occur twice in the text.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>E.g.: suppose we are looking for the pattern “the Xer they were, the Xer they will be”, where we want to constrain the two X’s to be the same string&lt;/p>
&lt;/li>
&lt;li>
&lt;p>We do this by surrounding the first X with the parenthesis operator, and replacing the second X with the number operator &lt;code>\1&lt;/code>&lt;/p>
&lt;p>&lt;code>/the (.*)er they were, the \1er they will be/&lt;/code>&lt;/p>
&lt;ul>
&lt;li>Here the &lt;code>\1&lt;/code> will be replaced by whatever string matched the first item in parentheses.&lt;/li>
&lt;li>So this will match &lt;em>the bigger they were, the bigger they will be&lt;/em> but not &lt;em>the bigger they were, the faster they will be&lt;/em>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
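&lt;p>The back-reference example, checked in Python:&lt;/p>

```python
import re

pattern = r"the (.*)er they were, the \1er they will be"

# \1 must repeat whatever the first group matched
assert re.search(pattern, "the bigger they were, the bigger they will be") is not None
assert re.search(pattern, "the bigger they were, the faster they will be") is None
```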
&lt;p>This use of parentheses to store a pattern in memory is called a &lt;strong>capture group&lt;/strong>. Every time a capture group is used (i.e., parentheses surround a pattern), the resulting match is stored in a &lt;em>numbered&lt;/em> &lt;strong>register&lt;/strong>: the first capture group is stored in &lt;code>\1&lt;/code>, the second in &lt;code>\2&lt;/code>, the third in &lt;code>\3&lt;/code>, the fourth in &lt;code>\4&lt;/code>, and so on.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>E.g.: &lt;code>/the (.*)er they (.*), the \1er we \2/&lt;/code>&lt;/p>
&lt;p>will match &lt;em>the faster they ran, the faster we ran&lt;/em> but not &lt;em>the faster they ran, the faster we ate&lt;/em>.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Parentheses thus have a double function in regular expressions&lt;/p>
&lt;ul>
&lt;li>they are used to group terms for specifying the order in which operators should apply&lt;/li>
&lt;li>they are used to capture something in a register&lt;/li>
&lt;/ul>
&lt;p>Sometimes we might want to use parentheses for grouping but do NOT want to capture the resulting pattern in a register. In that case we use a &lt;strong>non-capturing group&lt;/strong>, specified by putting &lt;code>?:&lt;/code> after the open paren, in the form &lt;code>(?:pattern)&lt;/code>.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>E.g.:&lt;/p>
&lt;p>&lt;code>/(?:some|a few) (people|cats) like some \1/&lt;/code>&lt;/p>
&lt;p>will match &lt;em>some cats like some cats&lt;/em> but not &lt;em>some cats like some a few&lt;/em>.&lt;/p>
&lt;/li>
&lt;/ul></description></item><item><title>Minimum Edit Distance</title><link>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/re-text_normalization-edit_distance/02_3-min_edit_distance/</link><pubDate>Sun, 02 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/re-text_normalization-edit_distance/02_3-min_edit_distance/</guid><description>&lt;h2 id="definition">Definition&lt;/h2>
&lt;p>&lt;strong>Minimum edit distance&lt;/strong> between two strings: the minimum number of editing operations (insertion, deletion, substitution) needed to transform one string into the other.&lt;/p>
&lt;p>Example&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The gap between &lt;em>intention&lt;/em> and &lt;em>execution&lt;/em>, for example, is 5 (delete an &lt;code>i&lt;/code>, substitute &lt;code>e&lt;/code> for &lt;code>n&lt;/code>, substitute &lt;code>x&lt;/code> for &lt;code>t&lt;/code>, insert &lt;code>c&lt;/code>, substitute &lt;code>u&lt;/code> for &lt;code>n&lt;/code>).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Visualization&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-06-01%2013.03.34.png" alt="截屏2020-06-01 13.03.34" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
&lt;h2 id="levenshtein-distance">Levenshtein distance&lt;/h2>
&lt;p>Original version:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Each of the three operations (insertion, deletion, substitution) has a cost of 1&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The substitution of a letter for itself (E.g., &lt;code>t&lt;/code> for &lt;code>t&lt;/code>), has zero cost.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The Levenshtein distance between &lt;em>intention&lt;/em> and &lt;em>execution&lt;/em> is 5&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Alternative version:&lt;/p>
&lt;ul>
&lt;li>Insertion or deletion has a cost of 1&lt;/li>
&lt;li>Substitution has a cost of 2 (since any substitution can be represented by one insertion and one deletion)&lt;/li>
&lt;li>Using this version, the Levenshtein distance between &lt;em>intention&lt;/em> and &lt;em>execution&lt;/em> is 8.&lt;/li>
&lt;/ul>
&lt;h2 id="the-minimum-edit-distance-algorithm">The Minimum Edit Distance Algorithm&lt;/h2>
&lt;p>How do we find the minimum edit distance?&lt;/p>
&lt;p>💡 Think of this as a search task, in which we are searching for the &lt;strong>shortest path&lt;/strong>—a sequence of edits—from one string to another.&lt;/p>
&lt;ul>
&lt;li>Just remember the shortest path to a state each time we see it.
&lt;ul>
&lt;li>We can do this by using &lt;strong>dynamic programming&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="dynamic-programming">&lt;strong>Dynamic programming&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>💡 Intuition: a large problem can be solved by properly combining the solutions to various sub-problems&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Apply a table-driven method to solve problems by combining solutions to sub-problems&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Example: Consider the shortest path of transformed words that represents the minimum edit distance between the strings &lt;em>intention&lt;/em> and &lt;em>execution&lt;/em>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-06-02%2009.34.39.png" alt="截屏2020-06-02 09.34.39" style="zoom:70%;" />
&lt;blockquote>
&lt;p>Imagine some string (perhaps it is &lt;em>exention&lt;/em>) that is in this optimal path (whatever it is). The intuition of dynamic programming is that if &lt;em>exention&lt;/em> is in the optimal operation list, then the optimal sequence must also include the optimal path from &lt;em>intention&lt;/em> to &lt;em>exention&lt;/em>. Why? If there were a shorter path from &lt;em>intention&lt;/em> to &lt;em>exention&lt;/em>, then we could use it instead, resulting in a shorter overall path, and the optimal sequence wouldn’t be optimal, thus leading to a contradiction.&lt;/p>
&lt;/blockquote>
&lt;/li>
&lt;/ul>
&lt;h3 id="minimum-edit-distance-algorithm">Minimum Edit Distance Algorithm&lt;/h3>
&lt;p>Define the minimum edit distance between two strings:&lt;/p>
&lt;ul>
&lt;li>Given:
&lt;ul>
&lt;li>Source string $X$ of length $n$&lt;/li>
&lt;li>Target string $Y$ of length $m$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>$D[i, j]:=$ edit distance between $X[1..i]$ and $Y[1..j]$ (the first $i$ characters of $X$ and the first $j$ characters of $Y$)&lt;/li>
&lt;li>Thus, the edit distance between $X$ and $Y$ is $D[n, m]$&lt;/li>
&lt;/ul>
&lt;p>We’ll use dynamic programming to compute $D[n, m]$ &lt;strong>bottom up&lt;/strong>, combining solutions to subproblems.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Base case:&lt;/p>
&lt;ul>
&lt;li>A source substring of length $i$ and an empty target: going from $i$ characters to 0 requires $i$ deletions, so $D[i, 0] = i$&lt;/li>
&lt;li>An empty source and a target substring of length $j$: going from 0 characters to $j$ characters requires $j$ insertions, so $D[0, j] = j$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Having computed $D[i,j]$ for small $i, j$, we then compute larger $D[i,j]$ based on previously computed smaller values. The value of $D[i,j]$ is computed by taking the minimum of the three possible paths through the matrix which arrive there:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200802235633719.png" alt="image-20200802235633719" style="zoom:15%;" />
&lt;p>If we assume the version of Levenshtein distance in which the insertions and deletions each have a cost of 1 ($\operatorname{ins-cost}(\cdot)=\operatorname{del-cost}(\cdot)=1$), and substitutions have a cost of 2 (except that substitution of identical letters has zero cost), the computation for $D[i,j]$ becomes:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200802235915637.png" alt="image-20200802235915637" style="zoom:15%;" />
&lt;/li>
&lt;/ul>
&lt;h3 id="pseudocode">Pseudocode&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-06-02%2010.28.09.png" alt="截屏2020-06-02 10.28.09" style="zoom:80%;" />
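&lt;p>A direct Python translation of the pseudocode (a sketch; &lt;code>sub_cost&lt;/code> defaults to the 2-cost version used above):&lt;/p>

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    """Compute D[n, m] bottom-up, as in the pseudocode."""
    n, m = len(source), len(target)
    # D[i][j] = edit distance between source[:i] and target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # base case: i deletions
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, m + 1):          # base case: j insertions
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + del_cost,      # deletion
                          D[i][j - 1] + ins_cost,      # insertion
                          D[i - 1][j - 1] + sub)       # substitution
    return D[n][m]

# 2-cost substitutions: distance 8; 1-cost (original Levenshtein): distance 5
assert min_edit_distance("intention", "execution") == 8
assert min_edit_distance("intention", "execution", sub_cost=1) == 5
```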
&lt;h3 id="example">Example&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-06-02%2010.30.04.png" alt="截屏2020-06-02 10.30.04" style="zoom:70%;" />
&lt;h2 id="minimum-cost-alignment">Minimum Cost Alignment&lt;/h2>
&lt;p>With a small change, the edit distance algorithm can also provide the minimum cost &lt;strong>alignment&lt;/strong> between two strings.&lt;/p>
&lt;p>To extend the edit distance algorithm to produce an alignment, we can start by visualizing an alignment as a path through the edit distance matrix.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-06-02%2010.43.24.png" alt="截屏2020-06-02 10.43.24" style="zoom:75%;" />
&lt;ul>
&lt;li>Boldfaced cell: represents an alignment of a pair of letters in the two strings.
&lt;ul>
&lt;li>If two boldfaced cells occur in the same row, there will be an insertion in going from the source to the target&lt;/li>
&lt;li>If two boldfaced cells occur in the same column, there will be a deletion.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Computation:&lt;/p>
&lt;ol>
&lt;li>we augment the minimum edit distance algorithm to store backpointers in each cell.
&lt;ul>
&lt;li>The backpointer from a cell points to the previous cell (or cells) that we came from in entering the current cell.&lt;/li>
&lt;li>Some cells have multiple backpointers because the minimum extension could have come from multiple previous cells.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>we perform a &lt;strong>backtrace&lt;/strong>.
&lt;ul>
&lt;li>we start from the last cell (at the final row and column), and follow the pointers back through the dynamic programming matrix. Each complete path between the final cell and the initial cell is a minimum distance alignment.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol></description></item><item><title>Words and Text Normalization</title><link>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/re-text_normalization-edit_distance/02_2-words_and_text_normalization/</link><pubDate>Sun, 02 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/re-text_normalization-edit_distance/02_2-words_and_text_normalization/</guid><description>&lt;h2 id="tldr">TL;DR&lt;/h2>
&lt;ul>
&lt;li>Two ways for counting words
&lt;ul>
&lt;li>Number of wordform types
&lt;ul>
&lt;li>Relationship between #Types and #Tokens: &lt;strong>Heaps&amp;rsquo; Law&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Number of lemmas&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Text Normalization
&lt;ol>
&lt;li>Tokenizing (segmenting) words
&lt;ul>
&lt;li>Byte-Pair Encoding (BPE)&lt;/li>
&lt;li>Wordpiece&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Normalizing word formats
&lt;ul>
&lt;li>Word normalization
&lt;ul>
&lt;li>case folding&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Lemmatization&lt;/li>
&lt;li>Stemming
&lt;ul>
&lt;li>Porter Stemmer&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Segmenting sentences&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
&lt;h2 id="definition">Definition&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Corpus&lt;/strong> (pl. &lt;strong>corpora&lt;/strong>): a computer-readable collection of text or speech.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Lemma&lt;/strong>: a set of lexical forms having the same stem, the same major part-of-speech, and the same word sense.&lt;/p>
&lt;ul>
&lt;li>E.g.: &lt;code>cats&lt;/code> and &lt;code>cat&lt;/code> have the same lemma &lt;code>cat&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Wordform&lt;/strong>: full inflected or derived form of the word&lt;/p>
&lt;ul>
&lt;li>E.g.: &lt;code>cats&lt;/code> and &lt;code>cat&lt;/code> have the same lemma &lt;code>cat&lt;/code> but are different wordforms&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>How many words are there in English?&lt;/p>
&lt;p>To answer this question we need to distinguish two ways of talking about words.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>One way: number of wordform types&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Type&lt;/strong>: number of distinct words in a corpus&lt;/p>
&lt;ul>
&lt;li>
&lt;p>if the set of words in the vocabulary is $V$ , the number of types is the vocabulary size $|V|$.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>When we speak about the number of words in the language, we are generally referring to word types.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The larger the corpora we look at, the more word types we find&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Tokens&lt;/strong>: total number $N$ of running words&lt;/p>
&lt;ul>
&lt;li>
&lt;p>E.g.: If we ignore punctuation, the following Brown sentence has 16 tokens and 14 types:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-tex" data-lang="tex">&lt;span class="line">&lt;span class="cl">They picnicked by the pool, then lay back on the grass and looked at the stars.
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>
&lt;p>Relationship between the number of types $|V|$ and the number of tokens $N$: &lt;strong>Herdan&amp;rsquo;s Law&lt;/strong> or &lt;strong>Heaps&amp;rsquo; Law&lt;/strong>
&lt;/p>
$$
|V|=k N^{\beta}
$$
&lt;ul>
&lt;li>$k$: positive constant&lt;/li>
&lt;li>$\beta \in (0, 1)$
&lt;ul>
&lt;li>depends on the corpus size and the genre&lt;/li>
&lt;li>for large corpora, ranges from 0.67 to 0.75&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Another way: number of lemmas&lt;/p>
&lt;ul>
&lt;li>Dictionary &lt;strong>entries&lt;/strong> or &lt;strong>boldface&lt;/strong> forms are a very rough upper bound on the number of lemmas&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
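&lt;p>Counting the types and tokens of the Brown sentence above:&lt;/p>

```python
import string

sentence = "They picnicked by the pool, then lay back on the grass and looked at the stars."

# Strip punctuation, then split on whitespace
tokens = sentence.translate(str.maketrans("", "", string.punctuation)).split()
types = set(tokens)

assert len(tokens) == 16   # N: running words
assert len(types) == 14    # |V|: distinct wordforms ("the" appears three times)
```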
&lt;h2 id="text-normalization">Text Normalization&lt;/h2>
&lt;p>Three tasks are commonly applied as part of any normalization process:&lt;/p>
&lt;ol>
&lt;li>&lt;a href="#word-tokenization">Tokenizing (segmenting) words&lt;/a>&lt;/li>
&lt;li>&lt;a href="#word-nomalization-lemmatization-and-stemming">Normalizing word formats&lt;/a>&lt;/li>
&lt;li>&lt;a href="#sentence-segmentation">Segmenting sentences&lt;/a>&lt;/li>
&lt;/ol>
&lt;h3 id="word-tokenization">Word Tokenization&lt;/h3>
&lt;p>&lt;strong>Tokenization&lt;/strong>: the task of segmenting running text into words.&lt;/p>
&lt;p>For most NLP applications we’ll need to keep numbers and punctuation in our tokenization&lt;/p>
&lt;ul>
&lt;li>punctuation
&lt;ul>
&lt;li>as a separate token
&lt;ul>
&lt;li>commas &lt;code>,&lt;/code>: useful piece of information for parsers&lt;/li>
&lt;li>periods &lt;code>.&lt;/code>: help indicate sentence boundaries&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>we also want to keep the punctuation that occurs word internally
&lt;ul>
&lt;li>E.g.: &lt;em>m.p.h,&lt;/em>, &lt;em>Ph.D.&lt;/em>, &lt;em>AT&amp;amp;T&lt;/em>, &lt;em>cap’n&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Special characters and numbers also need to be kept
&lt;ul>
&lt;li>prices &lt;em>($45.55)&lt;/em>&lt;/li>
&lt;li>dates &lt;em>(01/02/06)&lt;/em>&lt;/li>
&lt;li>URLs &lt;em>(&lt;a href="http://www.stanford.edu">http://www.stanford.edu&lt;/a>)&lt;/em>&lt;/li>
&lt;li>Twitter hashtags &lt;em>(#nlproc)&lt;/em>&lt;/li>
&lt;li>email address &lt;em>(&lt;a href="mailto:someone@cs.colorado.edu">someone@cs.colorado.edu&lt;/a>)&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>A tokenizer can be used to&lt;/p>
&lt;ul>
&lt;li>
&lt;p>expand &lt;strong>clitic&lt;/strong> contractions that are marked by apostrophes&lt;/p>
&lt;ul>
&lt;li>&lt;code>what're&lt;/code> -&amp;gt; &lt;code>what are&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>named entity detection&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>tokenize multiword expressions like &lt;code>New York&lt;/code> or &lt;code>rock ’n’ roll&lt;/code> as a single token, which requires a multiword expression dictionary of some sort.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Commonly used tokenization standard: &lt;strong>Penn Treebank tokenization standard&lt;/strong>&lt;/p>
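&lt;p>A crude regex sketch of a few of these decisions (a real tokenizer, such as the Penn Treebank one, handles many more cases):&lt;/p>

```python
import re

# Order matters: try prices/numbers and word-internal abbreviations before bare words
token_re = re.compile(r"""
      \$?\d+(?:\.\d+)?      # prices and numbers, e.g. $45.55
    | (?:[A-Za-z]\.)+       # abbreviations, e.g. m.p.h.
    | \w+(?:'\w+)?          # words, keeping internal apostrophes (cap'n)
    | [.,!?;]               # punctuation as separate tokens
""", re.VERBOSE)

tokens = token_re.findall("The cap'n paid $45.55 at 60 m.p.h., no less!")
assert tokens == ["The", "cap'n", "paid", "$45.55", "at", "60",
                  "m.p.h.", ",", "no", "less", "!"]
```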
&lt;h3 id="byte-pair-encoding-for-tokenization">Byte-Pair Encoding for Tokenization&lt;/h3>
&lt;p>💡 Instead of defining tokens as words (defined by spaces in orthographies that have spaces, or more complex algorithms), or as characters (as in Chinese), &lt;strong>we can use our data to automatically tell us what size tokens should be.&lt;/strong>&lt;/p>
&lt;p>&lt;strong>Morpheme&lt;/strong>: smallest meaning-bearing unit of a language&lt;/p>
&lt;ul>
&lt;li>E.g.: the word &lt;code>unlikeliest&lt;/code> has the morphemes &lt;code>un-&lt;/code>, &lt;code>likely&lt;/code>, and &lt;code>-est&lt;/code>&lt;/li>
&lt;/ul>
&lt;p>One reason it’s helpful to have &lt;strong>subword&lt;/strong> tokens is to deal with unknown words.&lt;/p>
&lt;blockquote>
&lt;p>Unknown words are particularly relevant for machine learning systems. Machine learning systems often learn some facts about words in one corpus (a training corpus) and then use these facts to make decisions about a separate test corpus and its words. Thus if our training corpus contains, say, the words &lt;code>low&lt;/code> and &lt;code>lowest&lt;/code>, but not &lt;code>lower&lt;/code>, and then the word &lt;em>lower&lt;/em> appears in our test corpus, our system will not know what to do with it. 🤪&lt;/p>
&lt;/blockquote>
&lt;p>🔧 Solution: use a kind of tokenization in which most tokens are words, but some tokens are frequent morphemes or other subwords like &lt;code>-er&lt;/code>, so that an unseen word can be represented by combining the parts.&lt;/p>
&lt;p>Simplest algorithm: &lt;strong>byte-pair encoding (BPE)&lt;/strong>&lt;/p>
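&lt;p>A minimal sketch of the BPE learner on a toy dictionary of words as space-separated symbols with counts (the &lt;em>low&lt;/em>/&lt;em>lowest&lt;/em> counts are reconstructed to match the pair totals quoted in this section’s worked example):&lt;/p>

```python
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(vocab, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    a, b = pair
    return {word.replace(f"{a} {b}", a + b): freq for word, freq in vocab.items()}

# Toy dictionary, with the end-of-word symbol "_"
vocab = {"l o w _": 5, "l o w e s t _": 2,
         "n e w e r _": 6, "w i d e r _": 3, "n e w _": 2}

counts = pair_counts(vocab)
assert counts[("r", "_")] == 9          # newer (6) + wider (3)
vocab = merge_pair(vocab, ("r", "_"))   # first merge: r_
vocab = merge_pair(vocab, ("e", "r_"))  # second merge: er_
assert "n e w er_" in vocab             # word-final "er" is now one token
```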
&lt;ul>
&lt;li>
&lt;p>💡 Intuition: iteratively merge frequent pairs of characters&lt;/p>
&lt;/li>
&lt;li>
&lt;p>How it works?&lt;/p>
&lt;ul>
&lt;li>Begins with the set of symbols equal to the set of characters.
&lt;ul>
&lt;li>Each word is represented as a sequence of characters plus a special end-of-word symbol &lt;code>_&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>At each step of the algorithm, we count the number of symbol pairs, find the most frequent pair (‘A’, ‘B’), and replace it with the new merged symbol (‘AB’)&lt;/li>
&lt;li>We continue to count and merge, creating new longer and longer character strings, until we’ve done $k$ merges ($k$ is a parameter of the algorithm)&lt;/li>
&lt;li>The resulting symbol set will consist of the original set of characters plus $k$ new symbols.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>The algorithm is run &lt;em>inside&lt;/em> words (we don’t merge across word boundaries). For this reason, the algorithm can take as input a dictionary of words together with counts.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;p>Consider the following tiny input dictionary with counts for each word, which would have a starting vocabulary of 11 letters&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-06-01%2011.50.28.png" alt="截屏2020-06-01 11.50.28" style="zoom:80%;" />
&lt;ul>
&lt;li>
&lt;p>We first count all pairs of symbols: the most frequent is the pair (&lt;code>r&lt;/code>, &lt;code>_&lt;/code>) because it occurs in &lt;em>newer&lt;/em> (frequency of 6) and &lt;em>wider&lt;/em> (frequency of 3) for a total of 9 occurrences. We then merge these symbols, treating &lt;code>r_&lt;/code> as one symbol, and count again.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-06-01%2011.52.05.png" alt="截屏2020-06-01 11.52.05" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>Now the most frequent pair is (&lt;code>e&lt;/code>, &lt;code>r_&lt;/code>) , which we merge; our system has learned that there should be a token for word-final &lt;code>er&lt;/code>, represented as &lt;code>er_&lt;/code>&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-06-01%2011.53.38.png" alt="截屏2020-06-01 11.53.38" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>Next, (&lt;code>e&lt;/code>, &lt;code>w&lt;/code>) (total count of 8) gets merged to &lt;code>ew&lt;/code>:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-06-01%2011.54.53.png" alt="截屏2020-06-01 11.54.53" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>If we continue, the next merges are:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-06-01%2011.55.22.png" alt="截屏2020-06-01 11.55.22" style="zoom:80%;" />
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Test&lt;/p>
&lt;ul>
&lt;li>When we need to tokenize a test sentence, we just run the merges we have learned, greedily, in the order we learned them, on the test data. (Thus the frequencies in the test data don’t play a role, just the frequencies in the training data).
&lt;ul>
&lt;li>First, we segment each word of the test sentence into characters.&lt;/li>
&lt;li>Then we apply the first rule: replace every instance of &lt;code>r&lt;/code> &lt;code>_&lt;/code> in the test corpus with &lt;code>r_&lt;/code> ; and then the second rule: replace every instance of &lt;code>e&lt;/code> &lt;code>r_&lt;/code> in the test corpus with &lt;code>er_&lt;/code>, and so on.&lt;/li>
&lt;li>By the end, if the test corpus contained the word &lt;code>n e w e r _ &lt;/code>, it would be tokenized as a full word. But a new (unknown) word like &lt;code>l o w e r _&lt;/code> would be merged into the two tokens &lt;code>low&lt;/code> &lt;code>er_&lt;/code> .&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>In real applications, BPE is&lt;/p>
&lt;ul>
&lt;li>run with many thousands of merges on a very large input dictionary&lt;/li>
&lt;li>Result: most words will be represented as full symbols, and only the very rare words (and unknown words) will have to be represented by their parts.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
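&lt;p>The learn-and-apply loop above can be sketched in a few lines of Python. This is a minimal sketch, not a production tokenizer: the counts for &lt;em>newer&lt;/em> (6) and &lt;em>wider&lt;/em> (3) come from the text, while the remaining word counts are assumed from the chapter's figure, and ties between equally frequent pairs are broken arbitrarily, so the intermediate merge order may differ from the walkthrough even though the final segmentations agree.&lt;/p>

```python
from collections import Counter

def get_pair_counts(vocab):
    # vocab maps a space-separated symbol sequence to its corpus count
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, vocab):
    # rewrite every adjacent occurrence of `pair` as one merged symbol
    a, b = pair
    new_vocab = {}
    for word, freq in vocab.items():
        symbols, out, i = word.split(), [], 0
        n = len(symbols)
        while i != n:
            if i + 1 != n and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_vocab[" ".join(out)] = freq
    return new_vocab

def learn_bpe(word_counts, k):
    # start with characters plus the end-of-word symbol "_", then do k merges
    vocab = {" ".join(list(w)) + " _": c for w, c in word_counts.items()}
    merges = []
    for _ in range(k):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; ties broken arbitrarily
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges

def apply_merges(word, merges):
    # test-time tokenization: replay the learned merges greedily, in training order
    vocab = {" ".join(list(word)) + " _": 1}
    for pair in merges:
        vocab = merge_pair(pair, vocab)
    (segmented,) = vocab
    return segmented.split()

# "newer"=6 and "wider"=3 are stated in the text; the other counts are
# assumed from the figure of the tiny example dictionary
corpus = {"low": 5, "lowest": 2, "newer": 6, "wider": 3, "new": 2}
merges = learn_bpe(corpus, k=8)
print(apply_merges("newer", merges))  # a known word: tokenized as one full symbol
print(apply_merges("lower", merges))  # an unseen word: split into subword parts
```

&lt;p>Note that &lt;code>apply_merges&lt;/code> never consults test-corpus frequencies: it only replays the training-time merge list, which is exactly the test procedure described above.&lt;/p>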
&lt;h4 id="wordpiece-and-greedy-tokenization">Wordpiece and Greedy Tokenization&lt;/h4>
&lt;p>The &lt;strong>wordpiece&lt;/strong> algorithm starts with some simple tokenization (such as by whitespace) into rough words, and then breaks those rough word tokens into subword tokens.&lt;/p>
&lt;p>&lt;strong>Difference from BPE&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The special word-boundary token &lt;code>_&lt;/code> appears at the &lt;strong>beginning&lt;/strong> of words (rather than at the end)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Rather than merging the pairs that are most &lt;em>frequent&lt;/em>, wordpiece merges the pair that &lt;em>maximizes the language model likelihood&lt;/em> of the training data.&lt;/p>
&lt;p>(the wordpiece model chooses the two tokens to combine that would give the training corpus the &lt;strong>highest&lt;/strong> probability )&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>How does it work?&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>An input sentence or string is first split by some simple basic tokenizer (like whitespace) into a series of rough word tokens.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Then, instead of using a word-boundary token, word-initial subwords are distinguished from those that do not start words by marking internal subwords with the special prefix &lt;code>##&lt;/code>&lt;/p>
&lt;ul>
&lt;li>We might split &lt;code>unaffable&lt;/code> into [&lt;code>un&lt;/code>, &lt;code>##aff&lt;/code>, &lt;code>##able&lt;/code>]&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Then each word token string is tokenized using a &lt;strong>greedy longest-match-first&lt;/strong> algorithm.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Also called &lt;strong>maximum matching&lt;/strong> or &lt;strong>MaxMatch&lt;/strong>.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-06-01%2012.28.54.png" alt="截屏2020-06-01 12.28.54" style="zoom:80%;" />
&lt;ul>
&lt;li>Given a vocabulary (a learned list of wordpiece tokens) and a string&lt;/li>
&lt;li>Starts by pointing at the beginning of a string&lt;/li>
&lt;li>It chooses the longest token in the wordpiece vocabulary that matches the input at the current position, and moves the pointer past that word in the string.&lt;/li>
&lt;li>The algorithm is then applied again starting from the new pointer position.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example&lt;/strong>:&lt;/p>
&lt;p>Given the token &lt;code>intention&lt;/code> and the dictionary:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">[&amp;#34;in&amp;#34;, &amp;#34;tent&amp;#34;,&amp;#34;intent&amp;#34;,&amp;#34;##tent&amp;#34;, &amp;#34;##tention&amp;#34;, &amp;#34;##tion&amp;#34;, &amp;#34;#ion&amp;#34;]
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The tokenizer would choose &lt;code>intent&lt;/code> (because it is longer than &lt;code>in&lt;/code>, and then &lt;code>##ion&lt;/code> to complete the string, resulting in the tokenization &lt;code>[&amp;quot;intent&amp;quot; &amp;quot;##ion&amp;quot;]&lt;/code>.&lt;/p>
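&lt;p>The greedy longest-match-first procedure can be sketched as follows. This is a minimal illustration, assuming the &lt;code>##&lt;/code> internal-subword convention described above and a small dictionary containing &lt;code>##ion&lt;/code>; the &lt;code>[UNK]&lt;/code> fallback for positions no vocabulary entry covers is an assumed convention, not part of the algorithm as stated in the text.&lt;/p>

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first (MaxMatch) tokenization of one rough word."""
    tokens = []
    start = 0
    n = len(word)
    while start != n:
        # scan from the longest remaining substring down to length 1
        end = n
        match = None
        while end != start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # internal subwords carry the ## marker
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # assumed fallback when nothing matches
        tokens.append(match)
        start = end  # move the pointer past the matched piece
    return tokens

vocab = {"in", "tent", "intent", "##tent", "##tention", "##tion", "##ion"}
print(wordpiece_tokenize("intention", vocab))  # → ['intent', '##ion']
```

&lt;p>The same function reproduces the earlier &lt;code>unaffable&lt;/code> example: with the vocabulary &lt;code>{"un", "##aff", "##able"}&lt;/code> it yields &lt;code>["un", "##aff", "##able"]&lt;/code>.&lt;/p>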
&lt;h3 id="word-normalization-lemmatization-and-stemming">Word Normalization, Lemmatization and Stemming&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Word normalization&lt;/strong>: task of putting words/tokens in a standard format, choosing a single normal form for words with multiple forms&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Case folding&lt;/strong>: Mapping everything to lower case
&lt;ul>
&lt;li>&lt;code>Woodchuck&lt;/code> and &lt;code>woodchuck&lt;/code> are represented identically&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>For many natural language processing situations we also want two morphologically different forms of a word to behave similarly.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Lemmatization&lt;/strong>: task of determining that two words have the same root, despite their surface differences.&lt;/p>
&lt;ul>
&lt;li>E.g.
&lt;ul>
&lt;li>&lt;code>am&lt;/code>, &lt;code>are&lt;/code>, and &lt;code>is&lt;/code> have the shared lemma &lt;code>be&lt;/code>&lt;/li>
&lt;li>&lt;code>dinner&lt;/code> and &lt;code>dinners&lt;/code> both have the lemma &lt;em>dinner&lt;/em>&lt;/li>
&lt;li>The lemmatized form of a sentence like &lt;code>He is reading detective stories&lt;/code> would thus be &lt;code>He be read detective story&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Method: complete &lt;strong>morphological parsing&lt;/strong> of the word.
&lt;ul>
&lt;li>&lt;strong>Morphology&lt;/strong>: study of the way words are built up from smaller meaning-bearing units called &lt;strong>morphemes&lt;/strong>.&lt;/li>
&lt;li>Two broad classes of morphemes
&lt;ul>
&lt;li>&lt;strong>Stems&lt;/strong>: the central morpheme of the word, supplying the main meaning&lt;/li>
&lt;li>&lt;strong>Affixes&lt;/strong>: adding &amp;ldquo;additional&amp;rdquo; meanings of various kinds&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>E.g.:
&lt;ul>
&lt;li>the word &lt;code>fox&lt;/code> consists of one morpheme (the morpheme &lt;code>fox&lt;/code>)&lt;/li>
&lt;li>the word &lt;code>cats&lt;/code> consists of two: the morpheme &lt;code>cat&lt;/code> and the morpheme &lt;code>-s&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Stemming&lt;/strong>: naive version of morphological analysis&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Most widely used stemming algorithms: the &lt;strong>Porter Stemmer&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Based on a series of rewrite rules applied in sequence, as a &lt;strong>cascade&lt;/strong>, in which the output of each pass is fed as input to the next pass&lt;/p>
&lt;p>Sampling of rules:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-06-01%2012.49.59.png" alt="截屏2020-06-01 12.49.59" style="zoom:80%;" />
&lt;/li>
&lt;li>
&lt;p>Simple stemmers can be useful in cases where we need to collapse across different variants of the same lemma&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Nonetheless, they do tend to commit errors of both over- and under-generalizing 🤪&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
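&lt;p>To make the cascade idea concrete, here is a toy stemmer implementing three rules of the kind sampled above (sses → ss, vowel-stem + ing → ε, ational → ate). This is only an illustrative sketch of the cascade mechanism, not the real Porter stemmer, which has many more rules and measure-based conditions.&lt;/p>

```python
import re

def toy_stem(word):
    """Toy three-rule cascade: each pass rewrites the previous pass's output."""
    # Pass 1: sses -> ss   (grasses -> grass)
    word = re.sub(r"sses$", "ss", word)
    # Pass 2: (stem containing a vowel) + ing -> stem   (motoring -> motor,
    # but "sing" is untouched because the would-be stem "s" has no vowel)
    m = re.match(r"(.*?[aeiou].*)ing$", word)
    if m:
        word = m.group(1)
    # Pass 3: ational -> ate   (relational -> relate)
    word = re.sub(r"ational$", "ate", word)
    return word

for w in ["grasses", "motoring", "relational", "sing"]:
    print(w, "->", toy_stem(w))
```

&lt;p>Even this tiny cascade shows why stemmers over- and under-generalize: the rules match surface patterns only, with no access to the word's actual morphology.&lt;/p>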
&lt;h3 id="sentence-segmentation">Sentence Segmentation&lt;/h3>
&lt;p>The most useful cues for segmenting a text into sentences are &lt;strong>punctuation&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Question marks and exclamation points are relatively unambiguous markers of sentence boundaries 👏&lt;/li>
&lt;li>Periods are more ambiguous 🤪
&lt;ul>
&lt;li>The period character “.” is ambiguous between a sentence boundary marker and a marker of abbreviations like &lt;code>Mr.&lt;/code> or &lt;code>Inc.&lt;/code> (the final period of &lt;em>Inc.&lt;/em> can mark both an abbreviation and the sentence boundary 🤪)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Sentence tokenization methods work by first deciding (based on rules or machine learning) whether a period is part of the word or is a sentence-boundary marker.&lt;/p>
&lt;ul>
&lt;li>An abbreviation dictionary can help determine whether the period is part of a commonly used abbreviation&lt;/li>
&lt;/ul>
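&lt;p>A minimal rule-based splitter along these lines might look like this. The abbreviation dictionary here is a tiny hypothetical example (real systems use much larger lists), and the whitespace tokenization is a simplifying assumption.&lt;/p>

```python
# A tiny, hypothetical abbreviation dictionary; real systems use far larger lists.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "inc.", "e.g.", "etc."}

def split_sentences(text):
    """Rule-based splitter: '?' and '!' always end a sentence; a period ends
    one only if the token carrying it is not a known abbreviation."""
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        if tok.endswith(("?", "!")):
            sentences.append(" ".join(current))
            current = []
        elif tok.endswith(".") and tok.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing material with no final punctuation
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Mr. Smith works at Acme Inc. in Boston. Really!"))
```

&lt;p>Note the limitation the text points out: if &lt;em>Inc.&lt;/em> ends the sentence, its period marks both the abbreviation and the boundary, and this simple dictionary lookup will miss the split; that case needs rules or machine learning beyond a lookup.&lt;/p>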
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://web.stanford.edu/~jurafsky/slp3/2.pdf">Regular Expressions, Text Normalization, and Edit Distance&lt;/a>&lt;/li>
&lt;/ul></description></item></channel></rss>