<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>POS Tagging | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/pos-tagging/</link><atom:link href="https://haobin-tan.netlify.app/tags/pos-tagging/index.xml" rel="self" type="application/rss+xml"/><description>POS Tagging</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Mon, 03 Aug 2020 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>POS Tagging</title><link>https://haobin-tan.netlify.app/tags/pos-tagging/</link></image><item><title>POS Tagging</title><link>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/pos-tagging/</link><pubDate>Mon, 03 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/pos-tagging/</guid><description/></item><item><title>Part-of-Speech Tagging</title><link>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/pos-tagging/pos-tagging/</link><pubDate>Mon, 03 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/pos-tagging/pos-tagging/</guid><description>&lt;p>Parts of speech (a.k.a. &lt;strong>POS&lt;/strong>, &lt;strong>word classes&lt;/strong>, or &lt;strong>syntactic categories&lt;/strong>) are useful because they reveal a lot about a word and its neighbors.&lt;/p>
&lt;p>E.g.: Knowing whether a word is a noun or a verb tells us about likely neighboring words&lt;/p>
&lt;h2 id="style-convention">Style Convention&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Bold&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>new terms/concepts/definitions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>important points&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;em>Italic&lt;/em>: examples&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="mostly-english-word-classes">(Mostly) English Word Classes&lt;/h2>
&lt;p>Parts of Speech&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>closed&lt;/strong> class&lt;/p>
&lt;ul>
&lt;li>
&lt;p>close $\equiv$ relatively fixed membership&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Generally &lt;strong>function words&lt;/strong>, which&lt;/p>
&lt;ul>
&lt;li>tend to be very short&lt;/li>
&lt;li>occur frequently&lt;/li>
&lt;li>often have structuring uses in grammar&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Important closed classes in English:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Prepositions&lt;/strong> &lt;em>(介词)&lt;/em>&lt;/p>
&lt;ul>
&lt;li>&lt;em>on, under, over, near, by, at, from, to, with&lt;/em>&lt;/li>
&lt;li>Occur before noun phrases&lt;/li>
&lt;li>Often indicate spatial or temporal relations
&lt;ul>
&lt;li>literal (&lt;em>on it&lt;/em>, &lt;em>before then&lt;/em>, &lt;em>by the house&lt;/em>)&lt;/li>
&lt;li>metaphorical (&lt;em>on time&lt;/em>, &lt;em>with gusto&lt;/em>, &lt;em>beside herself&lt;/em>)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Also indicate other relations
&lt;ul>
&lt;li>&lt;em>Hamlet was written &lt;u>by&lt;/u> Shakespeare&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Particle&lt;/strong> &lt;em>（小品词）&lt;/em>&lt;/p>
&lt;ul>
&lt;li>&lt;em>up, down, on, off, in, out, at, by&lt;/em>&lt;/li>
&lt;li>Resembles a preposition or an adverb&lt;/li>
&lt;li>Used in combination with a verb&lt;/li>
&lt;li>Particles often have extended meanings that aren’t quite the same as the prepositions they resemble&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Determiner&lt;/strong> （&lt;em>限定词&lt;/em>）&lt;/p>
&lt;ul>
&lt;li>Occurs with nouns, often marking the beginning of a noun phrase
&lt;ul>
&lt;li>&lt;strong>article&lt;/strong> &lt;em>（冠词）&lt;/em>
&lt;ul>
&lt;li>indefinite: &lt;em>a, an&lt;/em>&lt;/li>
&lt;li>definite: &lt;em>the&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Other determiners: &lt;em>this, that&lt;/em> (&lt;em>this chapter, that page&lt;/em>)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Conjunctions&lt;/strong> &lt;em>（连词）&lt;/em>: join two phrases, clauses, or sentences&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Coordinating conjunctions&lt;/strong>: join two elements of equal status
&lt;ul>
&lt;li>&lt;em>and, or, but&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Subordinating conjunctions&lt;/strong>: when one of the elements has some embedded status
&lt;ul>
&lt;li>&lt;em>I thought that you might like some milk&lt;/em>
&lt;ul>
&lt;li>main clause: &lt;em>I thought&lt;/em>&lt;/li>
&lt;li>subordinate clause: &lt;em>you might like some milk&lt;/em>&lt;/li>
&lt;li>Subordinating conjunctions like &lt;em>that&lt;/em> which link a verb to its argument in this way are also called &lt;strong>complementizers&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Pronouns&lt;/strong> &lt;em>（代词）&lt;/em>: often act as a kind of shorthand for referring to some noun phrase or entity or event&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Personal pronouns&lt;/strong> &lt;em>（人称代词）&lt;/em>: refer to persons or entities
&lt;ul>
&lt;li>&lt;em>you&lt;/em>, &lt;em>she&lt;/em>, &lt;em>I&lt;/em>, &lt;em>it&lt;/em>, &lt;em>me&lt;/em>, etc&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Possessive pronouns&lt;/strong> &lt;em>（物主代词）&lt;/em>: forms of personal pronouns that indicate either actual possession or, more often, just an abstract relation between the person and some object
&lt;ul>
&lt;li>&lt;em>my, your, his, her, its, one’s, our, their&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Wh-pronouns&lt;/strong>: are used in certain question forms
&lt;ul>
&lt;li>&lt;em>what, who, whom, whoever&lt;/em>&lt;/li>
&lt;li>may also act as complementizers
&lt;ul>
&lt;li>&lt;em>Frida, who married Diego. . .&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>auxiliary verbs&lt;/strong> &lt;em>（助动词）&lt;/em>&lt;/p>
&lt;ul>
&lt;li>mark semantic features of a main verb
&lt;ul>
&lt;li>whether an action takes place in the present, past, or future (tense)&lt;/li>
&lt;li>whether it is completed (aspect)&lt;/li>
&lt;li>whether it is negated (polarity)&lt;/li>
&lt;li>whether an action is necessary, possible, suggested, or desired (mood)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Copula verb&lt;/strong> &lt;em>（系动词）&lt;/em>:
&lt;ul>
&lt;li>&lt;em>be&lt;/em>
&lt;ul>
&lt;li>connects subjects with certain kinds of predicate nominals and adjectives
&lt;ul>
&lt;li>&lt;em>He &lt;u>is&lt;/u> a duck&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>is used as part of the passive (&lt;em>We &lt;u>were&lt;/u> robbed&lt;/em>) or progressive (&lt;em>We &lt;u>are&lt;/u> leaving&lt;/em>) constructions&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;em>have&lt;/em>
&lt;ul>
&lt;li>mark the perfect tenses
&lt;ul>
&lt;li>&lt;em>I &lt;u>have&lt;/u> gone&lt;/em>&lt;/li>
&lt;li>&lt;em>I &lt;u>had&lt;/u> gone&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Modal verbs&lt;/strong> &lt;em>（情态动词）&lt;/em>: mark the mood associated with the event depicted by the main verb
&lt;ul>
&lt;li>&lt;em>can&lt;/em>: indicates ability or possibility&lt;/li>
&lt;li>&lt;em>may&lt;/em>: indicates permission or possibility&lt;/li>
&lt;li>&lt;em>must&lt;/em>: indicates necessity&lt;/li>
&lt;li>There is also a modal use of &lt;em>have&lt;/em> (e.g., &lt;em>I &lt;u>have&lt;/u> to go&lt;/em>).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>open&lt;/strong> class&lt;/p>
&lt;ul>
&lt;li>open $\equiv$ words are continually being created or borrowed&lt;/li>
&lt;li>Four major open classes:
&lt;ul>
&lt;li>&lt;strong>Nouns&lt;/strong>: include concrete terms (&lt;em>ship&lt;/em> and &lt;em>chair&lt;/em>), abstractions (&lt;em>bandwidth&lt;/em> and &lt;em>relationship&lt;/em>), and verb-like terms (&lt;em>pacing&lt;/em> as in &lt;em>His pacing to and fro became quite annoying&lt;/em>)
&lt;ul>
&lt;li>&lt;strong>Proper nouns&lt;/strong>: names of specific persons or entities
&lt;ul>
&lt;li>E.g.: &lt;em>Regina, Colorado, IBM&lt;/em>&lt;/li>
&lt;li>Generally NOT preceded by articles
&lt;ul>
&lt;li>&lt;em>Regina is upstairs&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Usually capitalized&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Common nouns&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Count nouns&lt;/strong>
&lt;ul>
&lt;li>Allow grammatical enumeration, occurring in both the singular and plural
&lt;ul>
&lt;li>&lt;em>goat/goats, relationship/relationships&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Can be counted
&lt;ul>
&lt;li>&lt;em>one goat, two goats&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Singular count nouns can NOT appear without articles
&lt;ul>
&lt;li>&lt;del>&lt;em>Goat is white&lt;/em>&lt;/del> ❌&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Mass nouns&lt;/strong>
&lt;ul>
&lt;li>something is conceptualized as a homogeneous group&lt;/li>
&lt;li>Can NOT be counted
&lt;ul>
&lt;li>&lt;em>snow, salt&lt;/em>, and &lt;em>communism&lt;/em>&lt;/li>
&lt;li>&lt;em>&lt;del>two snows&lt;/del>&lt;/em> ❌&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Can appear without articles
&lt;ul>
&lt;li>&lt;em>Snow is white&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Verbs&lt;/strong>
&lt;ul>
&lt;li>Refer to actions and processes, including main verbs (&lt;em>draw, provide, go&lt;/em>)&lt;/li>
&lt;li>Have inflections
&lt;ul>
&lt;li>non-third-person-sg (&lt;em>eat&lt;/em>)&lt;/li>
&lt;li>third-person-sg (&lt;em>eats&lt;/em>)&lt;/li>
&lt;li>progressive (&lt;em>eating&lt;/em>)&lt;/li>
&lt;li>past participle (&lt;em>eaten&lt;/em>)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Adjectives&lt;/strong>
&lt;ul>
&lt;li>Includes many terms for properties or qualities&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Adverbs&lt;/strong>: can be viewed as modifying something (often verbs)
&lt;ul>
&lt;li>Type:
&lt;ul>
&lt;li>&lt;strong>Directional/locative adverbs&lt;/strong>: specify the direction or location of some action
&lt;ul>
&lt;li>&lt;em>home&lt;/em>, &lt;em>here&lt;/em>, &lt;em>downhill&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Degree adverbs&lt;/strong>: specify the extent of some action, process, or property
&lt;ul>
&lt;li>&lt;em>extremely&lt;/em>, &lt;em>very&lt;/em>, &lt;em>somewhat&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Manner adverbs&lt;/strong>: describe the manner of some action or process
&lt;ul>
&lt;li>&lt;em>slowly&lt;/em>, &lt;em>slinkily&lt;/em>, &lt;em>delicately&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Temporal adverbs&lt;/strong>: describe the time that some action or event took place
&lt;ul>
&lt;li>&lt;em>yesterday&lt;/em>, &lt;em>Monday&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Some adverbs (e.g., temporal adverbs like &lt;em>Monday&lt;/em>) are tagged in some tagging schemes as nouns.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Many words of more or less unique function&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>interjections&lt;/strong> (&lt;em>oh, hey, alas, uh, um&lt;/em>),&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>negatives&lt;/strong> (&lt;em>no, not&lt;/em>),&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>politeness markers&lt;/strong> (&lt;em>please, thank you&lt;/em>),&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>greetings&lt;/strong> (&lt;em>hello, goodbye&lt;/em>),&lt;/p>
&lt;/li>
&lt;li>
&lt;p>existential &lt;strong>there&lt;/strong> (&lt;em>there are two on the table&lt;/em>)&lt;/p>
&lt;p>These classes may be distinguished or lumped together as interjections or adverbs depending on the purpose of the labeling.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="summary">Summary&lt;/h3>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-23%2015.40.32.png" alt="截屏2020-05-23 15.40.32" style="zoom: 40%;" />
&lt;h2 id="the-penn-treebank-part-of-speech-tagset">The Penn Treebank Part-of-Speech Tagset&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>45-tag Penn Treebank tagset (Marcus et al., 1993)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Parts of speech are generally represented by placing the tag after each word, delimited by a slash&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-05-23%2015.42.40.png"
alt="Penn Treebank Part-of-Speech Tagset">&lt;figcaption>
&lt;p>Penn Treebank Part-of-Speech Tagset&lt;/p>
&lt;/figcaption>
&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;p>Example:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>There/EX are/VBP 70/CD children/NNS there/RB&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Preliminary/JJ findings/NNS were/VBD reported/VBN in/IN today/NN ’s/POS New/NNP England/NNP Journal/NNP of/IN Medicine/NNP ./.&lt;/p>
&lt;/li>
&lt;/ul>
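&lt;p>The slash-delimited &lt;em>word/TAG&lt;/em> format above is easy to parse. A minimal sketch in Python (the function name &lt;code>parse_tagged&lt;/code> is ours, not from any library):&lt;/p>

```python
def parse_tagged(line):
    """Split a slash-delimited tagged string into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST slash, so the token "./."
        # correctly yields the word "." with tag "."
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

print(parse_tagged("There/EX are/VBP 70/CD children/NNS there/RB"))
# → [('There', 'EX'), ('are', 'VBP'), ('70', 'CD'), ('children', 'NNS'), ('there', 'RB')]
```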
&lt;h3 id="tagged-corpora">Tagged corpora&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Brown&lt;/strong> corpus&lt;/li>
&lt;li>&lt;strong>WSJ&lt;/strong> corpus&lt;/li>
&lt;li>&lt;strong>Switchboard&lt;/strong> corpus&lt;/li>
&lt;/ul>
&lt;h2 id="part-of-speech-tagging">Part-of-Speech Tagging&lt;/h2>
&lt;p>&lt;strong>Part-of-speech tagging&lt;/strong>: the process of assigning a part-of-speech marker to each word token in an input text&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Input to a tagging algorithm: a sequence of (tokenized) words and a tagset&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Output: a sequence of tags, one per token.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Tagging&lt;/strong>: disambiguation task&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Words are ambiguous: they have more than one possible part-of-speech 😧&lt;/p>
&lt;/li>
&lt;li>
&lt;p>🎯 Goal: find the correct tag for the situation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>E.g.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;em>book&lt;/em> can be a verb (&lt;em>&lt;u>book&lt;/u> that flight&lt;/em>) or a noun (&lt;em>hand me that &lt;u>book&lt;/u>&lt;/em>)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;em>That&lt;/em> can be a determiner (&lt;em>Does that flight serve dinner&lt;/em>) or a complementizer (&lt;em>I thought that your flight was earlier&lt;/em>).&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>The goal of POS-tagging is to resolve these ambiguities, choosing the proper tag for the context.&lt;/strong> 💪&lt;/p>
&lt;p>The most ambiguous frequent words are &lt;em>that&lt;/em>, &lt;em>back&lt;/em>, &lt;em>down&lt;/em>, &lt;em>put&lt;/em> and &lt;em>set&lt;/em>.&lt;/p>
&lt;p>E.g., 6 POS for the word &lt;em>back&lt;/em>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>earnings growth took a &lt;strong>back/JJ&lt;/strong> seat&lt;/p>
&lt;/li>
&lt;li>
&lt;p>a small building in the &lt;strong>back/NN&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>a clear majority of senators &lt;strong>back/VBP&lt;/strong> the bill&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Dave began to &lt;strong>back/VB&lt;/strong> toward the door&lt;/p>
&lt;/li>
&lt;li>
&lt;p>enable the country to buy &lt;strong>back/RP&lt;/strong> about debt&lt;/p>
&lt;/li>
&lt;li>
&lt;p>I was twenty-one &lt;strong>back/RB&lt;/strong> then&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Nonetheless, many words are easy to disambiguate, because their different tags aren’t equally likely. This idea suggests a simplistic baseline algorithm for part-of-speech tagging: &lt;strong>given an ambiguous word, choose the tag which is most frequent in the training corpus.&lt;/strong> 💡&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Most Frequent Class Baseline&lt;/strong>: Always compare a classifier against a baseline at least as good as the most frequent class baseline (assigning each token to the class it occurred in most often in the training set).&lt;/p>
&lt;/blockquote>
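&lt;p>The most frequent class baseline can be sketched in a few lines of Python (the toy corpus and function names are invented purely for illustration):&lt;/p>

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """For each word, remember the tag it received most often in training."""
    counts = defaultdict(Counter)
    tag_counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
            tag_counts[tag] += 1
    fallback = tag_counts.most_common(1)[0][0]  # overall most frequent tag, for unseen words
    word_tags = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return word_tags, fallback

def tag_baseline(words, word_tags, fallback):
    return [word_tags.get(w, fallback) for w in words]

# Toy corpus: "book" occurs twice as NN and once as VB,
# so the baseline will always tag it NN.
corpus = [
    [("book", "VB"), ("that", "DT"), ("flight", "NN")],
    [("hand", "VB"), ("me", "PRP"), ("that", "DT"), ("book", "NN")],
    [("the", "DT"), ("book", "NN")],
]
word_tags, fallback = train_baseline(corpus)
print(tag_baseline(["book", "that", "flight"], word_tags, fallback))  # → ['NN', 'DT', 'NN']
```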
&lt;p>&lt;strong>How good is this baseline?&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Standard way to measure the performance of part-of-speech taggers: &lt;strong>accuracy&lt;/strong>
&lt;ul>
&lt;li>the percentage of tags correctly labeled (matching human labels on a test set)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>train on the WSJ training corpus and test on sections 22-24 of the same corpus
&lt;ul>
&lt;li>the most-frequent-tag baseline achieves an accuracy of 92.34%.&lt;/li>
&lt;li>the state of the art in part-of-speech tagging on this dataset is around 97% tag accuracy&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>HMM Part-of-Speech Tagging</title><link>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/pos-tagging/hmm-pos-tagging/</link><pubDate>Mon, 03 Aug 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/pos-tagging/hmm-pos-tagging/</guid><description>&lt;p>&lt;strong>Sequence model/classifier&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Assign a label or class to each unit in a sequence&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Mapping a sequence of observations to a sequence of labels&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Hidden Markov Model (HMM)&lt;/strong> is a probabilistic sequence model&lt;/p>
&lt;ul>
&lt;li>
&lt;p>given a sequence of units (words, letters, morphemes, sentences, whatever)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>it computes a probability distribution over possible sequences of labels and chooses the best label sequence&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="markov-chains">Markov Chains&lt;/h2>
&lt;p>A &lt;strong>Markov chain&lt;/strong> is a model that tells us something about the probabilities of sequences of random variables, &lt;em>&lt;strong>states&lt;/strong>&lt;/em>, each of which can take on values from some set. These sets can be words, or tags, or symbols representing anything (E.g., &lt;em>the weather&lt;/em>).&lt;/p>
&lt;p>💡 A Markov chain makes a very strong assumption that&lt;/p>
&lt;ul>
&lt;li>
&lt;p>if we want to predict the future in the sequence, &lt;strong>all that matters is the current state&lt;/strong>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>All the states before the current state have NO impact on the future except via the current state.&lt;/p>
&lt;ul>
&lt;li>&lt;em>It’s as if to predict tomorrow’s weather you could examine today’s weather but you weren’t allowed to look at yesterday’s weather.&lt;/em>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>👆 &lt;strong>Markov assumption&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Consider a sequence of state variables $q_1, q_2, \dots, q_i$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>When predicting the future, the past does NOT matter, only the present
&lt;/p>
$$
P\left(q_{i}=a | q_{1} \ldots q_{i-1}\right)=P\left(q_{i}=a | q_{i-1}\right)
$$
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Markov chain&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>embodies the Markov assumption on the probabilities of this sequence&lt;/p>
&lt;/li>
&lt;li>
&lt;p>specified by the following components&lt;/p>
&lt;ul>
&lt;li>
&lt;p>$Q = q\_1, q\_2, \dots, q\_N$: a set of $N$ &lt;strong>states&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$A=a\_{11} a\_{12} \dots a\_{n 1} \dots a\_{n n}$: &lt;strong>transition probability matrix&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Each $a\_{ij}$ represents the probability of moving from state $i$ to state $j$, s.t.
$$
\sum\_{j=1}^{n} a\_{i j}=1 \quad \forall i
$$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>$\pi=\pi\_{1}, \pi\_{2}, \dots, \pi\_{N}$: an &lt;strong>initial probability distribution&lt;/strong> over states&lt;/p>
&lt;ul>
&lt;li>
&lt;p>$\pi_i$: probability that the Markov chain will start in state $i$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Some states $j$ may have $\pi_j = 0$ (meaning that they can NOT be initial states)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$\displaystyle\sum\_{i=1}^{n} \pi\_{i}=1$&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Example&lt;/strong>&lt;/p>
&lt;p>&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%E6%88%AA%E5%B1%8F2020-05-23%2018.01.29-20200803153051217.png" alt="截屏2020-05-23 18.01.29">&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Nodes: states&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Edges: transitions, with their probabilities&lt;/p>
&lt;ul>
&lt;li>The values of arcs leaving a given state must sum to 1.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Setting start distribution $\pi = [0.1, 0.7, 0.2]$ would mean a probability 0.7 of starting in state 2 (cold), probability 0.1 of starting in state 1 (hot), and 0.2 of starting in state 3 (warm)&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
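&lt;p>Under the Markov assumption, the probability of a whole state sequence is just the start probability of the first state times a chain of transition probabilities. A small sketch with the weather states above (only the start distribution $\pi$ comes from the example; the transition matrix values are made up):&lt;/p>

```python
import numpy as np

states = ["HOT", "COLD", "WARM"]
pi = np.array([0.1, 0.7, 0.2])   # start distribution from the example above
A = np.array([                   # hypothetical transition matrix; each row sums to 1
    [0.6, 0.1, 0.3],
    [0.1, 0.8, 0.1],
    [0.3, 0.1, 0.6],
])
assert np.allclose(A.sum(axis=1), 1.0)

def sequence_probability(seq):
    """P(q_1, ..., q_T) = pi[q_1] * prod_t A[q_{t-1}, q_t]"""
    idx = [states.index(s) for s in seq]
    p = pi[idx[0]]
    for prev, cur in zip(idx, idx[1:]):
        p *= A[prev, cur]
    return p

print(sequence_probability(["COLD", "HOT", "HOT"]))  # 0.7 * 0.1 * 0.6 = 0.042
```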
&lt;h2 id="hidden-markov-model-hmm">Hidden Markov Model (HMM)&lt;/h2>
&lt;p>A Markov chain is useful when we need to compute a probability for a sequence of observable events.&lt;/p>
&lt;p>In many cases, however, the events we are interested in are &lt;strong>hidden&lt;/strong>: we don’t observe them directly.&lt;/p>
&lt;ul>
&lt;li>We do NOT normally observe POS tags in a text&lt;/li>
&lt;li>Rather, we see words, and must infer the tags from the word sequence&lt;/li>
&lt;/ul>
&lt;p>$\Rightarrow$ We call the tags &lt;strong>hidden&lt;/strong> because they are NOT observed.&lt;/p>
&lt;p>&lt;strong>Hidden Markov Model (HMM)&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>allows us to talk about both &lt;em>observed&lt;/em> events (like words that we see in the input) and &lt;em>hidden&lt;/em> events (like part-of-speech tags) that we think of as causal factors in our probabilistic model&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Specified by:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>$Q = {q\_1, q\_2, \dots, q\_N}$: a set of $N$ &lt;strong>states&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$A=a\_{11} a\_{12} \dots a\_{n 1} \dots a\_{n n}$: &lt;strong>transition probability matrix&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Each $a_{ij}$ represents the probability of moving from state $i$ to state $j$, s.t.
$$
\sum\_{j=1}^{n} a\_{i j}=1 \quad \forall i
$$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>$O = {o\_1, o\_2, \dots, o\_T}$: a set of $T$ &lt;strong>observations&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Each one drawn from a vocabulary $V = {v_1, v_2, \dots, v_V}$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>$B=b\_{i}\left(o\_{t}\right)$: a sequence of &lt;strong>observation likelihoods&lt;/strong> (also called &lt;strong>emission probabilities&lt;/strong>)&lt;/p>
&lt;ul>
&lt;li>Each expressing the probability of an observation $o\_t$ being generated from a state $q_i$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>$\pi=\pi\_{1}, \pi\_{2}, \dots, \pi\_{N}$: an &lt;strong>initial probability distribution&lt;/strong> over states&lt;/p>
&lt;ul>
&lt;li>
&lt;p>$\pi\_i$: probability that the Markov chain will start in state $i$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Some states $j$ may have $\pi\_j = 0$ (meaning that they can NOT be initial states)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>$\displaystyle\sum_{i=1}^{n} \pi_{i}=1$&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>A first-order hidden Markov model instantiates two simplifying assumptions&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Markov assumption&lt;/strong>: the probability of a particular state depends only on the previous state
&lt;/p>
$$
P\left(q\_{i}=a | q\_{1} \ldots q\_{i-1}\right)=P\left(q\_{i}=a | q\_{i-1}\right)
$$
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Output independence&lt;/strong>: the probability of an output observation $o_i$ depends only on the state that produced the observation $q_i$ and NOT on any other states or any other observations
&lt;/p>
$$
P\left(o_{i} | q\_{1} \ldots q\_{i}, \ldots, q\_{T}, o\_{1}, \ldots, o\_{i}, \ldots, o\_{T}\right)=P\left(o\_{i} | q\_{i}\right)
$$
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="components-of-hmm-tagger">Components of HMM tagger&lt;/h2>
&lt;p>An HMM has two components, the $A$ and $B$ probabilities&lt;/p>
&lt;h3 id="the-a-probabilities">The A probabilities&lt;/h3>
&lt;p>The $A$ matrix contains the tag transition probabilities $P(t\_i | t\_{i-1})$ which represent the probability of a tag occurring given the previous tag.&lt;/p>
&lt;ul>
&lt;li>E.g., modal verbs like &lt;em>will&lt;/em> are very likely to be followed by a verb in the base form, a VB, like &lt;em>race&lt;/em>, so we expect this probability to be high.&lt;/li>
&lt;/ul>
&lt;p>We compute the maximum likelihood estimate of this transition probability by &lt;strong>counting&lt;/strong>, out of the times we see the first tag in a labeled corpus, how often the first tag is followed by the second
&lt;/p>
$$
P\left(t\_{i} | t\_{i-1}\right)=\frac{C\left(t\_{i-1}, t\_{i}\right)}{C\left(t\_{i-1}\right)}
$$
&lt;ul>
&lt;li>For example, in the WSJ corpus, MD occurs 13124 times, of which 10471 are followed by VB. The MLE estimate is therefore
$$
P(V B | M D)=\frac{C(M D, V B)}{C(M D)}=\frac{10471}{13124}=.80
$$&lt;/li>
&lt;/ul>
&lt;h3 id="the-b-probabilities">The B probabilities&lt;/h3>
&lt;p>The $B$ emission probabilities, $P(w_i|t_i)$, represent the probability, given a tag (say MD), that it will be associated with a given word (say &lt;em>will&lt;/em>). The MLE of the emission probability is
&lt;/p>
$$
P\left(w\_{i} | t\_{i}\right)=\frac{C\left(t\_{i}, w\_{i}\right)}{C\left(t\_{i}\right)}
$$
&lt;ul>
&lt;li>E.g.: Of the 13124 occurrences of MD in the WSJ corpus, it is associated with &lt;em>will&lt;/em> 4046 times
$$
P(w i l l | M D)=\frac{C(M D, w i l l)}{C(M D)}=\frac{4046}{13124}=.31
$$&lt;/li>
&lt;/ul>
&lt;p>Note that this likelihood term is NOT asking “which is the most likely tag for the word &lt;em>will&lt;/em>?” That would be the posterior $P(\text{MD}|\text{will})$. Instead, $P(\text{will}|\text{MD})$ answers the question “If we were going to generate a MD, how likely is it that this modal would be &lt;em>will&lt;/em>?”&lt;/p>
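&lt;p>Both MLEs are simple ratios of counts. With the WSJ counts quoted above, they can be checked directly (a sketch; the variable names are ours):&lt;/p>

```python
# Counts quoted above from the WSJ corpus
C_MD = 13124        # occurrences of the tag MD
C_MD_VB = 10471     # times MD is followed by VB
C_MD_will = 4046    # times MD is the tag of the word "will"

p_trans = C_MD_VB / C_MD    # transition MLE  P(VB | MD)
p_emit = C_MD_will / C_MD   # emission MLE    P(will | MD)

print(f"P(VB|MD)   = {p_trans:.2f}")   # → 0.80
print(f"P(will|MD) = {p_emit:.2f}")    # → 0.31
```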
&lt;p>Example: three states HMM POS tagger&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-23%2022.17.43.png" alt="截屏2020-05-23 22.17.43" style="zoom:50%;" />
&lt;h2 id="hmm-tagging-as-decoding">HMM tagging as decoding&lt;/h2>
&lt;p>&lt;strong>Decoding&lt;/strong>: Given as input an HMM $\lambda = (A, B)$ and a sequence of observations $O = o_1, o_2, \dots,o_T$, find the most probable sequence of states $Q = q\_1q\_2 \dots q\_T$&lt;/p>
&lt;p>🎯 For part-of-speech tagging, the goal of HMM decoding is to choose the tag sequence $t\_1^n$ that is most probable given the observation sequence of $n$ words $w\_1^n$&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200803153311832.png" alt="image-20200803153311832" style="zoom:18%;" />
&lt;p>HMM taggers make two simplifying assumptions:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>the probability of a word appearing depends only on its own tag and is independent of neighboring words and tags
&lt;/p>
$$
P\left(w\_{1}^{n} | t\_{1}^{n}\right) \approx \prod\_{i=1}^{n} P\left(w\_{i} | t\_{i}\right)
$$
&lt;/li>
&lt;li>
&lt;p>the probability of a tag is dependent only on the previous tag, rather than the entire tag sequence (the &lt;strong>bigram&lt;/strong> assumption)
&lt;/p>
$$
P\left(t\_{1}^{n}\right) \approx \prod\_{i=1}^{n} P\left(t\_{i} | t\_{i-1}\right)
$$
&lt;/li>
&lt;/ul>
&lt;p>Combining these two assumptions, the most probable tag sequence from a bigram tagger is:
&lt;/p>
$$
\hat{t}\_{1}^{n}=\underset{t\_{1}^{n}}{\operatorname{argmax}} P\left(t\_{1}^{n} | w\_{1}^{n}\right) \approx \underset{t\_{1}^{n}}{\operatorname{argmax}} \prod_{i=1}^{n} \overbrace{P\left(w\_{i} | t\_{i}\right)}^{\text {emission}} \cdot \overbrace{P\left(t\_{i} | t\_{i-1}\right)}^{\text{transition }}
$$
&lt;h2 id="the-viterbi-algorithm">The Viterbi Algorithm&lt;/h2>
&lt;p>The Viterbi algorithm:&lt;/p>
&lt;ul>
&lt;li>Decoding algorithm for HMMs&lt;/li>
&lt;li>An instance of dynamic programming&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-24%2011.45.05.png" alt="截屏2020-05-24 11.45.05" style="zoom:50%;" />
&lt;p>The Viterbi algorithm first sets up a probability matrix or &lt;strong>lattice&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>One column for each observation $o\_t$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>One row for each state in the state graph&lt;/p>
&lt;p>$\rightarrow$ Each column has a cell for each state $q\_i$ in the single combined automaton&lt;/p>
&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-24%2011.49.16.png" alt="截屏2020-05-24 11.49.16" style="zoom:40%;" />
&lt;p>Each cell of the lattice $v\_t(j)$&lt;/p>
&lt;ul>
&lt;li>
&lt;p>represents the probability that the HMM is in state $j$ after seeing the first $t$ observations and passing through the most probable state sequence $q\_1, \dots, q\_{t-1}$, given the HMM $\lambda$&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The value of each cell $v\_t(j)$ is computed by recursively taking &lt;strong>the most probable&lt;/strong> path that could lead us to this cell
&lt;/p>
$$
v\_{t}(j)=\max \_{q_{1}, \ldots, q\_{t-1}} P\left(q\_{1} \ldots q\_{t-1}, o\_{1}, o\_{2} \ldots o\_{t}, q\_{t}=j | \lambda\right)
$$
&lt;ul>
&lt;li>
&lt;p>Represent the most probable path by taking the maximum over all possible previous state sequences $\underset{{q\_{1}, \ldots, q\_{t-1}}}{\max}$&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Viterbi fills each cell recursively (like other dynamic programming algorithms)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Given that we had already computed the probability of being in every state at time $t-1$, we compute the Viterbi probability by taking the most probable of the extensions of the paths that lead to the current cell.&lt;/p>
&lt;p>For a given state $q_j$ at time $t$, the value $v_t(j)$ is computed as
&lt;/p>
$$
v\_{t}(j)=\max \_{i=1}^{N} v\_{t-1}(i) a\_{i j} b\_{j}\left(o\_{t}\right)
$$
&lt;ul>
&lt;li>$v\_{t-1}(i)$: the &lt;strong>previous Viterbi path&lt;/strong> probability from the previous time step&lt;/li>
&lt;li>$a\_{ij}$: the &lt;strong>transition probability&lt;/strong> from previous state $q_i$ to current state $q_j$&lt;/li>
&lt;li>$b\_j(o\_t)$: the &lt;strong>state observation likelihood&lt;/strong> of the observation symbol $o_t$ given the current state $j$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
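&lt;p>The recursion above translates almost line for line into code. A minimal NumPy sketch (the two-state HMM at the bottom and all of its numbers are invented for illustration, not taken from the text):&lt;/p>

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Return the most probable state path for observation indices `obs`.

    pi: (N,)   start probabilities
    A:  (N, N) transition probabilities, A[i, j] = P(j | i)
    B:  (N, V) emission probabilities,   B[j, o] = P(o | j)
    """
    N, T = A.shape[0], len(obs)
    v = np.zeros((T, N))                # the lattice of Viterbi probabilities
    back = np.zeros((T, N), dtype=int)  # backpointers
    v[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for j in range(N):
            # max over all previous states of (path prob * transition * emission)
            scores = v[t - 1] * A[:, j] * B[j, obs[t]]
            back[t, j] = scores.argmax()
            v[t, j] = scores.max()
    # Follow backpointers from the best final state
    path = [int(v[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(v[-1].max())

# Hypothetical 2-state HMM: states 0=HOT, 1=COLD; observation symbols 0, 1, 2
pi = np.array([0.8, 0.2])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4],
              [0.5, 0.4, 0.1]])
path, prob = viterbi([2, 0, 2], pi, A, B)
print(path, prob)  # → [0, 0, 0] 0.012544
```

The backpointers record, for each cell, which previous state achieved the max, so the best path can be recovered after the forward pass.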
&lt;h3 id="example-1">Example 1&lt;/h3>
&lt;p>Tag the sentence &amp;ldquo;&lt;em>Janet will back the bill&lt;/em>&amp;rdquo;&lt;/p>
&lt;p>🎯 Goal: correct series of tags (&lt;strong>Janet/NNP will/MD back/VB the/DT bill/NN&lt;/strong>)&lt;/p>
&lt;p>The HMM is defined by two tables&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-24%2012.06.37.png" alt="截屏2020-05-24 12.06.37" style="zoom:50%;" />
&lt;ul>
&lt;li>👆 Lists the $a\_{ij}$ probabilities for transitioning between the hidden states (POS tags)&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-24%2012.08.01.png" alt="截屏2020-05-24 12.08.01" style="zoom:50%;" />
&lt;ul>
&lt;li>👆 Expresses the $b\_i(o\_t)$ probabilities, the &lt;em>observation&lt;/em> likelihoods of words given tags
&lt;ul>
&lt;li>This table is (slightly simplified) from counts in WSJ corpus&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Computation:&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-24%2012.10.53.png" alt="截屏2020-05-24 12.10.53" style="zoom:40%;" />
&lt;ul>
&lt;li>There are $T=5$ columns, one for each observed word&lt;/li>
&lt;li>begin in column 1 (for the word &lt;em>Janet&lt;/em>) by setting the Viterbi value in each cell to the product of
&lt;ul>
&lt;li>the $\pi$ transition probability (the start probability for that state $i$, which we get from the &amp;lt;s&amp;gt; entry), and&lt;/li>
&lt;li>the observation likelihood of the word &lt;em>Janet&lt;/em> given the tag for that cell
&lt;ul>
&lt;li>Most of the cells in the column are zero since the word &lt;em>Janet&lt;/em> cannot be any of those tags (See Figure 8.8 above)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Next, each cell in the &lt;em>will&lt;/em> column gets updated
&lt;ul>
&lt;li>For each state, we compute the value $viterbi[s, t]$ by taking the maximum over the extensions of all the paths from the previous column that lead to the current cell&lt;/li>
&lt;li>Each cell gets the max of the 7 values from the previous column, multiplied by the appropriate transition probability
&lt;ul>
&lt;li>In this case, most of the values from the previous column are zero&lt;/li>
&lt;li>The remaining value is multiplied by the relevant observation probability, and the (trivial) max is taken. (In this case the final value, 2.772e-8, comes from the NNP state at the previous column.)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
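&lt;p>The column-by-column computation above can be sketched in a few lines of Python. This is a minimal illustration with a made-up two-tag model: the tagset and all probabilities below are toy numbers, not the WSJ-trained tables from the figures:&lt;/p>

```python
# Minimal Viterbi decoder for an HMM tagger.
# The tagset and all probabilities below are toy numbers for illustration.

def viterbi(words, tags, start_p, trans_p, emit_p):
    # best[t][s]: probability of the best tag path ending in state s at time t
    best = [{s: start_p[s] * emit_p[s].get(words[0], 0.0) for s in tags}]
    back = [{}]
    for t in range(1, len(words)):
        col, ptr = {}, {}
        for s in tags:
            # max over extensions of all paths from the previous column
            prev, p = max(
                ((r, best[-1][r] * trans_p[r][s]) for r in tags),
                key=lambda x: x[1],
            )
            col[s] = p * emit_p[s].get(words[t], 0.0)
            ptr[s] = prev
        best.append(col)
        back.append(ptr)
    # follow the backpointers from the best final state
    state = max(tags, key=lambda s: best[-1][s])
    path = [state]
    for ptr in reversed(back[1:]):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

tags = ["NNP", "MD"]
start_p = {"NNP": 0.8, "MD": 0.2}
trans_p = {"NNP": {"NNP": 0.1, "MD": 0.9}, "MD": {"NNP": 0.5, "MD": 0.5}}
emit_p = {"NNP": {"Janet": 0.9}, "MD": {"will": 0.8}}

print(viterbi(["Janet", "will"], tags, start_p, trans_p, emit_p))
# tags "Janet" as NNP and "will" as MD
```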
&lt;h3 id="example-2">Example 2&lt;/h3>
&lt;p>&lt;a href="https://www.cis.upenn.edu/~cis262/notes/Example-Viterbi-DNA.pdf">HMM : Viterbi algorithm - a toy example&lt;/a> 👍&lt;/p>
&lt;h2 id="extending-the-hmm-algorithm-to-trigrams">Extending the HMM Algorithm to Trigrams&lt;/h2>
&lt;p>In the simple HMM model described above, the probability of a tag depends only on the previous tag
&lt;/p>
$$
P\left(t_{1}^{n}\right) \approx \prod_{i=1}^{n} P\left(t_{i} | t_{i-1}\right)
$$
&lt;p>
In practice we use more of the history, letting the probability of a tag depend on the &lt;strong>two&lt;/strong> previous tags
&lt;/p>
$$
P\left(t_{1}^{n}\right) \approx \prod_{i=1}^{n} P\left(t_{i} | t_{i-1}, t_{i-2}\right)
$$
&lt;ul>
&lt;li>Small increase in performance (perhaps a half point)&lt;/li>
&lt;li>But conditioning on two previous tags instead of one requires a significant change to the Viterbi algorithm 🤪
&lt;ul>
&lt;li>For each cell, instead of taking a max over transitions from each cell in the previous column, we have to take a max over paths through the cells in the previous two columns&lt;/li>
&lt;li>thus considering $N^2$ rather than $N$ hidden states at every observation.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>In addition to increasing the context window, HMM taggers have a number of other advanced features&lt;/p>
&lt;ul>
&lt;li>
&lt;p>let the tagger know the location of the end of the sentence by &lt;span style="color:blue">adding dependence on an end-of-sequence marker &lt;/span>for $t_{n+1}$
&lt;/p>
$$
\hat{t}_{1}^{n}=\underset{t_{1}^{n}}{\operatorname{argmax}} P\left(t_{1}^{n} | w_{1}^{n}\right) \approx \underset{t_{1}^{n}}{\operatorname{argmax}}\left[\prod_{i=1}^{n} P\left(w_{i} | t_{i}\right) P\left(t_{i} | t_{i-1}, t_{i-2}\right)\right] \color{blue}{P\left(t_{n+1} | t_{n}\right)}
$$
&lt;ul>
&lt;li>Three of the tags ($t_{-1}, t_0, t_{n+1}$) used in the context will fall off the edge of the sentence, and hence will not match regular words
&lt;ul>
&lt;li>These tags can all be set to a single special ‘sentence boundary’ tag that is added to the tagset, which assumes sentence boundaries have already been marked.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>🔴 Problem with trigram taggers: &lt;span style="color:red">data sparsity&lt;/span>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Any particular sequence of tags $t_{i-2}, t_{i-1}, t_{i}$ that occurs in the test set may simply never have occurred in the training set.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Therefore we can NOT compute the tag trigram probability just by the maximum likelihood estimate from counts, following
&lt;/p>
$$
P\left(t_{i} | t_{i-1}, t_{i-2}\right)=\frac{C\left(t_{i-2}, t_{i-1}, t_{i}\right)}{C\left(t_{i-2}, t_{i-1}\right)}
$$
&lt;ul>
&lt;li>Many of these counts will be zero in any training set, and we will incorrectly predict that a given tag sequence will never occur!&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>We need a way to estimate $P(t_i|t_{i-1}, t_{i-2})$ even if the sequence $t_{i-2}, t_{i-1}, t_i$ never occurs in the training data&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Standard approach: estimate the probability by combining more robust, but weaker estimators.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>E.g., if we’ve never seen the tag sequence PRP VB TO, and so can’t compute $P(\mathrm{TO} | \mathrm{PRP}, \mathrm{VB})$ from this frequency, we still could rely on the bigram probability $P(\mathrm{TO} | \mathrm{VB})$, or even the unigram probability $P(\mathrm{TO})$.&lt;/p>
&lt;p>The maximum likelihood estimation of each of these probabilities can be computed from a corpus with the following counts:
&lt;/p>
$$
\begin{aligned}
\text { Trigrams } \qquad \hat{P}\left(t_{i} | t_{i-1}, t_{i-2}\right) &amp;=\frac{C\left(t_{i-2}, t_{i-1}, t_{i}\right)}{C\left(t_{i-2}, t_{i-1}\right)} \\\\
\text { Bigrams } \qquad \hat{P}\left(t_{i} | t_{i-1}\right) &amp;=\frac{C\left(t_{i-1}, t_{i}\right)}{C\left(t_{i-1}\right)} \\\\
\text { Unigrams } \qquad \hat{P}\left(t_{i}\right)&amp;=\frac{C\left(t_{i}\right)}{N}
\end{aligned}
$$
&lt;p>
We use &lt;strong>linear interpolation&lt;/strong> to combine these three estimators: we estimate the trigram probability $P\left(t_{i} | t_{i-1}, t_{i-2}\right)$ by a &lt;strong>weighted sum&lt;/strong> of the unigram, bigram, and trigram probabilities
&lt;/p>
$$
P\left(t_{i} | t_{i-1}, t_{i-2}\right)=\lambda_{3} \hat{P}\left(t_{i} | t_{i-1}, t_{i-2}\right)+\lambda_{2} \hat{P}\left(t_{i} | t_{i-1}\right)+\lambda_{1} \hat{P}\left(t_{i}\right)
$$
&lt;ul>
&lt;li>$\lambda_1 + \lambda_2 + \lambda_3 = 1$&lt;/li>
&lt;li>$\lambda$s are set by &lt;strong>deleted interpolation&lt;/strong>
&lt;ul>
&lt;li>successively delete each trigram from the training corpus and choose the λs so as to maximize the likelihood of the rest of the corpus.
&lt;ul>
&lt;li>helps to set the λs in such a way as to generalize to unseen data and not overfit 👍&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-26%2011.34.37.png" alt="截屏2020-05-26 11.34.37" style="zoom: 70%;" />
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
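&lt;p>The deleted-interpolation procedure for setting the $\lambda$s can be sketched as follows. This is a hedged sketch of the Brants (2000)-style algorithm: the function name and the input format (a list of tag sequences) are assumptions for illustration:&lt;/p>

```python
# Deleted interpolation for setting the three lambdas of the
# trigram/bigram/unigram mixture. A minimal sketch: `tag_seqs` is assumed
# to be a list of tag sequences from a training corpus.
from collections import Counter

def deleted_interpolation(tag_seqs):
    uni, bi, tri = Counter(), Counter(), Counter()
    for seq in tag_seqs:
        uni.update(seq)
        bi.update(zip(seq, seq[1:]))
        tri.update(zip(seq, seq[1:], seq[2:]))
    n = sum(uni.values())
    lams = [0.0, 0.0, 0.0]  # lambda_1 (unigram), lambda_2 (bigram), lambda_3 (trigram)
    for (t1, t2, t3), c in tri.items():
        # "delete" this trigram: subtract 1 from each count before comparing
        cases = [
            (uni[t3] - 1) / (n - 1) if n > 1 else 0.0,
            (bi[(t2, t3)] - 1) / (uni[t2] - 1) if uni[t2] > 1 else 0.0,
            (c - 1) / (bi[(t1, t2)] - 1) if bi[(t1, t2)] > 1 else 0.0,
        ]
        # credit the trigram's count to the estimator that wins without it
        lams[cases.index(max(cases))] += c
    total = sum(lams)
    return [l / total for l in lams]
```

Because each trigram votes with the corpus counts it would have had if it were deleted, the weights generalize to unseen data instead of overfitting the training counts.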
&lt;h2 id="beam-search">Beam Search&lt;/h2>
&lt;p>&lt;span style="color:red">Problem of vanilla Viterbi algorithms&lt;/span>&lt;/p>
&lt;ul>
&lt;li>Slow, when the number of states grows very large&lt;/li>
&lt;li>Complexity: $O(N^2T)$
&lt;ul>
&lt;li>$N$: Number of states
&lt;ul>
&lt;li>Can be large for trigram taggers
&lt;ul>
&lt;li>E.g.: Considering every previous pair of the 45 tags results in $45^3=91125$ computations per column!!! 😱&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Can be even larger for other applications of Viterbi (E.g., decoding in neural networks)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>🔧 Common solution: &lt;strong>beam search decoding&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>💡 Instead of keeping the entire column of states at each time point $t$, we just keep the best few hypotheses at that point.&lt;/p>
&lt;p>At time $t$:&lt;/p>
&lt;ol>
&lt;li>Compute the Viterbi score for each of the $N$ cells&lt;/li>
&lt;li>Sort the scores&lt;/li>
&lt;li>Keep only the best-scoring states. The rest are pruned out and NOT continued forward to time $t+1$&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>
&lt;p>Implementation&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Keep a fixed number of states (beam width) $\beta$ instead of all $N$ current states&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Alternatively $\beta$ can be modeled as a fixed percentage of the $N$ states, or as a probability threshold&lt;/p>
&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2020-05-26%2011.54.17-20200803152950300-20200803153020039.png" alt="img" style="zoom:70%;" />
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
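&lt;p>The pruning step can be sketched as a small helper that keeps only the $\beta$ best-scoring states of a Viterbi column. A minimal sketch; the state names and scores below are made-up values:&lt;/p>

```python
# Beam pruning for one Viterbi column: keep only the top-beta states;
# the rest are not extended to time t+1. Scores here are made-up values.
import heapq

def prune(scores, beta):
    """Return the beta best-scoring states of a column."""
    keep = heapq.nlargest(beta, scores, key=scores.get)
    return {s: scores[s] for s in keep}

col = {"NNP": 8.8e-6, "MD": 3.0e-8, "VB": 2.2e-13, "NN": 1.0e-10}
print(prune(col, beta=2))  # only NNP and MD survive to time t+1
```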
&lt;h2 id="unknown-words">Unknown Words&lt;/h2>
&lt;p>One useful feature for distinguishing parts of speech is &lt;strong>word shape&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>words starting with capital letters are likely to be proper nouns (NNP).&lt;/li>
&lt;/ul>
&lt;p>Strongest source of information for guessing the part-of-speech of unknown words: &lt;strong>morphology&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Words ending in &lt;em>-s&lt;/em> $\Rightarrow$ plural nouns (NNS)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>words ending with &lt;em>-ed&lt;/em> $\Rightarrow$ past participles (VBN)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>words ending with &lt;em>-able&lt;/em> $\Rightarrow$ adjectives (JJ)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&amp;hellip;&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>For each &lt;em>suffix&lt;/em> of up to 10 letters, we store the statistics of the tags it was associated with in training. We thus compute, for each suffix of length $i$, the probability of the tag $t_i$ given the suffix letters
&lt;/p>
$$
P\left(t_{i} | l_{n-i+1} \ldots l_{n}\right)
$$
&lt;p>&lt;strong>Back-off&lt;/strong> is used to smooth these probabilities with successively shorter suffixes.&lt;/p>
&lt;p>Since unknown words are unlikely to be closed-class words (like prepositions), suffix probabilities can be computed only for words whose training set frequency is $\leq 10$, or only for open-class words.&lt;/p>
&lt;p>As $P\left(t_{i} | l_{n-i+1} \ldots l_{n}\right)$ gives a posterior estimate $p(t_i|w_i)$, we can compute the likelihood $p(w_i|t_i)$ that HMMs require by using Bayesian inversion (i.e., using Bayes’ rule and computation of the two priors $P(t_i)$ and $P(t_i|l_{n-i+1}\dots l_n)$).&lt;/p>
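&lt;p>A suffix model of this kind can be sketched as below. This is a simplified illustration of the idea, not the exact published algorithm: the smoothing weight $\theta$, the function names, and the data format are assumptions:&lt;/p>

```python
# Suffix-based tag probabilities for unknown words, smoothed by backing
# off over successively shorter suffixes. Simplified sketch: the smoothing
# weight `theta` and the input format are assumptions for illustration.
from collections import Counter, defaultdict

def train_suffix_model(tagged_words, max_len=10):
    counts = defaultdict(Counter)  # suffix -> Counter of tags seen with it
    for word, tag in tagged_words:
        for i in range(1, min(max_len, len(word)) + 1):
            counts[word[-i:]][tag] += 1
    return counts

def p_tag_given_suffix(word, tag, counts, theta=0.3):
    # Start from the shortest suffix and interpolate in longer ones, so
    # longer (more specific) suffixes dominate when they occurred in training.
    p = 0.0
    for i in range(1, min(10, len(word)) + 1):
        c = counts.get(word[-i:])
        if c is None:
            break  # this longer suffix was never seen: back off stops here
        p = (c[tag] / sum(c.values()) + theta * p) / (1 + theta)
    return p

model = train_suffix_model([("walked", "VBN"), ("talked", "VBN"), ("table", "NN")])
print(p_tag_given_suffix("jumped", "VBN", model))  # high: "-ed" words were VBN
```

The returned value is the posterior $p(t_i|w_i)$; the likelihood $p(w_i|t_i)$ that the HMM needs then follows from Bayes’ rule, dividing by the tag prior $P(t_i)$ (up to a factor that does not depend on the tag).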
&lt;h2 id="reference">Reference&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="https://web.stanford.edu/~jurafsky/slp3/8.pdf">Speech and Language Processing, ch8&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Viterbi:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://www.zhihu.com/question/20136144">https://www.zhihu.com/question/20136144&lt;/a>&lt;/li>
&lt;li>Example: &lt;a href="https://www.cis.upenn.edu/~cis262/notes/Example-Viterbi-DNA.pdf">HMM : Viterbi algorithm - a toy example&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item></channel></rss>