<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Information Extraction | Haobin Tan</title><link>https://haobin-tan.netlify.app/tags/information-extraction/</link><atom:link href="https://haobin-tan.netlify.app/tags/information-extraction/index.xml" rel="self" type="application/rss+xml"/><description>Information Extraction</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Tue, 15 Sep 2020 00:00:00 +0000</lastBuildDate><image><url>https://haobin-tan.netlify.app/media/icon_hu7d15bc7db65c8eaf7a4f66f5447d0b42_15095_512x512_fill_lanczos_center_3.png</url><title>Information Extraction</title><link>https://haobin-tan.netlify.app/tags/information-extraction/</link></image><item><title>Named-Entity Recognition</title><link>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/information-extraction/named-entity-recognition/</link><pubDate>Tue, 15 Sep 2020 00:00:00 +0000</pubDate><guid>https://haobin-tan.netlify.app/docs/ai/natural-language-processing/information-extraction/named-entity-recognition/</guid><description>&lt;h2 id="what-is-ner">What is NER?&lt;/h2>
&lt;h3 id="named-entity">&lt;strong>Named entity&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>anything that can be referred to with a proper name: a person, a location, an organization.&lt;/li>
&lt;li>commonly extended to include things that aren’t entities per se, including dates, times, and other kinds of &lt;strong>temporal expressions&lt;/strong>, and even numerical expressions like prices.&lt;/li>
&lt;/ul>
&lt;p>Sample text with the named entities marked:&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-09-16%2011.12.14.png" alt="截屏2020-09-16 11.12.14" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The text contains 13 mentions of named entities including&lt;/p>
&lt;ul>
&lt;li>5 organizations&lt;/li>
&lt;li>4 locations&lt;/li>
&lt;li>2 times&lt;/li>
&lt;li>1 person&lt;/li>
&lt;li>1 mention of money.&lt;/li>
&lt;/ul>
&lt;h3 id="typical-generic-named-entity-types">Typical generic named entity types&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-09-16%2011.14.19.png" alt="截屏2020-09-16 11.14.19" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="named-entity-recognition">Named Entity Recognition&lt;/h3>
&lt;p>&lt;strong>Named Entity Recognition&lt;/strong>: find spans of text that constitute proper names and then classify the type of each entity&lt;/p>
&lt;p>&lt;strong>Difficulty&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Ambiguity of segmentation&lt;/strong>: we need to decide what’s an entity and what isn’t, and where the boundaries are.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Type ambiguity&lt;/strong>: Some named entities can have many types (cross-type confusion)&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Example&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-09-16%2011.23.50.png" alt="截屏2020-09-16 11.23.50" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="ner-as-sequence-labeling">NER as Sequence Labeling&lt;/h2>
&lt;p>The standard algorithm for named entity recognition treats it as a &lt;strong>word-by-word sequence labeling task&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>The assigned tags capture both the boundary and the type.&lt;/li>
&lt;/ul>
&lt;p>A sequence classifier like an MEMM/CRF, a bi-LSTM, or a transformer is trained to label the tokens in a text with tags that indicate the presence of particular kinds of named entities.&lt;/p>
&lt;p>Consider the following simplified excerpt:&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-09-16%2011.26.42.png" alt="截屏2020-09-16 11.26.42" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>We represent the excerpt with &lt;strong>IOB&lt;/strong> tagging&lt;/p>
&lt;blockquote>
&lt;ul>
&lt;li>
&lt;p>In &lt;strong>IOB&lt;/strong> tagging we introduce a tag for the &lt;strong>beginning (B)&lt;/strong> and &lt;strong>inside (I)&lt;/strong> of each entity type, and one for tokens &lt;strong>outside (O)&lt;/strong> any entity. This gives 2&lt;em>n&lt;/em> + 1 tags, where &lt;em>n&lt;/em> is the number of entity types.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>IO&lt;/strong> tagging loses some information by eliminating the B tag. Without the B tag, IO tagging cannot distinguish between two entities of the same type that are directly adjacent. Since this situation doesn’t arise very often (usually there is at least some punctuation or other delimiter), IO tagging may be sufficient, and has the advantage of using only &lt;em>n&lt;/em> + 1 tags.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/blockquote>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-09-16%2011.29.13.png" alt="截屏2020-09-16 11.29.13" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
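&lt;p>For illustration, gold entity spans over a token list can be turned into IOB tags with a small helper. This is a sketch: the &lt;code>(start, end, type)&lt;/code> span encoding and the function name are assumptions, not a standard API.&lt;/p>

```python
def to_iob(tokens, spans):
    """Convert entity spans to IOB tags.

    spans: list of (start, end, type) over token indices, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = "B-" + etype        # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + etype        # subsequent tokens inside it
    return tags

tokens = ["United", "Airlines", "said", "Tim", "Wagner"]
print(to_iob(tokens, [(0, 2, "ORG"), (3, 5, "PER")]))
# → ['B-ORG', 'I-ORG', 'O', 'B-PER', 'I-PER']
```

&lt;p>With IO tagging the B- prefix would simply become I-, which is why two adjacent same-type entities merge.&lt;/p>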
&lt;h2 id="feature-based-algorithm-for-ner">Feature-based Algorithm for NER&lt;/h2>
&lt;p>💡 &lt;strong>Extract features and train an MEMM or CRF sequence model, as in POS tagging&lt;/strong>.&lt;/p>
&lt;p>Standard features:&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-09-16%2011.34.12.png" alt="截屏2020-09-16 11.34.12" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>&lt;strong>Word shape&lt;/strong> features are particularly important in the context of NER.&lt;/p>
&lt;p>Word shape:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>represent the abstract letter pattern of the word by mapping&lt;/p>
&lt;ul>
&lt;li>lower-case letters to ‘x’,&lt;/li>
&lt;li>upper-case to ‘X’,&lt;/li>
&lt;li>numbers to ’d’,&lt;/li>
&lt;li>and retaining punctuation&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Example&lt;/p>
&lt;ul>
&lt;li>&lt;code>I.M.F&lt;/code> &amp;ndash;&amp;gt; &lt;code>X.X.X&lt;/code>&lt;/li>
&lt;li>&lt;code>DC10-30&lt;/code> &amp;ndash;&amp;gt; &lt;code>XXdd-dd&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>A second, shorter class of word-shape features:&lt;/p>
&lt;ul>
&lt;li>Consecutive runs of the same character type are collapsed to a single symbol&lt;/li>
&lt;li>Example
&lt;ul>
&lt;li>&lt;code>I.M.F&lt;/code> &amp;ndash;&amp;gt; &lt;code>X.X.X&lt;/code>&lt;/li>
&lt;li>&lt;code>DC10-30&lt;/code> &amp;ndash;&amp;gt; &lt;code>Xd-d&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
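&lt;p>Both shape classes are easy to compute. A minimal sketch (the function names are my own, not from a standard library):&lt;/p>

```python
def word_shape(word):
    """Map lower-case to 'x', upper-case to 'X', digits to 'd'; keep punctuation."""
    return "".join(
        "x" if ch.islower() else "X" if ch.isupper() else "d" if ch.isdigit() else ch
        for ch in word
    )

def short_word_shape(word):
    """Shorter variant: collapse consecutive runs of the same shape character."""
    shape = word_shape(word)
    out = []
    for ch in shape:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)

print(word_shape("DC10-30"))        # → XXdd-dd
print(short_word_shape("DC10-30"))  # → Xd-d
```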
&lt;p>For example, the named entity token &lt;em>L’Occitane&lt;/em> would generate the following non-zero-valued features:&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-09-16%2011.50.20.png" alt="截屏2020-09-16 11.50.20" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">Feature effectiveness depends on the application, genre, media, and language.&lt;/span>
&lt;/div>
&lt;p>The following figure illustrates the result of adding part-of-speech tags, syntactic base-phrase chunk tags, and some shape information to our earlier example.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-09-16%2011.55.04.png" alt="截屏2020-09-16 11.55.04" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The following figure illustrates the operation of such a sequence labeler at the point where the token &lt;code>Corp.&lt;/code> is next to be labeled. If we assume a context window that includes the two preceding and following words, then the features available to the classifier are those shown in the boxed area.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-09-16%2011.57.49.png" alt="截屏2020-09-16 11.57.49" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
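&lt;p>Such windowed features are commonly collected into one dictionary per token, the format CRF toolkits like sklearn-crfsuite expect. A sketch with a ±2-word context window (the feature names here are illustrative assumptions):&lt;/p>

```python
def token_features(tokens, i, window=2):
    """Feature dictionary for token i, with words from a +/-window context."""
    def shape(w):  # same idea as the word-shape feature above, inlined
        return "".join(
            "x" if c.islower() else "X" if c.isupper() else "d" if c.isdigit() else c
            for c in w
        )
    feats = {
        "word": tokens[i].lower(),
        "shape": shape(tokens[i]),
        "is_title": tokens[i].istitle(),
        "has_digit": any(c.isdigit() for c in tokens[i]),
    }
    for off in range(-window, window + 1):
        if off == 0:
            continue
        j = i + off
        # pad positions outside the sentence with a sentinel value
        feats[f"word[{off:+d}]"] = tokens[j].lower() if 0 <= j < len(tokens) else "PAD"
    return feats

print(token_features(["L'Occitane", "Corp.", "said"], 1))
```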
&lt;h2 id="neural-algorithm-for-ner">Neural Algorithm for NER&lt;/h2>
&lt;p>The standard neural algorithm for NER is based on the &lt;strong>bi-LSTM&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Word and character embeddings are computed for input word $w\_i$&lt;/li>
&lt;li>These are passed through a left-to-right LSTM and a right-to-left LSTM, whose outputs are concatenated (or otherwise combined) to produce a single output layer at position $i$.&lt;/li>
&lt;li>A CRF layer is normally used on top of the bi-LSTM output, and the Viterbi decoding algorithm is used to decode&lt;/li>
&lt;/ul>
&lt;p>The following figure shows a sketch of the algorithm:&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-09-16%2012.00.05.png" alt="截屏2020-09-16 12.00.05" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
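&lt;p>Decoding with the CRF layer uses the Viterbi algorithm over per-token tag scores plus tag-to-tag transition scores. A minimal pure-Python sketch, with the emission scores standing in for the bi-LSTM outputs and the transition matrix for the CRF parameters (both would be learned in practice):&lt;/p>

```python
def viterbi_decode(emissions, transitions):
    """Best tag sequence under emission + transition scores.

    emissions[t][k]: score of tag k at position t
    transitions[i][j]: score of moving from tag i to tag j
    """
    K = len(emissions[0])
    score = list(emissions[0])          # best score ending in each tag so far
    backptr = []                        # backpointers for path recovery
    for t in range(1, len(emissions)):
        ptrs, new_score = [], []
        for j in range(K):
            best_i = max(range(K), key=lambda i: score[i] + transitions[i][j])
            ptrs.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
        backptr.append(ptrs)
        score = new_score
    best = max(range(K), key=lambda k: score[k])
    path = [best]
    for ptrs in reversed(backptr):      # follow backpointers right to left
        path.append(ptrs[path[-1]])
    return path[::-1]
```

&lt;p>The transition scores are what let the CRF layer forbid invalid tag sequences, e.g. an I-PER directly after an O, by assigning them very low scores.&lt;/p>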
&lt;h2 id="rule-based-ner">Rule-based NER&lt;/h2>
&lt;p>Commercial approaches to NER are often based on &lt;strong>pragmatic combinations of lists and rules&lt;/strong>, with some smaller amount of supervised machine learning.&lt;/p>
&lt;p>One common approach is to &lt;strong>make repeated rule-based passes over a text, allowing the results of one pass to influence the next&lt;/strong>. The stages typically first involve the use of rules that have extremely high precision but low recall. Subsequent stages employ more error-prone statistical methods that take the output of the first pass into account.&lt;/p>
&lt;ol>
&lt;li>First, use high-precision rules to tag unambiguous entity mentions.&lt;/li>
&lt;li>Then, search for substring matches of the previously detected names.&lt;/li>
&lt;li>Consult application-specific name lists to identify likely named entity mentions from the given domain.&lt;/li>
&lt;li>Finally, apply probabilistic sequence labeling techniques that make use of the tags from previous stages as additional features.&lt;/li>
&lt;/ol>
&lt;p>The intuition behind this staged approach is twofold.&lt;/p>
&lt;ul>
&lt;li>First, some of the entity mentions in a text will be more clearly indicative of a given entity’s class than others.&lt;/li>
&lt;li>Second, once an unambiguous entity mention is introduced into a text, it is likely that subsequent shortened versions will refer to the same entity (and thus the same type of entity).&lt;/li>
&lt;/ul>
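&lt;p>The first two stages might be sketched as follows. The rule and the corporate-suffix list are illustrative assumptions, not a real system's rule set:&lt;/p>

```python
import re

# Stage 1 (illustrative): a high-precision, low-recall rule -- capitalized
# token sequences ending in a corporate designator are almost surely ORGs.
CORP_RULE = re.compile(r"(?:[A-Z][\w&.]*\s)+(?:Corp\.|Inc\.|Ltd\.)")

def first_pass(text):
    return [m.group().strip() for m in CORP_RULE.finditer(text)]

def second_pass(text, entities):
    """Stage 2: shortened substring mentions of names found in stage 1
    (e.g. a later bare 'AMR' after 'AMR Corp.') inherit the same type."""
    found = []
    for ent in entities:
        head = ent.split()[0]
        # match the head word on its own, not the full designated name again
        pattern = r"\b" + re.escape(head) + r"\b(?!\s*(?:Corp|Inc|Ltd)\.)"
        found += [m.group() for m in re.finditer(pattern, text)]
    return found

text = "AMR Corp. reported a loss. Later, AMR said it would cut flights."
print(first_pass(text))                    # → ['AMR Corp.']
print(second_pass(text, first_pass(text))) # → ['AMR']
```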
&lt;h2 id="evaluation-of-ner">Evaluation of NER&lt;/h2>
&lt;p>The familiar metrics of &lt;strong>recall&lt;/strong>, &lt;strong>precision&lt;/strong>, and &lt;strong>$F\_1$ measure&lt;/strong> are used to evaluate NER systems.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Recall&lt;/strong>: the ratio of the number of correctly labeled responses to the total that should have been labeled&lt;/li>
&lt;li>&lt;strong>Precision&lt;/strong>: ratio of the number of correctly labeled responses to the total labeled&lt;/li>
&lt;li>&lt;strong>&lt;em>F&lt;/em>-measure&lt;/strong>: the harmonic mean of the two.&lt;/li>
&lt;/ul>
&lt;div class="flex px-4 py-3 mb-6 rounded-md bg-primary-100 dark:bg-primary-900">
&lt;span class="pr-3 pt-1 text-primary-600 dark:text-primary-300">
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m11.25 11.25l.041-.02a.75.75 0 0 1 1.063.852l-.708 2.836a.75.75 0 0 0 1.063.853l.041-.021M21 12a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9-3.75h.008v.008H12z"/>&lt;/svg>
&lt;/span>
&lt;span class="dark:text-neutral-300">More see
.&lt;/span>
&lt;/div>
&lt;p>For named entities, the &lt;em>entity&lt;/em> rather than the word is the unit of response.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Example:&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-100" >&lt;img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/%e6%88%aa%e5%b1%8f2020-09-16%2011.26.42-20200916121336851.png" alt="截屏2020-09-16 11.26.42" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The two entities &lt;code>Tim Wagner&lt;/code> and &lt;code>AMR Corp.&lt;/code> and the non-entity &lt;code>said&lt;/code> would each count as a single response.&lt;/p>
&lt;/li>
&lt;/ul>
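&lt;p>Entity-level evaluation can be sketched by comparing gold and predicted span sets, where each &lt;code>(start, end, type)&lt;/code> triple counts as one response (the encoding is an assumption for illustration):&lt;/p>

```python
def entity_prf(gold, pred):
    """Entity-level precision, recall, and F1 over (start, end, type) spans."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                      # exact span + type matches only
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {(0, 2, "ORG"), (3, 5, "PER")}
pred = {(0, 2, "ORG"), (4, 5, "PER")}  # second span has a boundary error
print(entity_prf(gold, pred))  # → (0.5, 0.5, 0.5)
```

&lt;p>Note how the boundary error costs both a false positive and a false negative, which is exactly the evaluation problem discussed next.&lt;/p>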
&lt;h3 id="problem-of-evaluation">Problem of Evaluation&lt;/h3>
&lt;ul>
&lt;li>For example, a system that labeled &lt;code>American&lt;/code> but not &lt;code>American Airlines&lt;/code> as an organization would cause two errors: a false positive for O and a false negative for I-ORG.&lt;/li>
&lt;li>Using entities as the unit of response but words as the unit of training means that there is a mismatch between the training and test conditions.&lt;/li>
&lt;/ul></description></item></channel></rss>