Named Entity Recognition

Introduction

Definition

Named Entity: an entity that is referred to by a proper name (e.g. a person, organization, or location)

Named Entity Recognition: Find and classify named entities in text


Why useful?

  • Create indices & hyperlinks

  • Information extraction

    • Establish relationships between named entities, build knowledge base
  • Question answering: answers are often NEs

  • Machine translation: NEs require special care

    • NEs often unknown words, usually passed through without translation.

Why difficult?

  • World knowledge

  • Non-local decisions

  • Domain specificity

  • Labeled data is very expensive

Label Representation

  • IO

    • I: Inside
    • O: Outside (indicates that a token belongs to no chunk)
  • BIO

    • B: Begin

      Wiki

      The IOB format (short for inside, outside, beginning), a synonym for BIO format, is a common tagging format for tagging tokens in a chunking task in computational linguistics.

      • B-prefix before a tag: indicates that the tag is the beginning of a chunk
      • I-prefix before a tag: indicates that the tag is the inside of a chunk
      • O tag: indicates that a token belongs to NO chunk
  • BIOES

    • E: End (last token of an entity)
    • S: Single-token entity
  • BILOU

    • L: Last token of an entity
    • U: Unit-length (single-token) entity

Example:

Fred showed Sue Mengqiu Huang's new painting
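Under BIO, the two adjacent person entities β€œSue” and β€œMengqiu Huang” stay separate because the second one starts with a fresh B- tag; plain IO tags could not mark that boundary. Below is a minimal Python sketch of this labeling (the tokenization and the PER tags are assumptions for illustration) together with a helper that recovers entity spans from BIO tags:

```python
# Hypothetical BIO labeling of the example sentence (PER tags assumed).
tokens = ["Fred", "showed", "Sue", "Mengqiu", "Huang", "'s", "new", "painting"]
bio    = ["B-PER", "O", "B-PER", "B-PER", "I-PER", "O", "O", "O"]

def bio_to_spans(tokens, tags):
    """Recover (label, start, end) entity spans from BIO tags (end exclusive)."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        # Close the current span on O, on a new B-, or on a label change.
        if tag.startswith("B-") or tag == "O" or (label and tag[2:] != label):
            if start is not None:
                spans.append((label, start, i))
                start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:   # tolerate stray I- tags
            start, label = i, tag[2:]
    if start is not None:
        spans.append((label, start, len(tags)))
    return spans

print(bio_to_spans(tokens, bio))
# [('PER', 0, 1), ('PER', 2, 3), ('PER', 3, 5)]  -> Fred, Sue, Mengqiu Huang
```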

Data

  • CoNLL03 shared task data
  • MUC7 dataset
  • Guideline examples for special cases:
    • Tokenization
    • Elision

Evaluation

  • Precision and Recall

    $$\text{Precision} = \frac{\#\text{ correct labels}}{\#\text{ hypothesized labels}} = \frac{TP}{TP + FP}$$

    $$\text{Recall} = \frac{\#\text{ correct labels}}{\#\text{ reference labels}} = \frac{TP}{TP + FN}$$
    • Phrase-level counting

      • System 1:

        • β€œ$200,000,000” is correctly recognized as an NE β‡’ TP = 1
        • β€œFirst Bank of Chicago” is incorrectly recognized as non-NE (i.e., O) β‡’ FN = 1

        Therefore:

        • $\text{Precision} = \frac{1}{1 + 0} = 1$

        • $\text{Recall} = \frac{1}{1 + 1} = \frac{1}{2}$
      • System 2:

        • β€œ$200,000,000” is correctly recognized as an NE β‡’ TP = 1

        • For β€œFirst Bank of Chicago”:

          | Word            | Actual label | Predicted label |
          |-----------------|--------------|-----------------|
          | First           | ORG          | O               |
          | Bank of Chicago | ORG          | ORG             |

          There’s a boundary error (since we consider the whole phrase):

          • FN = 1
          • FP = 1

        Therefore:

        • $\text{Precision} = \frac{1}{1 + 1} = \frac{1}{2}$

        • $\text{Recall} = \frac{1}{1 + 1} = \frac{1}{2}$

      • Problems
        • Partial overlaps are punished
        • True negatives are ignored
    • Token-level counting


      At the token level, we consider these tokens: β€œFirst”, β€œBank”, β€œof”, β€œChicago”, and β€œ$200,000,000”.

      • System 1

        • β€œ$200,000,000” is correctly recognized as an NE β‡’ TP = 1
        • β€œFirst”, β€œBank”, β€œof”, β€œChicago” are incorrectly recognized as non-NE (i.e., O) β‡’ FN = 4

        Therefore:

        • $\text{Precision} = \frac{1}{1 + 0} = 1$

        • $\text{Recall} = \frac{1}{1 + 4} = \frac{1}{5}$

      • Partial overlaps rewarded!
      • But
        • longer entities weighted more strongly

        • True negatives still ignored πŸ€ͺ

  • $F_1$ score (harmonic mean of precision and recall)

    $$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$
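The arithmetic above can be reproduced with a small helper (the function name is mine); the counts are the TP/FP/FN values derived in the phrase- and token-level examples:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from phrase- or token-level counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(prf(tp=1, fp=0, fn=1))  # System 1, phrase-level: (1.0, 0.5, 0.667)
print(prf(tp=1, fp=1, fn=1))  # System 2, phrase-level: (0.5, 0.5, 0.5)
print(prf(tp=1, fp=0, fn=4))  # System 1, token-level:  (1.0, 0.2, 0.333)
```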

Text Representation

Local features

  • Previous two predictions (tri-gram feature)

    • $y_{i-1}$ and $y_{i-2}$
  • Current word $x_i$

  • Word type of $x_i$

    • all-capitalized, is-capitalized, all-digits, alphanumeric, …
  • Word shape (a small sketch follows after this list)

    • lower case - β€˜x’

    • upper case - β€˜X’

    • numbers - β€˜d’

    • retain punctuation


  • Word substrings

  • Tokens in window

  • Word shapes in window

  • …
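A minimal sketch of the word-shape feature mentioned above, assuming the common convention of keeping punctuation and not collapsing repeated characters (implementations differ on both points):

```python
def word_shape(token: str) -> str:
    """Map characters to 'X' (upper), 'x' (lower), 'd' (digit); keep punctuation."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)          # retain punctuation and other symbols
    return "".join(shape)

print(word_shape("Mengqiu"))       # Xxxxxxx
print(word_shape("$200,000,000"))  # $ddd,ddd,ddd
print(word_shape("I.B.M."))        # X.X.X.
```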

Non-local features

Identify tokens that should have the same label

Types:

  • Context aggregation

    • Derived from all words in the document

    • No dependencies on predictions, usable with any inference algorithm

  • Prediction aggregation

    • Derived from predictions of the whole document

    • Global dependencies; Inference:

      • first apply baseline without non-local features

      • then apply second system conditioned on output of first system

  • Extended prediction history

    • Condition only on past predictions β†’ greedy / beam search

    • πŸ’‘ Intuition: Beginning of document often easier, later in document terms often get abbreviated

Sequence Model

HMMs

  • Generative model

  • Generative story:

    • Choose document length $N$

    • For each word $t = 0, \dots, N$:

      • Draw NE label $\sim P(y_t \mid y_{t-1})$

      • Draw word $\sim P(x_t \mid y_t)$

        $$P(\mathbf{y}, \mathbf{x}) = \prod_{t} P(y_t \mid y_{t-1})\, P(x_t \mid y_t)$$
  • Example: see the worked sketch after this list

  • πŸ‘ Pros

    • intuitive model
    • Works with unknown label sequences
    • Fast inference
  • πŸ‘Ž Cons

    • Strong limitation on textual features (conditional independence)
    • Model overly simplistic (can improve the generative story but would lose fast inference)
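A worked sketch of the generative story: scoring one tag/word sequence under hand-picked transition and emission tables (all probabilities below are made up for illustration):

```python
import math

# Toy parameters (made up): transition P(y_t | y_{t-1}) and emission P(x_t | y_t).
transitions = {("O", "O"): 0.8, ("O", "PER"): 0.2,
               ("PER", "PER"): 0.5, ("PER", "O"): 0.5}
emissions = {("PER", "Fred"): 0.01, ("O", "showed"): 0.001}

def hmm_log_joint(words, tags, start="O"):
    """log P(y, x) = sum_t log P(y_t | y_{t-1}) + log P(x_t | y_t)."""
    logp, prev = 0.0, start
    for word, tag in zip(words, tags):
        logp += math.log(transitions[(prev, tag)])   # transition score
        logp += math.log(emissions[(tag, word)])     # emission score
        prev = tag
    return logp

print(hmm_log_joint(["Fred", "showed"], ["PER", "O"]))
```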

Max. Entropy

  • Discriminative model $P(y_t \mid x)$

  • Don’t care about generation process or input distribution

  • Only model conditional output probabilities

  • πŸ‘ Pros: Flexible feature design

  • πŸ‘Ž Cons: local classifier β†’ disregards sequence information

CRF

  • Discriminative model

  • πŸ‘ Pros:

    • Flexible feature design

    • Condition on local sequence context

    • Training as easy as MaxEnt

  • πŸ‘Ž Cons: Still no long-range dependencies possible

Modelling

Differences to POS tagging

  • Long-range dependencies
  • Alternative resources can be very helpful
  • Many NEs are more than one word long


Inference

Viterbi

  • Finds exact solution
  • Efficient algorithmic solution using dynamic programming
  • Complexity exponential in order of Markov model
    • Only feasible for small order
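A minimal first-order Viterbi sketch; the scoring function is left abstract (for an HMM it would be log transition + log emission), and the toy score at the end is only for illustration:

```python
def viterbi(words, labels, score):
    """Exact best label sequence under a first-order model.
    score(prev_label, label, word) returns a log-score; prev_label is None at t=0.
    Complexity: O(n * |labels|^2)."""
    best = [{y: score(None, y, words[0]) for y in labels}]   # best[t][y]: best prefix score
    back = [{}]                                              # back-pointers
    for t in range(1, len(words)):
        best.append({})
        back.append({})
        for y in labels:
            scores = {p: best[t - 1][p] + score(p, y, words[t]) for p in labels}
            best_prev = max(scores, key=scores.get)
            best[t][y] = scores[best_prev]
            back[t][y] = best_prev
    y = max(best[-1], key=best[-1].get)                      # best final label
    path = [y]
    for t in range(len(words) - 1, 0, -1):                   # follow back-pointers
        y = back[t][y]
        path.append(y)
    return list(reversed(path))

# Toy usage: prefer PER for capitalized words, O otherwise (illustrative only).
toy_score = lambda prev, y, w: 1.0 if (y == "PER") == w[0].isupper() else -1.0
print(viterbi(["Fred", "showed", "Sue"], ["PER", "O"], toy_score))  # ['PER', 'O', 'PER']
```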

Greedy

  • At each timestep, choose locally best label

  • Fast; supports conditioning on the global history of predictions (but not the future)

  • No possibility for β€œlabel revision”

Beam

  • Keep a beam of the $n$ best greedy solutions, expand and prune
  • Limited room for label revisions
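A matching beam-decoding sketch under the same abstract scoring function as the Viterbi sketch above; with `beam_size=1` it reduces to the greedy decoder described above:

```python
def beam_search(words, labels, score, beam_size=3):
    """Keep the beam_size best partial label sequences; beam_size=1 is greedy decoding."""
    beam = [([], 0.0)]                       # (partial label sequence, log-score)
    for word in words:
        candidates = []
        for seq, logp in beam:
            prev = seq[-1] if seq else None
            for y in labels:
                candidates.append((seq + [y], logp + score(prev, y, word)))
        # Prune: keep only the beam_size highest-scoring hypotheses
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beam[0][0]                        # best full sequence found

# e.g. beam_search(["Fred", "showed", "Sue"], ["PER", "O"], toy_score, beam_size=2)
```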

Gibbs Sampling

  • Stochastic method

  • Easy way to sample from multivariate distribution

  • Normally used to approximate joint distributions or intervals

    P(y(t)∣y(tβˆ’1))=P(y_i(t)∣y_βˆ’i(tβˆ’1),x) P\left(y^{(t)} | y^{(t-1)}\right)=P\left(y\_{i}^{(t)} | y\_{-i}^{(t-1)}, x\right)
    • βˆ’1-1 means all states except ii
  • πŸ’‘ Intuitively:

    • Sample one variable at a time, conditioned on current assignment of all other variables
    • Keep checkpoints (e.g. after each sweep through all variables) to approximate distribution
  • In our case:

    • Initialize NER tags (e.g. random or via Viterbi baseline model)

    • Re-sample one tag at a time, conditioned on input and all other tags

    • After sampling for a long time, we can estimate the joint distribution over outputs $P(y \mid x)$

    • However, it’s slow, and we may only be interested in the best output πŸ€ͺ

    • Could choose best instead of sampling

      $$y^{(t)} = y_{-i}^{(t-1)} \cup \underset{y_i^{(t)}}{\operatorname{argmax}}\, P\left(y_i^{(t)} \mid y_{-i}^{(t-1)}, x\right)$$
      • will get stuck in local optima 😭
    • Better: Simulated annealing

      • Gradually move from sampling to argmax:

        $$P\left(y^{(t)} \mid y^{(t-1)}\right) = \frac{P\left(y_i^{(t)} \mid y_{-i}^{(t-1)}, x\right)^{1 / c_t}}{\displaystyle\sum_j P\left(y_j^{(t)} \mid y_{-j}^{(t-1)}, x\right)^{1 / c_t}}$$
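A minimal sketch of the annealed Gibbs sampler: `cond_probs(tags, i, words)` is an assumed callback that returns $P(y_i \mid y_{-i}, x)$ as a dict over candidate labels, and the schedule for $c_t$ is a simple linear one (both are illustrative choices):

```python
import random

def gibbs_decode(words, labels, cond_probs, sweeps=50, c_start=2.0, c_end=0.1):
    """Simulated-annealing Gibbs sampling: resample one tag at a time,
    gradually moving from sampling (large c_t) towards argmax (small c_t)."""
    tags = [random.choice(labels) for _ in words]        # random initialization
    for s in range(sweeps):
        # Linear annealing schedule for the temperature c_t (assumed here).
        c_t = c_start + (c_end - c_start) * s / max(1, sweeps - 1)
        for i in range(len(words)):
            probs = cond_probs(tags, i, words)           # P(y_i | y_{-i}, x) per label
            weights = [probs[y] ** (1.0 / c_t) for y in labels]
            total = sum(weights)
            tags[i] = random.choices(labels, weights=[w / total for w in weights])[0]
    return tags
```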

External Knowledge

Data

  • Supervised learning:

    • Labeled Data:
      • Text
      • NE Annotation
  • Unsupervised learning

    • Unlabeled Data: Text
    • Problem: Hard to directly learn NER
  • Semi-supervised: Labeled and Unlabeled Data

Word Clustering

  • Problem: Data Sparsity

  • Idea

    • Find lower-dimensional representation of words

    • Real-valued vectors / probabilities have a natural measure of similarity

  • Which words are similar?

    • Distributional notion
    • Words are similar if they appear in similar contexts, e.g.
      • β€œpresident” and β€œchairman” are similar
      • β€œcut” and β€œknife” are not

Words in the same cluster should be similar

Brown clusters

  • Bottom-up agglomerative word clustering

  • Input: sequence of words $w_1, \dots, w_n$

  • Output

    • binary tree
    • Cluster: subtree (according to desired #clusters)
  • πŸ’‘ Intuition: put syntactically β€œexchangeable” words in the same cluster, e.g.:

    • Similar words: president/chairman, Saturday/Monday

    • Not similar: cut/knife

  • Algorithm:

    • Initialization: Every word is its own cluster

    • While there is more than one cluster:

      • Merge the two clusters whose merge maximizes the quality of the clustering
  • Result:

    • Hard clustering: each word belongs to exactly one cluster
  • Quality of the clustering

    • Use class-based bigram language model

    • Quality: logarithm of the probability of the training text normalized by the length of the text

      $$\begin{aligned} \text{Quality}(C) &= \frac{1}{n} \log P\left(w_1, \ldots, w_n\right) \\ &= \frac{1}{n} \log P\left(w_1, \ldots, w_n, C(w_1), \ldots, C(w_n)\right) \\ &= \frac{1}{n} \log \prod_{i=1}^{n} P\left(C(w_i) \mid C(w_{i-1})\right) P\left(w_i \mid C(w_i)\right) \end{aligned}$$
    • Parameters: estimated using maximum-likelihood
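A minimal sketch of evaluating Quality(C) for a fixed cluster assignment, using maximum-likelihood counts from the text itself (smoothing and the start-of-sequence term are omitted; the toy clustering is an assumption):

```python
import math
from collections import Counter

def clustering_quality(words, C):
    """Quality(C) = (1/n) * sum_i [ log P(C(w_i) | C(w_{i-1})) + log P(w_i | C(w_i)) ],
    with both factors estimated by maximum likelihood from the word sequence."""
    n = len(words)
    cluster_count = Counter(C[w] for w in words)
    word_count = Counter(words)
    bigram_count = Counter((C[words[i - 1]], C[words[i]]) for i in range(1, n))
    logp = 0.0
    for i in range(1, n):
        prev_c, cur_c = C[words[i - 1]], C[words[i]]
        p_trans = bigram_count[(prev_c, cur_c)] / cluster_count[prev_c]  # P(C(w_i) | C(w_{i-1}))
        p_emit = word_count[words[i]] / cluster_count[cur_c]             # P(w_i | C(w_i))
        logp += math.log(p_trans) + math.log(p_emit)
    return logp / n

# Toy usage with an assumed two-cluster assignment
words = "the president spoke the chairman spoke".split()
C = {"the": 0, "president": 1, "chairman": 1, "spoke": 0}
print(clustering_quality(words, C))
```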