06-Summarization

Summarization

TL;DR

ๆˆชๅฑ2020-09-16 23.43.02

Introduction

What is Summarization?

  • Reduce a natural language text document to a shorter version
  • Goal: compress the text by extracting the most important/relevant parts 💪

Applications

  • Articles, news: Outlines or abstracts

  • Email / Email threads

  • Health information

  • Meeting summarization

Dimensions

  • Single vs. multiple

    • Single-document summarization
      • Given single document
      • Produce abstract, outline, headline, etc.
    • Multiple-document summarization
      • Given a group of documents

      • A series of news stories on the same event

      • A set of web pages about some topic or question

  • Generic vs. Query-focused summarization

    • Generic summarization: summarize the content of the document
    • Query-focused summarization: a kind of complex question answering 🤪
      • Summarize a document with respect to an information need expressed in a user query
      • Longer, descriptive, more informative answers
      • Answer a question by summarizing a document that has information to construct the answer

Techniques

Extraction

  • Select subset of existing text segments
  • e.g.:
    • Sentence extraction

    • Key-phrase extraction

  • Simpler; the main focus of research

Abstraction

  • Use natural language generation to create the summary
  • More human-like

Extractive summarization

Three main components

  • Content selection ("Which parts are important to be in the summary?")
  • Information ordering ("In which order should the selected parts appear?")
  • Sentence realization (clean up / simplify the selected sentences)

Supervised approaches

Key-word extraction

  • Given: Text (e.g. abstract of an article, …)

  • Task: Find most important key phrases

    | Computer                         | Human                   |
    |----------------------------------|-------------------------|
    | Select key phrases from the text | Abstraction of the text |
    | No new wordings                  | New words               |

Key-phrase extraction using Supervised approaches

  • Given: Collection of text with key-words

  • Algorithm

    • Extract all uni-grams, bi-grams and tri-grams

      • Example

        ๆˆชๅฑ2020-09-17 23.08.38

        Extraction: Compatibility, Compatibility of, Compatibility of systems, of, of systems, of systems of, systems, systems of, systems of linear, of linear, of linear constraints, linear, linear constraints, linear constraints over, โ€ฆ

    • Annotate each example with features

    • Annotate training examples with class:

      • 1 if sequence is part of the key words
      • 0 if sequence is not part of the key words
    • Train classifier

    • Create test examples and classify

  • Example set

    • All uni-, bi-, and tri-grams (except punctuation)
    • Restricted to certain POS sets
  • 🔴 Problem: trade-off in candidate generation

    • Need enough examples to cover all/most key phrases

    • Too many examples -> low classifier performance

  • Features

    • Term frequency

    • TF-IDF (Term Frequency - Inverse Document Frequency)

      • Reflects the importance of a word in a document

        \text{TF-IDF}(w, D) = tf(w, D) \cdot idf(w)
        • tf(w, D):

          • Term Frequency, measures how frequently a term occurs in a document
          • Number of occurrences of word w in document D divided by the maximum frequency of any word in D
          tf(w, D) = \frac{\#(w, D)}{\max_{w' \in D} \#(w', D)}

          Alternative definition:

          tf(w, D) = \frac{\text{count of } w \text{ in } D}{\text{number of words in } D}
        • idf(w):

          • Inverse Document Frequency, measures how important a term is
          • Idea: words which occur in fewer documents are more important
          • Logarithm of the number of documents divided by the number of documents which contain w
          idf(w) = \log \frac{|D|}{|\{d \in D : w \in d\}|}
      • Example:

        • Consider a document containing 100 words in which the word cat appears 3 times. Using the alternative definition, the term frequency (tf) for cat is 3 / 100 = 0.03. Now assume we have 10 million documents and cat appears in one thousand of these. The inverse document frequency (idf) is then log10(10,000,000 / 1,000) = 4. The TF-IDF weight is the product of these quantities: 0.03 * 4 = 0.12. (A small code sketch of candidate generation and TF-IDF follows at the end of this section.)
    • Length of the example

    • Relative position of the first occurrence

    • Boolean syntactic features

      • contains all caps
  • Learning algorithm

    • Decision trees

    • Naive Bayes classifier

    • …

  • Evaluation

    • Compare results to reference

      • Test set: Text

      • Human-generated key words

    • Metrics:

      • Precision

      • Recall

      • F-Score

    • 🔴 Problems

      • Humans do not only extract key words but also abstract (introduce new wordings)
      • Normally not all reference key words are reachable by extraction
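
A minimal Python sketch of the candidate generation and the TF-IDF feature described above (the sketch referenced in the TF-IDF example). It is a simplification, not the lecture's exact setup: tf is normalized over all candidate n-grams rather than single words, the +1 in the idf denominator avoids division by zero, and the toy corpus and function names are made up.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def candidates(tokens):
    """Candidate key phrases: all uni-, bi- and tri-grams of a document."""
    return ngrams(tokens, 1) + ngrams(tokens, 2) + ngrams(tokens, 3)

def tf(phrase, doc_tokens):
    """#(w, D) / max_{w'} #(w', D), here computed over all candidates."""
    counts = Counter(candidates(doc_tokens))
    return counts[phrase] / max(counts.values())

def idf(phrase, tokenized_corpus):
    """log(|D| / |{d in D : w in d}|); the +1 avoids division by zero."""
    containing = sum(1 for doc in tokenized_corpus if phrase in candidates(doc))
    return math.log(len(tokenized_corpus) / (1 + containing))

def tfidf_features(doc_tokens, tokenized_corpus):
    """TF-IDF feature value for every candidate phrase of one document."""
    return {c: tf(c, doc_tokens) * idf(c, tokenized_corpus)
            for c in set(candidates(doc_tokens))}

# Toy corpus (made up); in the supervised setting these features would be fed
# to a classifier (decision tree, Naive Bayes, ...) together with the labels.
corpus = [
    "compatibility of systems of linear constraints".split(),
    "the cat was found under the bed".split(),
    "a short document about linear systems".split(),
]
print(sorted(tfidf_features(corpus[0], corpus).items(), key=lambda kv: -kv[1])[:5])
```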

Sentence extraction

  • Use statistical heuristics to select sentences

  • Do not change content and meaning

  • 💡 Idea

    • Use a measure to determine the importance of each sentence
      • TF-IDF
      • A combination of several features trained in a supervised way
    • Rank sentences according to the metric
    • Output the sentences with the highest scores:
      • Fixed number
      • All sentences above a threshold
  • Limitation: the text is NOT changed (e.g. no adding of phrases or deleting parts of the text)

  • Evaluation

    • Idea: Compare the automatic summary to the abstract of the text

      • Problem: different sentences -> nearly no exact matches 😭
    • ROUGE - Recall-Oriented Understudy for Gisting Evaluation

      • Also uses approximate matches

      • Compare automatic summary to human generated text

        • Given a document D and an automatic summary X

          • M humans produce a set of reference summaries of D

          • What percentage of the n-grams from the reference summaries appear in X?

      • ROUGE-N: Overlap of N-grams between the system and reference summaries

        \text{ROUGE-N} = \frac{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{gram_n \in S} \text{Count}_{match}(gram_n)}{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{gram_n \in S} \text{Count}(gram_n)}
        • Example:

          Auto-generated summary (Y):

          the cat was found under the bed

          Gold standard (human produced) (X1):

          the cat was under the bed

          1-grams and 2-grams of summary and reference:

          | #     | 1-gram | reference 1-gram | 2-gram      | reference 2-gram |
          |-------|--------|------------------|-------------|------------------|
          | 1     | the    | the              | the cat     | the cat          |
          | 2     | cat    | cat              | cat was     | cat was          |
          | 3     | was    | was              | was found   | was under        |
          | 4     | found  | under            | found under | under the        |
          | 5     | under  | the              | under the   | the bed          |
          | 6     | the    | bed              | the bed     |                  |
          | 7     | bed    |                  |             |                  |
          | count | 7      | 6                | 6           | 5                |
          • ROUGEโกโˆ’1(X1,Y)=66=1\operatorname{ROUGE}-1(X1, Y) = \frac{6}{6} = 1
          • ROUGEโกโˆ’2(X1,Y)=45\operatorname{ROUGE}-2(X1, Y) = \frac{4}{5}

Unsupervised approaches

Problem of supervised approaches: training data is hard to acquire.

We therefore try unsupervised learning to find the key phrases / sentences that are most important. But which sentences are most important?

💡 Idea: the sentences which are most similar to the other sentences in the text

Graph-based approaches

  • Map text into a graph
    • Nodes:
      • Text segments (words or sentences)
    • Edges: Similarity
  • Find most important/central vertices
  • Algorithms: TextRank / LexRank

Graph-based approaches: Key-phrase extraction

  • Graph

    • Nodes
      • Text segments: Words
      • Restriction:
        • Nouns
        • Adjectives
    • Edges
      • Words that co-occur within a window of N words
  • Calculate most important nodes

  • Build multi-word expressions in post-processing

    • Mark the selected items in the original text
    • If two adjacent words are marked -> collapse them into one multi-word expression
  • Example

    ๆˆชๅฑ2020-09-18 00.43.48

Graph-based approaches: Sentence extraction

Graph:

  • Nodes: Sentences

  • Edges: Fully connected with weights

  • Weights:

    • TextRank: Word overlap normalized to sentence length

      \text{Similarity}(S_i, S_j) = \frac{\left|\{ w_k \mid w_k \in S_i \wedge w_k \in S_j \}\right|}{\log(|S_i|) + \log(|S_j|)}
    • LexRank: Cosine Similarity of TF-IDF vectors

      \text{idf-modified-cosine}(x, y) = \frac{\sum_{w \in x, y} \mathrm{tf}_{w,x}\,\mathrm{tf}_{w,y}\,(\mathrm{idf}_w)^2}{\sqrt{\sum_{x_i \in x} (\mathrm{tf}_{x_i,x}\,\mathrm{idf}_{x_i})^2} \times \sqrt{\sum_{y_i \in y} (\mathrm{tf}_{y_i,y}\,\mathrm{idf}_{y_i})^2}}
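
A corresponding sketch for graph-based sentence extraction, using the TextRank similarity above as edge weights and PageRank for centrality. Again networkx is assumed to be available; a LexRank-style variant would simply swap in the idf-modified cosine as the weight function. The example document is made up.

```python
import math
import networkx as nx  # assumed available

def textrank_similarity(s_i, s_j):
    """Word overlap normalized by the log sentence lengths (formula above)."""
    denom = math.log(len(s_i)) + math.log(len(s_j))
    return len(set(s_i) & set(s_j)) / denom if denom > 0 else 0.0

def extract_sentences(sentences, top_k=2):
    """Rank tokenized sentences by centrality in a fully connected weighted graph."""
    graph = nx.Graph()
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            graph.add_edge(i, j, weight=textrank_similarity(sentences[i], sentences[j]))
    scores = nx.pagerank(graph, weight="weight")
    best = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [" ".join(sentences[i]) for i in sorted(best)]  # keep document order

document = [
    "the cat was found under the bed".split(),
    "a cat was sleeping under the bed".split(),
    "extractive summarization selects sentences from the text".split(),
]
print(extract_sentences(document))
```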

Abstractive summarization

  • Sequence to Sequence task
    • Input: Document
    • Output: Summary
  • Several NLP tasks can be modeled like this (ASR, MT, …)
  • Successful deep learning approach: Encoder-Decoder Model

Sequence-to-Sequence Model

  • Predict words based on
    • previous target words and
    • source sentence
  • Encoder: Read in source sentence
  • Decoder: Generate target sentence word by word

Encoder

ๆˆชๅฑ2020-09-18 10.11.37

  • Read in input: Represent content as hidden vector with fixed dimension

  • LSTM-based model

  • Fixed-size sentence representation

  • Details:

    ๆˆชๅฑ2020-09-18 10.12.59

    • Oneโ€“hot encoding
    • Word embedding
    • RNN layer(s)

Decoder

ๆˆชๅฑ2020-09-18 10.13.33

  • Generate output: Use output of encoder as input

  • LSTM-based model

  • Input: the last target word
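
The following is a minimal PyTorch sketch of the LSTM encoder-decoder described above (teacher forcing during training). Vocabulary sizes, dimensions and names are made up, and attention is left out here.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: the encoder compresses the source into a
    fixed-size hidden state, the decoder generates the summary word by word."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)   # one-hot ids -> embeddings
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encoder: read the source sentence, keep only the final hidden state.
        _, state = self.encoder(self.src_emb(src_ids))
        # Decoder: predict each word from the previous target words
        # (teacher forcing) and the encoder state.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)  # (batch, tgt_len, tgt_vocab) logits

model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (2, 12))  # batch of 2 source sequences (toy ids)
tgt = torch.randint(0, 1000, (2, 5))   # previous target words
print(model(src, tgt).shape)           # torch.Size([2, 5, 1000])
```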

Attention-based Encoder-Decoder

ๆˆชๅฑ2020-09-18 10.17.42 ๆˆชๅฑ2020-09-18 10.16.48

Attention-based Encoder-Decoder: copy mechanism

Calculate the probability that it is "better to generate a word from the vocabulary than to copy a word from the source sentence":

p_{gen} = \sigma\left(w_c^T c_t + w_s^T s_t + w_x^T x_t + b_{ptr}\right)

The word with the highest probability is chosen as the output word:

P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{j: w_j = w} \alpha_{ij}
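
A small sketch of how these two equations combine into the final output distribution at one decoding step. The numbers are toys; in the real pointer-generator model p_gen comes from the context vector c_t, decoder state s_t and decoder input x_t as in the first formula.

```python
import torch

def final_distribution(p_vocab, attention, src_ids, p_gen):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention on
    source positions j whose word w_j equals w."""
    p_copy = torch.zeros_like(p_vocab)
    p_copy.scatter_add_(0, src_ids, attention)  # project attention onto vocab ids
    return p_gen * p_vocab + (1 - p_gen) * p_copy

p_vocab   = torch.tensor([0.1, 0.2, 0.3, 0.1, 0.2, 0.1])  # decoder softmax over 6 words
attention = torch.tensor([0.5, 0.3, 0.2])                  # alpha_ij over 3 source tokens
src_ids   = torch.tensor([2, 4, 2])                        # vocab ids of the source tokens
p_gen     = 0.7                                            # sigma(...) from the formula above
print(final_distribution(p_vocab, attention, src_ids, p_gen))
```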

Data

  • Training data
    • Documents and their summaries
  • DUC data set
    • News articles
    • Around 14-word summaries
  • Gigaword
    • News articles
    • Headline generation
  • CNN/Daily Mail corpus
    • Articles
    • Predict bullet points
