00-Introduction

Introduction

What is NLP?

Wikipedia: Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction.


What is Dialog Modeling?

  • Designing/building a spoken dialog system, including its goals, user handling, etc.

  • Synonymous with dialog management (DM)

  • Examples

    • Goal-oriented dialog

    • Social dialog / Chat bot

How to do NLP?

  • Aim: Understand the linguistic structure of communication
  • Idea: There are rules that decide whether a sentence is correct or not
    • A proper sentence needs to have (a toy rule check is sketched below):
      • 1 subject
      • 1 verb
      • several objects (depending on the verb’s valence)
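A minimal sketch of such a hand-written rule check. The valence lexicon and the SUBJ/VERB/OBJ role tags are made up for illustration, not part of any real grammar formalism:

```python
# Toy, hand-written grammar check: a "proper" sentence must contain
# exactly one subject, one verb, and as many objects as the verb's
# valence requires. Lexicon and tags below are made up.

VALENCE = {"sleeps": 0, "sees": 1, "gives": 2}  # hypothetical valence lexicon

def is_proper_sentence(tagged):
    """tagged: list of (word, role) pairs, e.g. [("she", "SUBJ"), ...]."""
    subjects = [w for w, role in tagged if role == "SUBJ"]
    verbs = [w for w, role in tagged if role == "VERB"]
    objects = [w for w, role in tagged if role == "OBJ"]
    if len(subjects) != 1 or len(verbs) != 1:
        return False
    # The number of objects must match the verb's valence.
    return len(objects) == VALENCE.get(verbs[0], 0)

print(is_proper_sentence([("she", "SUBJ"), ("sleeps", "VERB")]))  # True
print(is_proper_sentence([("she", "SUBJ"), ("sees", "VERB")]))    # False: "sees" expects one object
```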

TL;DR

  • Task:
    • Linguistic dimension: Syntax, semantics, pragmatics
    • Level: Word, word groups, sentence, beyond sentences
  • Approaches
    • Technique:
      • Rule-based,
      • Statistical,
      • Neural
    • Learning scenario:
      • Supervised,
      • semi-supervised,
      • unsupervised,
      • reinforcement learning
    • Model:
      • Classification,
      • sequence classification,
      • sequence labeling,
      • sequence to sequence,
      • structure prediction

Technique

Hand-written rules to parse sentences (Rule-based)

‼️Problems

  • There is no fixed set of rules
  • Language changes over time
  • A(ny?) language is constantly influenced by other languages
  • Classification of words into POS tags is not always clear

Corpus-based Approaches to NLP (Statistical)

  • Corpus = large collection of annotated texts (or speech files)
  • 👍 advantages:
    • Automatically learn rules from data
    • Statistical models → no hard decisions
    • Use machine learning approaches
      • Feasible now that larger computation resources are available
    • A corpus concentrates on the most common phenomena
  • Input:
    • Data (Text corpora)
    • Machine learning algorithm
  • Output: Statistical model
  • Problems of simple statistical models: feature engineering
    • Which features are important for determining the POS tag? (a sketch follows this list)
      • Word ending
      • Surrounding words
      • Capitalization
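A minimal sketch of this kind of hand-crafted feature extraction. The feature names and the `pos_features` helper are illustrative, not from any particular library; such feature dicts would then be fed to a machine learning algorithm to produce the statistical model:

```python
# Hand-crafted features for one word in a sentence, as used by simple
# statistical POS taggers. Feature names are illustrative.

def pos_features(words, i):
    """Features for the word at position i in a tokenized sentence."""
    w = words[i]
    return {
        "suffix2": w[-2:],                                       # word ending
        "prev": words[i - 1] if i > 0 else "<s>",                # left neighbor
        "next": words[i + 1] if i < len(words) - 1 else "</s>",  # right neighbor
        "capitalized": w[0].isupper(),                           # capitalization
    }

words = ["The", "cat", "sleeps"]
print(pos_features(words, 1))
# {'suffix2': 'at', 'prev': 'The', 'next': 'sleeps', 'capitalized': False}
```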

Deep learning Approaches to NLP (Neural)

  • Use neural networks to automatically infer features
  • Better generalization
  • Successfully applied to many NLP tasks
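A minimal sketch of such a neural model, assuming PyTorch. The architecture (embedding + BiLSTM + linear layer) and all sizes are illustrative, not the lecture’s specific model; the point is that no features are hand-engineered, only token IDs go in:

```python
# Neural sequence labeller (e.g. for POS tagging): features are
# learned by the embedding and LSTM layers instead of hand-crafted.
import torch
import torch.nn as nn

class NeuralTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)  # learned word features
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_tags)

    def forward(self, token_ids):              # (batch, seq_len)
        h, _ = self.lstm(self.emb(token_ids))  # contextual representations
        return self.out(h)                     # (batch, seq_len, num_tags)

tagger = NeuralTagger(vocab_size=10_000, num_tags=17)
scores = tagger(torch.randint(0, 10_000, (1, 5)))  # one 5-token sentence
print(scores.shape)                                # torch.Size([1, 5, 17])
```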

Learning scenarios

  • Supervised learning
  • Unsupervised learning
  • Semi-supervised learning
  • Reinforcement learning

Model types

| Model type               | Input                                                 | Output                          | Example task              |
| ------------------------ | ----------------------------------------------------- | ------------------------------- | ------------------------- |
| Classification           | Fixed input size (e.g. word and surrounding k words)  | Label                           | Word sense disambiguation |
| Sequence classification  | Sequence with variable length                         | Label                           | Sentiment analysis        |
| Sequence labelling       | Sequence with variable length                         | Label sequence with same length | Named entity recognition  |
| Sequence to sequence     | Sequence with variable length                         | Sequence with variable length   | Summarization             |
| Structure prediction     | Sequence with variable length                         | Complex structure               | Parsing                   |
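One way to read the table is as input/output contracts. A sketch of those contracts as Python type hints, with all names as illustrative stand-ins rather than a real API:

```python
# The table's input/output shapes written as type hints.
from typing import Sequence

Token = str
Label = str

def classify(window: tuple[Token, ...]) -> Label: ...            # fixed-size input -> one label
def classify_sequence(tokens: Sequence[Token]) -> Label: ...     # variable length -> one label
def label_sequence(tokens: Sequence[Token]) -> list[Label]: ...  # one label per input token
def seq2seq(tokens: Sequence[Token]) -> list[Token]: ...         # output length may differ
def parse(tokens: Sequence[Token]) -> "ParseTree": ...           # complex structure (tree)
```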

Resources

  • Texts
    • Brown Corpus
    • Penn Treebank
    • Europarl
    • Google books corpus
  • Dictionaries/Ontologies
    • WordNet,
    • GermaNet,
    • EuroWordNet

Approaches to Dialog Modeling

  • Many problems of NLP also apply to Dialog Modeling
  • Use conversational corpora for learning interaction patterns
    • Meeting Corpus (multiparty conversation)
    • Switchboard Corpus (telephone speech)
  • Problems ‼️
    • Very domain-dependent
    • Needs human interaction during training

Why is NLP hard?

Ambiguities! Ambiguities! Ambiguities!

Ambiguities

Examples:

  • “I saw the man with the telescope.” (structural ambiguity: who has the telescope?)
  • “Time flies like an arrow.” (POS ambiguity: is “flies” a noun or a verb?)

Rare events

  • Calculate probabilities for events/words

  • Most words occur only very rarely

    • Most words occur only once
    • What to do with words that do not occur in the training data at all? 🧐 (see the sketch below)
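A minimal sketch of the problem, assuming plain maximum-likelihood estimation on a toy corpus: any unseen word ends up with probability zero:

```python
# Maximum-likelihood word probabilities on a toy corpus: any word
# unseen in training gets probability zero.
from collections import Counter

train = "the cat sat on the mat".split()
counts = Counter(train)
total = sum(counts.values())

def p(word):
    return counts[word] / total  # Counter returns 0 for unseen words

print(p("the"))  # 0.333...
print(p("dog"))  # 0.0 -- unseen in training, yet a perfectly normal word
```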

Zipf’s Law

$$ f \propto \frac{1}{r} $$
  • Order the words by frequency of occurrence
  • Rank r: a word’s position in this ordered list; f: its frequency

The frequency of any word is inversely proportional to its rank in the frequency table.

Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on.

For example, in the Brown Corpus of American English text, the word the is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf’s Law, the second-place word of accounts for slightly over 3.5% of words (36,411 occurrences), followed by and (28,852).
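A minimal sketch of how to check this rank–frequency relationship on a tokenized corpus. The toy text below is made up; a real corpus such as Brown would show the effect properly:

```python
# Rank-frequency check in the spirit of Zipf's law. `tokens` stands in
# for a real corpus (e.g. Brown); the toy text here is made up.
from collections import Counter

tokens = ("the cat saw the dog and the dog saw the cat again " * 50).split()
freqs = Counter(tokens).most_common()  # (word, frequency), sorted by frequency
f1 = freqs[0][1]                       # frequency of the rank-1 word

for rank, (word, f) in enumerate(freqs[:5], start=1):
    # Zipf's law predicts f ≈ f1 / rank on a large natural corpus.
    print(rank, word, f, "Zipf prediction:", round(f1 / rank))
```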