Dialog Management

Screenshot 2020-09-20 11.48.55

Screenshot 2020-09-20 11.55.10

Dialog Modeling

Dialog manager

  • Manage flow of conversation

  • Input: Semantic representation of the input

  • Output: Semantic representation of the output

  • Utilize additional knowledge

    • User information

    • Dialog History

    • Task-specific information

🔴 Challenges

  • Consists of many different components

    • Each component has errors

    • More components –> less robust

  • Should be modular

  • Need to find unambiguous representation

  • Hard to train from data

Dialog Types

Goal-oriented Dialog

  • Follows a fixed (set of) goals
    • Ticket vending machines

    • Restaurant reservation

    • Car SDS

  • Aim: Reach goal as fast as possible
  • Main focus of SDS research

Social Dialog

  • Social Dialog / Conversational Bots / Chit-Chat Setting

  • Most human-like form of dialog

  • Small talk conversation

  • Aims:

    • Generate interesting, coherent, meaningful responses

    • Carry on the conversation as long as possible

    • Be a companion

Dialog Systems

Initiative

  • System Initiative

    • Command & control

    • Example (U: User, S: System)

      Screenshot 2020-09-20 12.03.55
  • Mixed Initiative

    • Most natural

    • Example

      Screenshot 2020-09-20 12.05.30
  • User Initiative

    • User has the most control

    • Error-prone

    • Example

      Screenshot 2020-09-20 12.06.52

Confirmation

  • Explicit verification

    Screenshot 2020-09-20 12.08.03
  • Implicit verification

    Screenshot 2020-09-20 12.08.25
  • Alternative verification

    Screenshot 2020-09-20 12.08.41

Development

Components

  • Dialog Model: contains information about

    • whether system, user or mixed initiative?
    • whether explicit or implicit confirmation?
    • what kind of speech acts needed?
  • User Model: contains the system’s beliefs about

    • what the user knows

    • the user’s expertise, experience and ability to understand the system’s utterances

  • Knowledge Base: contains information about

    • the world and the domain
  • Discourse Context: contains information about

    • the dialog history and the current discourse
  • Reference Resolver

    • performs reference resolution and handles ellipsis
  • Plan Recognizer and Grounding Module

    • interprets the user’s utterance given the current context
    • reasons about the user’s goals and beliefs
  • Domain Reasoner/Planner

    • generates plans to achieve the shared goals
  • Discourse Manager

    • manages all information of dialog flow
  • Error Handling

    • detection of and recovery from errors or misunderstandings

Rule-based Systems

Screenshot 2020-09-20 13.42.02

Finite State-based

  • 💡 Idea: Iterate through states that define actions

  • Dialog flow:

    • specified as a set of dialog states (stages)

    • transitions denoting various alternative paths through the dialog graph

    • Nodes = dialogue states (prompts)

    • Arcs = actions based on the recognized response

  • Example

    Screenshot 2020-09-20 12.57.29

  • 👍 Advantages

    • Simple to construct due to simple dialog control
    • The required vocabulary and grammar for each state can be specified in advance
      • Results in more constrained ASR and SLU
  • 👎 Disadvantages

    • Restricts the user’s input to predetermined words/phrases
    • Makes the correction of misrecognized items difficult
    • Inhibits the user’s opportunity to take the initiative and ask questions or introduce new topics
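
To make the node/arc picture concrete, here is a minimal sketch in Python (the ticket-machine states, prompts, and the `run_dialog` helper are hypothetical, not from the lecture): nodes are dialog states with fixed prompts, arcs are transitions selected by the already-recognized user response.

```python
# Finite-state dialog manager sketch (hypothetical ticket-machine domain).
# Nodes = dialog states with fixed prompts, arcs = transitions keyed by the
# recognized user response.
STATES = {
    "ask_destination": {
        "prompt": "Where would you like to go?",
        "next": lambda slu: "ask_date",                 # any destination moves on
    },
    "ask_date": {
        "prompt": "On which date?",
        "next": lambda slu: "confirm",
    },
    "confirm": {
        "prompt": "Shall I book the ticket? (yes/no)",
        "next": lambda slu: "done" if slu == "yes" else "ask_destination",
    },
}

def run_dialog(get_user_input):
    state = "ask_destination"
    while state != "done":
        print(STATES[state]["prompt"])
        slu_result = get_user_input()                   # stands in for ASR + SLU
        state = STATES[state]["next"](slu_result)
    print("Ticket booked. Goodbye!")

# run_dialog(input)   # interactive use
```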

Frame-based

  • 💡 Idea: Fill slots in a frame that defines the goal

  • Dialog flow:

    • is NOT predetermined, but depends on
      • the contents of the user’s input

      • the information that the system has to elicit

  • Example

    • Eg1

      Screenshot 2020-09-20 13.12.50
    • Eg2

      Screenshot 2020-09-20 13.13.34
  • Slot(/Form/Template) filling

    • One slot per piece of information

    • Takes a particular action based on the current state of affairs

  • Questions and other prompts

    • Specified as a list of possibilities
    • Each with conditions that have to be true for that particular question or prompt to be used
  • 👍 Advantages

    • User can provide over-informative answers
    • Allows more natural dialogues
  • 👎 Disadvantages

    • Cannot handle complex dialogues
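
A minimal sketch of the slot-filling idea above, assuming a hypothetical restaurant-reservation frame; because the next question depends only on which slots are still empty, an over-informative answer simply fills several slots at once:

```python
# Frame-based dialog manager sketch (hypothetical restaurant domain).
# One slot per piece of information; the next prompt depends only on
# which slots are still missing, not on a fixed dialog path.
FRAME = {"cuisine": None, "area": None, "time": None}

PROMPTS = {
    "cuisine": "What kind of food would you like?",
    "area": "Which part of town?",
    "time": "For what time?",
}

def update_frame(frame, slu_result):
    """slu_result: dict of slot -> value extracted by SLU (may fill several slots)."""
    for slot, value in slu_result.items():
        if slot in frame:
            frame[slot] = value

def next_action(frame):
    for slot, value in frame.items():
        if value is None:
            return PROMPTS[slot]              # ask for the first missing slot
    return "Querying the database with: " + str(frame)

# Example: an over-informative first answer fills two slots at once.
update_frame(FRAME, {"cuisine": "italian", "area": "city centre"})
print(next_action(FRAME))                     # -> "For what time?"
```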

Agent-based

  • 💡 Idea:

    • Communication viewed as interaction between two agents

    • Each capable of reasoning about its own actions and beliefs

    • also about other’s actions and beliefs

    • Use of “contexts”

  • Example

    Screenshot 2020-09-20 13.20.28
  • Allow complex communication between the system, the user and the underlying application to solve some problem/task

  • Many variants, depending on which aspects of intelligent behavior are included

  • Tends to be mixed-initiative

    • User can control the dialog, introduce new topics, or make contributions
  • 👍 Advantages

    • Allow natural dialogue in complex domains
  • 👎 Disadvantages

    • Such agents are usually very complex
    • Hard to build 😢

Limitations of Rule-based DM

  • Expensive to build: requires manual work

  • Fragile to ASR errors

  • No self-improvement over time

Statistical DM

  • Motivation

    • User intention can ONLY be imperfectly known

      • Incompleteness – user may not specify full intention initially
      • Noisiness – errors from ASR/SLU
    • Automatic learning of dialog strategies

      • Rule-based design is time-consuming
  • 👍 Advantages

    • Maintain a distribution over multiple hypotheses for the correct dialog state

      • Not a single hypothesis for the dialog state
    • Choose actions through an automatic optimization process

    • Technology is not domain dependent

      • the same technology can be applied to other domains by learning from new domain data

Markov Decision Process (MDP)

  • A model for sequential decision making problems

    • Solved using dynamic programming and reinforcement learning
    • MDP based SDM: dialog evolves as a Markov process
  • Specified by a tuple $(S, A, T, R)$

    • $S$: a set of possible world states $s \in S$

    • $A$: a set of possible actions $a\in A$

    • $R$: a local real-valued reward function

      $$ R: S \times A \mapsto \mathbb{R} $$
    • $T$: a transition model

      $$ T(s\_{t-1}, a\_{t-1}, s\_t) = P(s\_t | s\_{t-1}, a\_{t-1}) $$
  • 🎯 Goal of MDP based SDM: Maximize its expected cumulative (discounted) reward

    $$ E\left(\sum\_{t=0}^{\infty} \gamma^{t} R\left(s\_{t}, a\_{t}\right)\right) $$
  • Requires complete knowledge of $S$ !!!
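
As a worked illustration of the tuple $(S, A, T, R)$ and the discounted return (all states, actions, probabilities, and rewards below are made up), value iteration computes the maximum expected cumulative reward from each state:

```python
# Toy MDP sketch with made-up states, actions, probabilities, and rewards.
# T maps (state, action) to a list of (next_state, probability) pairs.
S = ["ask", "confirm", "done"]
A = ["request_info", "confirm_booking"]

T = {
    ("ask", "request_info"):        [("ask", 0.3), ("confirm", 0.7)],
    ("ask", "confirm_booking"):     [("ask", 0.9), ("done", 0.1)],
    ("confirm", "request_info"):    [("confirm", 0.5), ("ask", 0.5)],
    ("confirm", "confirm_booking"): [("done", 0.8), ("ask", 0.2)],
}
R = {
    ("ask", "request_info"): -1, ("ask", "confirm_booking"): -3,
    ("confirm", "request_info"): -1, ("confirm", "confirm_booking"): 10,
}
gamma = 0.9

# Value iteration: V(s) = max_a [ R(s, a) + gamma * sum_s' T(s, a, s') * V(s') ]
V = {s: 0.0 for s in S}
for _ in range(100):                      # sweeps until (approximate) convergence
    V_new = {"done": 0.0}                 # terminal state: no further reward
    for s in ["ask", "confirm"]:
        V_new[s] = max(
            R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)])
            for a in A
        )
    V = V_new
print(V)  # maximum expected discounted cumulative reward from each state
```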

Reinforcement Learning

  • “Learning through trial-and-error” (reward/penalty)

  • 🔴 Problem

    • No direct feedback

    • Only feedback at the end of dialog

  • 🎯 Goal: Learn evaluation function from feedback

  • 💡 Idea

    • Initially, all operations have equal probability

    • If the dialog was successful –> reinforce the operations taken (positive feedback)

    • If the dialog failed –> penalize the operations taken (negative feedback)

How RL works?

  • There is an agent with the capacity to act

  • Each action influences the agent’s future state

  • Success is measured by a scalar reward signal

  • In a nutshell:

    • Select actions to maximize future reward

    • Ideally, a single agent could learn to solve any task 💪

Sequential Decision Making

  • 🎯 Goal: select actions to maximize total future reward
  • Actions may have long term consequences
  • Reward may be delayed
  • It may be better to sacrifice immediate reward to gain more long-term reward 🤔

Agent and Environment

Screenshot 2020-09-20 15.50.33

At each step $t$

  • Agent:
    • Receives state $s\_t$
    • Receives scalar reward $r\_t$
    • Executes action $a\_t$
  • The environment:
    • Receives action $a\_t$
    • Emits next state $s\_{t+1}$
    • Emits scalar reward $r\_{t+1}$
  • The evolution of this process is called a Markov Decision Process (MDP)

Supervised Learning Vs. Reinforcement Learning

Supervised Learning:

Screenshot 2020-09-20 16.04.21
  • A label is given: we can compute the gradient from the label and update our parameters

Reinforcement Learning

Screenshot 2020-09-20 16.05.11

  • NO label given: instead we have feedback from the environment
  • Not an absolute label / error. We can compute gradients, but do not yet know if our action choice is good. 🤪

Policy and Value Functions

  • Policy $\pi$ : a probability distribution of actions given a state

    $$ a = \pi(s) $$
  • Value function $Q^\pi(s, a)$ : the expected total reward from state $s$ and action $a$ under policy $\pi$

    $$ Q^{\pi}(s, a)=\mathbb{E}\left[r\_{t+1}+\gamma r\_{t+2}+\gamma^{2} r\_{t+3}+\cdots \mid s, a\right] $$
    • “How good is action $a$ in state $s$?”
      • Same reward for two actions, but different consequences down the road
      • Want to update our value function accordingly
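
Written recursively, the same value function satisfies the Bellman expectation relation, which the Q-Learning update in the next subsection approximates from samples:

$$ Q^{\pi}(s, a)=\mathbb{E}\left[r\_{t+1}+\gamma\, Q^{\pi}\left(s\_{t+1}, a\_{t+1}\right) \mid s, a\right] $$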

Approaches to RL

  • Policy-based RL

    • Search directly for the optimal policy $\pi^\*$

      (policy achieving maximum future reward)

  • Value-based RL

    • Estimate the optimal value function $Q^{∗}(s,a)$ (maximum value achievable under any policy)
    • Q-Learning: Learn Q-Function that approximates $Q^{∗}(s,a)$
      • Maximum reward when taking action $a$ in $s$
      • Policy: Select action with maximal $Q$ value
      • Algorithm:
        • Initialize $Q$ randomly
        • $Q(s, a) \leftarrow(1-\alpha) Q(s, a)+\alpha\left(r\_{t}+\gamma \cdot \underset{a}{\max} Q\left(s\_{t+1}, a\right)\right)$
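
A minimal tabular Q-Learning sketch of the update rule above; the environment interface (`actions`, `reset`, `step`), e.g. a user simulator, is an assumption for illustration:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-Learning. `env` is an assumed interface (e.g. a user simulator)
    with `actions`, `reset() -> state` and `step(action) -> (state, reward, done)`."""
    Q = defaultdict(random.random)               # initialize Q randomly

    def best_action(state):
        return max(env.actions, key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy: mostly exploit the current Q, sometimes explore
            action = random.choice(env.actions) if random.random() < epsilon \
                     else best_action(state)
            next_state, reward, done = env.step(action)
            # Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))
            target = reward + gamma * max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] = (1 - alpha) * Q[(state, action)] + alpha * target
            state = next_state
    return Q
```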

Goal-oriented Dialogs: Statistical POMDP

POMDP : Partially Observable Markov Decision Process

  • MDP –> POMDP: the state $s$ cannot be directly observed

    • POMDP based SDM –> reinforcement learning + belief state tracking

      • dialog evolves as a Markov process $P(s\_t | s\_{t-1}, a\_{t-1})$

      • $s\_t$ is NOT directly observable

        –> belief state $b(s\_t)$: prob. distribution of all states

      • SLU outputs a noisy observation $o\_t$ of the user input with prob. $P(o\_t|s\_t)$

  • Specified by tuple $(S, A, T, R, O, Z)$

    • $S, A, T, R$ constitute an MDP

    • $O$: a finite set of observations received from the environment

    • $Z$: the observation function s.t.

      $$ Z(o\_t,s\_t,a\_{t-1}) = P(o\_t|s\_t,a\_{t-1}) $$
  • Local reward is the expected reward $\rho$ over belief states

    $$ \rho(b, a)=\sum\_{s \in S} R(s, a) \cdot b(s) $$
  • Goal: maximize the expected cumulative reward.

  • Operation (at each time step)

    Screenshot 2020-09-20 17.07.48
    • World is in unobserved state $s\_t$
    • Maintain distribution over all possible states with $b\_t$

      $$ b\_t(s\_t) = \text{Probability of being in state } s\_t $$
    • DM selects action $a\_t$ based on $b\_t$

    • Receive reward $r\_t$

    • Transition to unobserved state $s\_{t+1}$ ONLY depending on $s\_t$ and $a\_t$

    • Receive observation $o\_{t+1}$ ONLY depending on $a\_t$ and $s\_{t+1}$

  • Update of belief state

    $$ b\_{t+1}\left(s\_{t+1}\right)=\eta P\left(o\_{t+1} \mid s\_{t+1}, a\_{t}\right) \sum\_{s\_{t}} P\left(s\_{t+1} \mid s\_{t}, a\_{t}\right) b\_{t}\left(s\_{t}\right) $$
  • Policy $\pi$:

    $$ \pi(b) \in A $$
  • Value function:

    $$ V^{\pi}\left(b\_{t}\right)=\mathbb{E}\left[r\_{t}+\gamma r\_{t+1}+\gamma^{2} r\_{t+2}+\ldots\right] $$
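
A minimal sketch of the belief update above for a tiny discrete state set; the transition and observation models are made up for illustration:

```python
# Belief update sketch over a small discrete state set (made-up models).
# b_{t+1}(s') = eta * P(o | s', a) * sum_s P(s' | s, a) * b_t(s)
STATES = ["wants_italian", "wants_chinese"]

def belief_update(belief, action, observation, trans_prob, obs_prob):
    """belief: dict state -> probability; trans_prob(s_next, s, a) = P(s'|s,a);
    obs_prob(o, s_next, a) = P(o|s',a). All models here are hypothetical."""
    unnormalized = {}
    for s_next in STATES:
        predicted = sum(trans_prob(s_next, s, action) * belief[s] for s in STATES)
        unnormalized[s_next] = obs_prob(observation, s_next, action) * predicted
    eta = 1.0 / sum(unnormalized.values())       # normalization constant
    return {s: eta * p for s, p in unnormalized.items()}

# Made-up models: the user rarely changes their goal; SLU is 80% reliable.
trans = lambda s_next, s, a: 0.9 if s_next == s else 0.1
obs   = lambda o, s_next, a: 0.8 if o == s_next else 0.2

b0 = {"wants_italian": 0.5, "wants_chinese": 0.5}
b1 = belief_update(b0, "ask_cuisine", "wants_italian", trans, obs)
print(b1)   # belief mass shifts towards "wants_italian"
```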

POMDP model

Screenshot 2020-09-20 23.07.52

  • Two stochastic models

    • Dialogue model $M$
      • Transition and observation probability model
      • In what state is the dialogue at the moment
    • Policy Model $\mathcal{P}$
      • What is the best next action
  • Both models are optimized jointly

    • Maximize the expected accumulated sum of rewards
      • Online: Interaction with user
      • Offline: Training with corpus
  • Key ideas

    • Belief tracking
      • Represent uncertainty

      • Pursuing all possible dialogue paths in parallel

    • Reinforcement learning
      • Use machine learning to learn parameters
  • 🔴 Challenges

    • Belief tracking
    • Policy learning
    • User simulation

Belief state

Screenshot 2020-09-20 23.21.04
  • Information encoded in the state

    $$ \begin{aligned} b\_{t+1}\left(g\_{t+1}, u\_{t+1}, h\_{t+1}\right)=& \eta P\left(o\_{t+1} \mid u\_{t+1}\right) \\\\ \cdot & P\left(u\_{t+1} \mid g\_{t+1}, a\_{t}\right) \\\\ \cdot & \sum_{g\_{t}} P\left(g\_{t+1} \mid g\_{t}, a\_{t}\right) \\\\ \cdot & \sum_{h\_{t}} P\left(h\_{t+1} \mid g\_{t+1}, u\_{t+1}, h\_{t}, a\_{t}\right) \\\\ \cdot & b\_{t}\left(g\_{t}, h\_{t}\right) \end{aligned} $$
    • User goal $g\_t$: Information from the user necessary to fulfill the task
    • User utterance $u\_t$
      • What was said
      • Not what was recognized
    • Dialogue history $h\_t$
  • Using independence assumptions

  • Observation model: Probability of observation $o$ given $u$

    • Reflect speech understanding errors
  • User model: probability of the user utterance given the previous system action and the new user goal

  • Goal transition model

  • History model

  • Model still too complex 🤪

    • Solution
      • n-best approach
      • Factored approach
      • Combination is possible

Policy

  • Mapping between belief states and system actions
  • 🎯 Goal: Find the optimal policy $\pi^\*$
  • Problem: State and action space very large
  • But:
    • Small part of belief space only visited
    • Plausible actions at every point very restricted
  • Summary space: Simplified representation

🔴 Disadvantages

  • Predefine structure of the dialog states

    • Location

    • Price range

    • Type of cuisine

  • Limited to very narrow domain

  • Cannot encode all features/slots that might be useful

Neural Dialog Models

  • End-to-End training

    • Optimize all parameters jointly
  • Continuous representations

    • No early decision
    • No propagation of errors
  • Challenges

    • Representation of history/context
    • Policy learning
      • Interactive learning
    • Integration of knowledge sources

Datasets

  • Goal oriented

    • bAbI task

      • Synthetic data – created by templates
    • DSTC (Dialog State Tracking Challenge)

      • Restaurant reservation

      • Collected using 3 dialog managers

      • Annotated with dialog states

  • Social dialog

    • Learn from human-human communication

Architecture

Memory Networks

Screenshot 2020-09-20 17.43.00
  • Neural network model

  • Writing and reading from a memory component

  • Store dialog history

    • Learn to focus on important parts
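
A minimal sketch of the reading step, assuming the dialog history has already been embedded into vectors (all names and numbers are illustrative): attention weights decide which stored turns to focus on.

```python
import numpy as np

def read_memory(query, memory):
    """Soft memory read: attention over stored history embeddings.
    query: (d,) vector for the current utterance; memory: (n, d) matrix,
    one row per stored dialog turn (hypothetical embeddings)."""
    scores = memory @ query                      # relevance of each memory slot
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax -> "focus" distribution
    return weights @ memory                      # weighted sum = retrieved context

# Three stored turns, 4-dimensional embeddings (made-up numbers)
memory = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0]])
query = np.array([0.0, 0.0, 1.0, 0.5])
print(read_memory(query, memory))                # dominated by the third turn
```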

Sequence-to-Sequence Models: Encoder-Decoder

Screenshot 2020-09-20 22.50.42
  • Encoder

    • Read in Input
    • Represent the content in a fixed-dimension hidden vector
    • LSTM-based model
  • Decoder

    • Generate Output
    • Use the fixed-dimension vector as input
    • LSTM-based model
    • The EOS symbol signals the decoder to start outputting
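
A minimal PyTorch-style sketch of such an LSTM encoder-decoder; the class name, dimensions, and teacher-forcing interface are illustrative, not the lecture's implementation:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """LSTM encoder reads the input turn into a fixed-dimension state;
    LSTM decoder generates the response token by token from that state."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_tokens, tgt_tokens):
        # Encoder: compress the input into (h, c), the fixed-dimension vector
        _, state = self.encoder(self.embed(src_tokens))
        # Decoder: start from the encoder state (teacher forcing at training time)
        dec_out, _ = self.decoder(self.embed(tgt_tokens), state)
        return self.out(dec_out)                 # logits over the vocabulary

model = Seq2Seq(vocab_size=1000)
logits = model(torch.randint(0, 1000, (2, 7)),   # batch of 2 input turns
               torch.randint(0, 1000, (2, 5)))   # shifted target turns
print(logits.shape)                              # torch.Size([2, 5, 1000])
```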

Example

Screenshot 2020-09-20 22.52.57
  • Recurrent-based Encoder-Decoder Architecture

  • Trained end-to-end.

  • Encoder

    Screenshots 2020-09-20 22.54.14 / 22.54.27 / 22.54.47 / 22.55.02
  • Decoder

    Screenshots 2020-09-20 22.55.31 / 22.55.54

Dedicated Dialog Architecture

Screenshot 2020-09-20 22.57.55

Screenshot 2020-09-20 22.58.59

Training

Supervised learning

  • Supervised: Learning from corpus

  • Algorithm:

    • Input user utterance
    • Calculate system output
    • Measure error
    • Backpropagate the error
    • Update weights
  • Problem:

    • Errors lead to a different dialogue state
    • Compounding errors
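
A sketch of this supervised training loop, assuming a corpus of (user utterance, reference response) token-id tensors and a seq2seq model like the one sketched earlier:

```python
import torch
import torch.nn as nn

def train(model, corpus, epochs=5, lr=1e-3):
    """Supervised training from a corpus of (user_utt, reference) tensor pairs.
    `model` is assumed to be a seq2seq network as sketched above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for user_utt, reference in corpus:               # tensors of token ids
            logits = model(user_utt, reference[:, :-1])  # predict next tokens
            loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                           reference[:, 1:].reshape(-1))  # measure error
            optimizer.zero_grad()
            loss.backward()                              # backpropagate error
            optimizer.step()                             # update weights
```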

Imitation learning

  • Imitation learning
    • Interactive learning

    • Correct mistakes and demonstrate expected actions

  • Algorithm: same as supervised learning
  • Problem: costly

Deep reinforcement learning

  • Reinforcement learning

    • Interactive learning
    • Feedback only at end of the dialogue
      • Successful/ Failed task

      • Additional reward for fewer steps 👏

  • Challenge:

    • Sampling of different actions
    • Huge action space