Dialog Management

(Screenshot 2020-09-20 11.48.55)

(Screenshot 2020-09-20 11.55.10)

Dialog Modeling

Dialog manager

  • Manage flow of conversation

  • Input: Semantic representation of the input

  • Output: Semantic representation of the output

  • Utilize additional knowledge

    • User information

    • Dialog History

    • Task-specific information

πŸ”΄ Challenges

  • Consisting of many different components

    • Each component has errors

    • More components –> less robust

  • Should be modular

  • Need to find unambiguous representation

  • Hard to train from data

Dialog Types

Goal-oriented Dialog

  • Follows a fixed (set of) goals
    • Ticket vending machines

    • Restaurant reservation

    • Car SDS

  • Aim: Reach goal as fast as possible
  • Main focus of SDS research

Social Dialog

  • Social Dialog / Conversational Bots / Chit-Chat Setting

  • Most human

  • Small talk conversation

  • Aims:

    • Generate interesting, coherent, meaningful responses

    • Carry-on as long as possible

    • Be a companion

Dialog Systems

Initiative

  • System Initiative

    • Command & control

    • Example (U: User, S: System)

      (Screenshot 2020-09-20 12.03.55)
  β€’ Mixed Initiative

    β€’ Most natural

    β€’ Example

      (Screenshot 2020-09-20 12.05.30)
  β€’ User Initiative

    β€’ The user has the most control

    β€’ Error-prone

    β€’ Example

      (Screenshot 2020-09-20 12.06.52)

Confirmation

  • Explicit verification

    (Screenshot 2020-09-20 12.08.03)
  β€’ Implicit verification

    (Screenshot 2020-09-20 12.08.25)
  β€’ Alternative verification

    (Screenshot 2020-09-20 12.08.41)

Development

Components

  • Dialog Model: contains information about

    • whether system, user or mixed initiative?
    • whether explicit or implicit confirmation?
    • what kind of speech acts needed?
  • User Model: contains the system’s beliefs about

    • what the user knows

    • the user’s expertise, experience and ability to understand the system’s utterances

  • Knowledge Base: contains information about

    • the world and the domain
  • Discourse Context: contains information about

    • the dialog history and the current discourse
  • Reference Resolver

    • performs reference resolution and handles ellipsis
  • Plan Recognizer and Grounding Module

    • interprets the user’s utterance given the current context
    • reasons about the user’s goals and beliefs
  • Domain Reasoner/Planner

    • generates plans to achieve the shared goals
  • Discourse Manager

    • manages all information of dialog flow
  • Error Handling

    • errors or misunderstandings detection and recovery

Rule-based Systems

(Screenshot 2020-09-20 13.42.02)

Finite State-based

  • πŸ’‘ Idea: Iterate though states that define actions

  • Dialog flow:

    • specified as a set of dialog states (stages)

    • transitions denoting various alternative paths through the dialog graph

    • Nodes = dialogue states (prompts)

    • Arcs = actions based on the recognized response

  • Example

    (Screenshot 2020-09-20 12.57.29)

  • πŸ‘ Advantages

    • Simple to construct due to simple dialog control
    • The required vocabulary and grammar for each state can be specified in advance
      • Results in more constrained ASR and SLU
  • πŸ‘Ž Disadvantages

    • Restrict the user’s input to predetermined words/phrases
    • Makes the correction of misrecognized items difficult
    • Inhibits the user’s opportunity to take the initiative and ask questions or introduce new topics

Frame-based

  • πŸ’‘ Idea: Fill slots in a frame that defines the goal

  • Dialog flow:

    • is NOT predetermined, but depends on
      • the contents of the user’s input

      • the information that the system has to elicit

  • Example

    • Eg1

      (Screenshot 2020-09-20 13.12.50)
    β€’ Eg2

      (Screenshot 2020-09-20 13.13.34)
  • Slot(/Form/Template) filling

    • One slot per piece of information

    • Takes a particular action based on the current state of affairs

  • Questions and other prompts

    • List of possibilities
    • conditions that have to be true for that particular question or prompt
  • πŸ‘ Advantages

    • User can provide over-informative answers
    • Allows more natural dialogues
  • πŸ‘Ž Disadvantages

    • Cannot handle complex dialogues

Agent-based

  • πŸ’‘ Idea:

    • Communication viewed as interaction between two agents

    • Each capable of reasoning about its own actions and beliefs

    • also about other’s actions and beliefs

    • Use of β€œcontexts”

  • Example

    (Screenshot 2020-09-20 13.20.28)
  • Allow complex communication between the system, the user and the underlying application to solve some problem/task

  • Many variants depends on particular aspects of intelligent behavior included

  • Tends to be mixed-initiative

    • User can control the dialog, introduce new topics, or make contribution
  • πŸ‘ Advantages

    • Allow natural dialogue in complex domains
  • πŸ‘Ž Disadvantages

    • Such agents are usually very complex
    • Hard to build 😒

Limitations of Rule-based DM

  • Expensive to build Manual work

  • Fragile to ASR errors

  • No self-improvement over time

Statistical DM

  • Motivation

    • User intention can ONLY be imperfectly known

      • Incompleteness – user may not specify full intention initially
      • Noisiness – errors from ASR/SLU
    • Automatic learning of dialog strategies

      • Rule based time consuming
  • πŸ‘ Advantages

    • Maintain a distribution over multiple hypotheses for the correct dialog state

      • Not a single hypothesis for the dialog state
    • Choose actions through an automatic optimization process

    • Technology is not domain dependent

      • same technology can be applied to other domain by learning new domain data

Markov Decision Process (MDP)

  • A model for sequential decision making problems

    • Solved using dynamic programming and reinforcement learning
    • MDP based SDM: dialog evolves as a Markov process
  • Specified by a tuple (S,A,T,R)(S, A, T, R)

    • SS: a set of possible world states s∈Ss \in S

    • AA: a set of possible actions a∈Aa\in A

    • RR: a local real-valued reward function

      R:SΓ—A↦R R: S \times A \mapsto \mathcal{R}
    • TT: a transition mode

      T(s_tβˆ’1,a_tβˆ’1,s_t)=P(s_t∣s_tβˆ’1,a_tβˆ’1) T(s\_{t-1}, a\_{t-1}, s\_t) = P(s\_t | s\_{t-1}, a\_{t-1})
  • 🎯 Goal of MDP based SDM: Maximize its expected cumulative (discounted) reward

    E(βˆ‘_t=0∞γtR(s_t,a_t)) E\left(\sum\_{t=0}^{\infty} \gamma^{t} R\left(s\_{t}, a\_{t}\right)\right)
  • Requires complete knowledge of SS !!!

Reinforcement Learning

  • β€œLearning through trial-and-error” (reward/penalty)

  • πŸ”΄ Problem

    • No direct feedback

    • Only feedback at the end of dialog

  • 🎯 Goal: Learn evaluation function from feedback

  • πŸ’‘ Idea

    • Initial all operations have equal probability

    • If dialog was successful –> all operations are positive

    • If dialog was negative –> operations negative

How RL works?

  • There is an agent with the capacity to act

  • Each action influences the agent’s future state

  • Success is measured by a scalar reward signal

  • In a nutshell:

    • Select actions to maximize future reward

    • Ideally, a single agent could learn to solve any task πŸ’ͺ

Sequential Decision Making

  • 🎯 Goal: select actions to maximize total future reward
  • Actions may have long term consequences
  • Reward may be delayed
  • It may be better to sacrifice immediate reward to gain more long-term reward πŸ€”

Agent and Environment

(Screenshot 2020-09-20 15.50.33)

At each step $t$

  β€’ Agent:
    β€’ Receives state $s_t$
    β€’ Receives scalar reward $r_t$
    β€’ Executes action $a_t$
  β€’ The environment:
    β€’ Receives action $a_t$
    β€’ Emits state $s_t$
    β€’ Emits scalar reward $r_t$
  • The evolution of this process is called a Markov Decision Process (MDP)

Supervised Learning Vs. Reinforcement Learning

Supervised Learning:

(Screenshot 2020-09-20 16.04.21)
  • Label is given: we can compute gradients given label and update our parameters

Reinforcement Learning

(Screenshot 2020-09-20 16.05.11)

  • NO label given: instead we have feedback from the environment
  • Not an absolute label / error. We can compute gradients, but do not yet know if our action choice is good. πŸ€ͺ

Policy and Value Functions

  • Policy Ο€\pi : a probability distribution of actions given a state

    a=Ο€(s) a = \pi(s)
  • Value function QΟ€(s,a)Q^\pi(s, a) : the expected total reward from state ss and action aa under policy Ο€\pi

    QΟ€(s,a)=E[r_t+1+Ξ³r_t+2+Ξ³2r_t+3+β‹―βˆ£s,a] Q^{\pi}(s, a)=\mathbb{E}\left[r\_{t+1}+\gamma r\_{t+2}+\gamma^{2} r\_{t+3}+\cdots \mid s, a\right]
    • β€œHow good is action aa in state ss?”
      • Same reward for two actions, but different consequences down the road
      • Want to update our value function accordingly

Approaches to RL

  • Policy-based RL

    • Search directly for the optimal policy Ο€\*\pi^\*

      (policy achieving maximum future reward)

  • Value-based RL

    • Estimate the optimal value function Qβˆ—(s,a)Q^{βˆ—}(s,a) (maximum value achievable under any policy)
    • Q-Learning: Learn Q-Function that approximates Qβˆ—(s,a)Q^{βˆ—}(s,a)
      • Maximum reward when taking action aa in ss
      • Policy: Select action with maximal QQ value
      • Algorithm:
        • Initialized QQ randomly
        • Q(s,a)←(1βˆ’Ξ±)Q(s,a)+Ξ±(r_t+Ξ³β‹…max⁑aQ(s_t+1,a))Q(s, a) \leftarrow(1-\alpha) Q(s, a)+\alpha\left(r\_{t}+\gamma \cdot \underset{a}{\max} Q\left(s\_{t+1}, a\right)\right)

Goal-oriented Dialogs: Statistical POMDP

POMDP : Partially Observable Markov Decision Process

  • MDP –> POMDP: all states ss cannot observed

    • POMDP based SDM –> reinforcement learning + belief state tracking

      • dialog evolves as a Markov process P(s_t∣s_tβˆ’1,a_tβˆ’1)P(s\_t | s\_{t-1}, a\_{t-1})

      • s_ts\_t is NOT directly observable

        –> belief state b(s_t)b(s\_t): prob. distribution of all states

      • SLU outputs a noisy observation o_to\_t of the user input with prob. P(o_t∣s_t)P(o\_t|s\_t)

  • Specified by tuple (S,A,T,R,O,Z)(S, A, T, R, O, Z)

    • S,A,T,RS, A, T, R constitute an MDP

    • OO: a finite set of observations received from the environment

    • ZZ: the observation function s.t.

      Z(o_t,s_t,a_tβˆ’1)=P(o_t∣s_t,a_tβˆ’1) Z(o\_t,s\_t,a\_{t-1}) = P(o\_t|s\_t,a\_{t-1})
  • Local reward is the expected reward ρ\rho over belief states

    ρ(b,a)=βˆ‘_s∈SR(s,a)β‹…b(s) \rho(b, a)=\sum\_{s \in S} R(s, a) \cdot b(s)
  • Goal: maximize the expected cumulative reward.

  • Operation (at each time step)

    ζˆͺ屏2020-09-20 17.07.48 - World is in unobserved state s_ts\_t
    • Maintain distribution over all possible states with b_tb\_t

      b_t(s_t)=Probability of being in state s_t b\_t(s\_t) = \text{Probability of being in state } s\_t
    • DM selects action a_ta\_t based on b_tb\_t

    • Receive reward r_tr\_t

    • Transition to unobserved state s_t+1s\_{t+1} ONLY depending on s_ts\_t and a_ta\_t

    • Receive obserservation o_t+1o\_{t+1} ONLY depending on a_ta\_t and s_t+1s\_{t+1}

  • Update of belief state

    b_t+1(s_t+1)=Ξ·P(o_t+1∣s_t+1,a_t)βˆ‘_s_tP(s_t+1∣s_t,a_t)b_t(s_t) b\_{t+1}\left(s\_{t+1}\right)=\eta P\left(o\_{t+1} \mid s\_{t+1}, a\_{t}\right) \sum\_{s\_{t}} P\left(s\_{t+1} \mid s\_{t}, a\_{t}\right) b\_{t}\left(s\_{t}\right)
  • Policy Ο€\pi:

    Ο€(b)∈A \pi(b) \in \mathbb{A}
  • Value function:

    VΟ€(b_t)=E[r_t+Ξ³r_t+1+Ξ³2r_t+2+…] V^{\pi}\left(b\_{t}\right)=\mathbb{E}\left[r\_{t}+\gamma r\_{t+1}+\gamma^{2} r\_{t+2}+\ldots\right]
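A small numeric sketch of the belief update above for a discrete POMDP. The two candidate user goals, the static goal-transition model, and the 80%-accurate SLU observation model are made-up placeholders.

```python
# Belief update: b'(s') is proportional to P(o' | s', a) * sum_s P(s' | s, a) * b(s)
STATES = ["wants_cheap", "wants_expensive"]

def update_belief(belief, action, observation, trans_prob, obs_prob):
    """belief: dict state -> prob; trans_prob(s, a, s2) and obs_prob(o, s2, a) are callables."""
    new_belief = {}
    for s_next in STATES:
        predicted = sum(trans_prob(s, action, s_next) * belief[s] for s in STATES)
        new_belief[s_next] = obs_prob(observation, s_next, action) * predicted
    eta = 1.0 / sum(new_belief.values())          # normalisation constant
    return {s: eta * p for s, p in new_belief.items()}

# Example: the user goal never changes and the SLU is right 80% of the time.
trans = lambda s, a, s2: 1.0 if s == s2 else 0.0
obs = lambda o, s2, a: 0.8 if o == s2 else 0.2
b0 = {"wants_cheap": 0.5, "wants_expensive": 0.5}
print(update_belief(b0, "ask_price", "wants_cheap", trans, obs))
# {'wants_cheap': 0.8, 'wants_expensive': 0.2}
```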

POMDP model

(Screenshot 2020-09-20 23.07.52)

  • Two stochastic models

    • Dialogue model MM
      • Transition and observation probability model
      • In what state is the dialogue at the moment
    • Policy Model P\mathcal{P}
      • What is the best next action
  • Both models are optimized jointly

    • Maximize the expect accumulated sum of rewards
      • Online: Interaction with user
      • Offline: Training with corpus
  • Key ideas

    • Belief tracking
      • Represent uncertainty

      • Pursuing all possible dialogue paths in parallel

    • Reinforcement learning
      • Use machine learning to learn parameters
  • πŸ”΄ Challenges

    • Belief tracking
    • Policy learning
    • User simulation

Belief state

(Screenshot 2020-09-20 23.21.04)
  β€’ Information encoded in the state

    $$\begin{aligned} b_{t+1}\left(g_{t+1}, u_{t+1}, h_{t+1}\right) = & \ \eta P\left(o_{t+1} \mid u_{t+1}\right) \\ \cdot & \ P\left(u_{t+1} \mid g_{t+1}, a_{t}\right) \\ \cdot & \ \sum_{g_{t}} P\left(g_{t+1} \mid g_{t}, a_{t}\right) \\ \cdot & \ \sum_{h_{t}} P\left(h_{t+1} \mid g_{t+1}, u_{t+1}, h_{t}, a_{t}\right) \\ \cdot & \ b_{t}\left(g_{t}, h_{t}\right) \end{aligned}$$
    β€’ User goal $g_t$: information from the user necessary to fulfill the task
    β€’ User utterance $u_t$
      β€’ What was said
      β€’ Not what was recognized
    β€’ Dialogue history $h_t$
  • Using independence assumptions

  • Observation model: Probability of observation oo given uu

    • Reflect speech understanding errors
  • User model: Probability of the utterance given previous output and new state

  • Goal transition model

  • History model

  • Model still too complex πŸ€ͺ

    • Solution
      • n-best approach
      • Factored approach
      • Combination is possible

Policy

  • Mapping between belief states and system actions
  • 🎯 Goal: Find optimal policy π’
  • Problem: State and action space very large
  • But:
    • Small part of belief space only visited
    • Plausible actions at every point very restricted
  • Summary space: Simplified representation

πŸ”΄ Disadvantages

  • Predefine structure of the dialog states

    • Location

    • Price range

    • Type of cuisine

  • Limited to very narrow domain

  • Cannot encode all features/slots that might be useful

Neural Dialog Models

  • End-to-End training

    • Optimize all parameters jointly
  • Continuous representations

    • No early decision
    • No propagation of errors
  • Challenges

    • Representation of history/context
    • Policy- Learning
      • Interactive learning
    • dIntegration of knowledge sources

Datasets

  • Goal oriented

    • bAbI task

      • Synthetic data – created by templates
    • DSTC (Dialog State tracking challenge)

      • Restaurant reservation

      • Collected using 3 dialog managers

      • Annotated with dialog states

  • Social dialog

    • Learn from human-human communication

Architecture

Memory Networks

(Screenshot 2020-09-20 17.43.00)
  • Neural network model

  • Writing and reading from a memory component

  • Store dialog history

    • Learn to focus on important parts

Sequence-to-Sequence Models: Encoder-Decoder

(Screenshot 2020-09-20 22.50.42)
  β€’ Encoder

    β€’ Reads in the input
    β€’ Represents the content in a fixed-dimension hidden vector (see the sketch below)
    β€’ LSTM-based model
  β€’ Decoder

    β€’ Generates the output
    β€’ Uses the fixed-dimension vector as input
    β€’ LSTM-based model
    β€’ EOS symbol signals when to start outputting

Example

(Screenshot 2020-09-20 22.52.57)
  • Recurrent-based Encoder-Decoder Architecture

  • Trained end-to-end.

  • Encoder

    (Screenshots 2020-09-20 22.54.14, 22.54.27, 22.54.47, 22.55.02)
  β€’ Decoder

    (Screenshots 2020-09-20 22.55.31, 22.55.54)

Dedicated Dialog Architecture

(Screenshot 2020-09-20 22.57.55)

(Screenshot 2020-09-20 22.58.59)

Training

Supervised learning

  • Supervised: Learning from corpus

  • Algorithm:

    • Input user utterance
    • Calculate system output
    • Measure error
    • Backpropagation error
    • Update weights
  • Problem:

    • Error lead to different dialogue state
    • Compounding errors

Imitation learning

  • Imitation learning
    • Interactive learning

    • Correct mistakes and demonstrate expected actions

  • Algorithm: same as supervised learning
  • Problem: costly

Deep reinforcement learning

  • Imitation learning

    • Interactive learning
    • Feedback only at end of the dialogue
      • Successful/ Failed task

      • Additional reward for fewer steps πŸ‘

  • Challenge:

    • Sampling of different actions
    • Hugh action space