Lecture Notes | Haobin Tan

Math

Sat, 04 Jun 2022 00:00:00 +0000

Tutorials

Statistics: summary of statistics
Statistics tutorials by Studyflix 👍
YouTube channel “Math by Daniel Jung” (clearly explained, with examples) 👍

Events and Probability

Sat, 04 Jun 2022 00:00:00 +0000

Events

A finite sample space (Ergebnisraum) of a random experiment is a non-empty set

$$ \Omega=\left\{\omega_{1}, \omega_{2}, \ldots, \omega_{N}\right\}. $$ That is, $\Omega$ contains all possible outcomes.

The elements $\omega_{n} \in \Omega$ are called outcomes (Ergebnisse), the possible results of a random experiment.

Every subset $A \subset \Omega$ is called an event (Ereignis), i.e. a collection of outcomes.

Every one-element subset $\left\{\omega_{n}\right\} \subset \Omega$ is called an elementary event (Elementarereignis).

$\rightarrow$ The sample space $\Omega$ (the certain event) and the empty set $\emptyset$ (the impossible event) are always events.

For two events $A$ and $B$

If $A \subset B$, then $A$ is a sub-event of $B$.
The intersection $(A \cap B)$, the union $(A \cup B)$, and the difference $(A-B)$ are also events.
- Intersection and union are commutative, associative and distributive.
The opposite event $\bar{A}$ of $A$ is also an event and is called the negation or complement.
If $A \cap B=\varnothing$, then $A$ and $B$ are called disjoint or mutually exclusive.
De Morgan's laws
$$ \begin{array}{l} \overline{A \cup B}=\bar{A} \cap \bar{B} \\ \overline{A \cap B}=\bar{A} \cup \bar{B} \end{array} $$

Example

Rolling a die.

Sample space $\Omega = \\{1, 2, 3, 4, 5, 6\\}$ (so $|\Omega| = 6$)
Example events
- “The die shows an odd number.”
- “The die shows a 3.”
- “The die shows a 7.” (the impossible event)
Event $A$ = “The die shows an odd number.” = $\\{1, 3, 5\\}$. Event $B$ = “The die shows an even number.” = $\\{2, 4, 6\\}$. $A \cap B = \emptyset$ $\Rightarrow$ $A$ and $B$ are disjoint (mutually exclusive).
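The die example can be replayed with Python sets. This is only a small sketch: the names `omega`, `A`, `B` and `complement` are chosen here for illustration, and the check covers this one example, not De Morgan's laws in general.

```python
# Events of a single die roll modelled as Python sets (illustrative names).
omega = set(range(1, 7))          # sample space {1, ..., 6}
A = {1, 3, 5}                     # "the die shows an odd number"
B = {2, 4, 6}                     # "the die shows an even number"


def complement(event):
    """Complement of an event relative to the sample space omega."""
    return omega - event


disjoint = A.isdisjoint(B)        # A ∩ B = ∅

# De Morgan's laws, checked on this concrete pair of events
de_morgan_union = complement(A | B) == complement(A) & complement(B)
de_morgan_inter = complement(A & B) == complement(A) | complement(B)

print(disjoint, de_morgan_union, de_morgan_inter)
```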


Probability (after Kolmogorov)

A non-empty system $\mathfrak{B}$ of subsets of a sample space $\Omega$ is called a $\sigma$-algebra (over $\Omega$) if

$$ \begin{array}{c} A \in \mathfrak{B} \quad \Rightarrow \quad \bar{A} \in \mathfrak{B}, \\ A_{n} \in \mathfrak{B} ; n=1,2, \ldots \quad \Rightarrow \quad \bigcup_{n=1}^{\infty} A_{n} \in \mathfrak{B}. \end{array} $$

An at most countable system

$$\left\{A_{n} \in \mathfrak{B}: A_{k} \cap A_{n}=\varnothing, k \neq n\right\}$$

is called a complete system of disjoint events (vollständige Ereignisdisjunktion) if $\bigcup_{n=1}^{\infty} A_{n}=\Omega$.

Kolmogorov's axioms

Let a sample space $\Omega$ and a suitable $\sigma$-algebra $\mathfrak{B}$ over $\Omega$ be given. The elements of $\mathfrak{B}$ are thus the events of a random experiment.

Suppose a function $P$ that assigns a real number to every event $A \in \mathfrak{B}$ satisfies

$$ \begin{aligned} \mathrm{P}(\Omega) &=1 \quad &(\text{normalization})\\ \mathrm{P}(A) & \geq 0 \quad \forall A \in \mathfrak{B} \quad &(\text{non-negativity}) \\ \mathrm{P}\left(\bigcup_{n=1}^{\infty} A_{n}\right) &=\sum_{n=1}^{\infty} \mathrm{P}\left(A_{n}\right) \quad A_i \cap A_j = \emptyset, \forall i \neq j \quad &(\text{additivity}) \end{aligned} $$

Then $P(A)$ is called the probability of the event $A$.

Example

Rolling a die

Sample space $\Omega = \\{1, 2, 3, 4, 5, 6\\}$

Events $E_i$ = “the die shows the number $i$” for $i = 1, \ldots, 6$ (e.g. $E_1$ means the die shows 1).

Then we have:

$$ \begin{aligned} P(E_1) &= \frac{1}{6} \\ P(E_2) &= \frac{1}{6} \\ P(\Omega) &= \frac{6}{6} = 1 \\ P(E_1 \cup E_2) &= \frac{1}{6} + \frac{1}{6} = \frac{2}{6} \quad (E_1 \cap E_2 = \emptyset) \end{aligned} $$

Reference:

<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
<iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="allowfullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/GtpN4SRESaA?autoplay=0&controls=1&end=0&loop=0&mute=0&start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"
></iframe>
</div>

From this it follows that

$$ \begin{aligned} \mathrm{P}(\varnothing) &=0, \\ \mathrm{P}(\bar{A}) &=1-\mathrm{P}(A), \\ 0 \leq \mathrm{P}(A) & \leq 1, \\ \mathrm{P}(A \cup B) &=\mathrm{P}(A)+\mathrm{P}(B)-\mathrm{P}(A \cap B), \\ \mathrm{P}\left(\bigcup_{n=1}^{\infty} A_{n}\right) &=1 \quad \text { for every complete system of disjoint events } A_{n} . \end{aligned} $$
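These consequences can be checked on the fair die. The sketch below assumes the Laplace model $P(A) = |A| / |\Omega|$; the helper `prob` and the events `A` and `B` are made up for the illustration.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))


def prob(event):
    """Laplace probability |A| / |Omega| for a fair die (an assumption)."""
    return Fraction(len(event & omega), len(omega))


A = {1, 3, 5}      # odd number
B = {1, 2, 3}      # at most 3

p_empty = prob(set())                               # P(∅)
p_complement = prob(omega - A)                      # P(not A)
p_union = prob(A | B)                               # P(A ∪ B)
p_incl_excl = prob(A) + prob(B) - prob(A & B)       # P(A) + P(B) - P(A ∩ B)

print(p_empty, p_complement, p_union, p_incl_excl)
```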

Conditional Probabilities

Let $B \subset \Omega$ be an event that is assumed to have occurred, with $A, B \in \mathfrak{B}$ and $\mathrm{P}(B)>0$. Then

$$ \mathrm{P}(A \mid B)=\frac{\mathrm{P}(A \cap B)}{\mathrm{P}(B)} $$

is called the conditional probability of $A$ given $B$.

Multiplication rule for probabilities

$$ \mathrm{P}(A \cap B)=\mathrm{P}(A \mid B) \mathrm{P}(B) $$

In general, $\mathrm{P}(A \mid B) \neq \mathrm{P}(B \mid A)$. The following relation holds:

$$ \mathrm{P}(A \mid B) \mathrm{P}(B)=\mathrm{P}(A \cap B) = \mathrm{P}(B \mid A) \mathrm{P}(A) $$

Generalization: applying the multiplication rule repeatedly to the intersection of $N$ random events yields

$$ \begin{aligned} &\mathrm{P}\left(\bigcap_{n=1}^{N} A_{n}\right) \\ =&\mathrm{P}\left(\bigcap_{n=2}^{N} A_{n} \mid A_{1}\right) \mathrm{P}\left(A_{1}\right) \\ =&\mathrm{P}\left(\bigcap_{n=3}^{N} A_{n} \mid A_{2} \cap A_{1}\right) \mathrm{P}\left(A_{2} \mid A_{1}\right) \mathrm{P}\left(A_{1}\right) \\ =&\mathrm{P}\left(\bigcap_{n=4}^{N} A_{n} \mid A_{3} \cap A_{2} \cap A_{1}\right) \mathrm{P}\left(A_{3} \mid A_{2} \cap A_{1}\right) \mathrm{P}\left(A_{2} \mid A_{1}\right) \mathrm{P}\left(A_{1}\right) \\ =&\mathrm{P}\left(A_{N} \mid \bigcap_{n=1}^{N-1} A_{n}\right) \cdots \mathrm{P}\left(A_{4} \mid A_{3} \cap A_{2} \cap A_{1}\right) \mathrm{P}\left(A_{3} \mid A_{2} \cap A_{1}\right) \mathrm{P}\left(A_{2} \mid A_{1}\right) \mathrm{P}\left(A_{1}\right) \end{aligned} $$

Example

Simplification with 3 events

$$ \begin{array}{ll} &P(A) \cdot P(B \mid A) \cdot P(C \mid A \cap B) \\\\ =&P(A) \cdot \frac{P(A \cap B)}{P(A)} \cdot \frac{P(A \cap B \cap C)}{P(A \cap B)} \\\\ =&P(A \cap B \cap C) \end{array} $$


Law of total probability

Let the events $A_{n}(1 \leq n \leq N)$ form a complete system of disjoint events (so $A_i \cap A_j = \emptyset, \forall i \neq j$, and $\bigcup_{n} A_n = \Omega$) with $\mathrm{P}\left(A_{n}\right)>0, \forall n$. Then for every $B \in \mathfrak{B}$ the law of total probability holds:

$$ \mathrm{P}(B)=\sum_{n=1}^{N} \mathrm{P}\left(B \mid A_{n}\right) \mathrm{P}\left(A_{n}\right) $$

Example

$A$ and $\bar{A}$ form a complete system of disjoint events, since $A \cap \bar{A} = \emptyset$ and $A \cup \bar{A} = \Omega$:

$$ \begin{aligned} P(B)&=P(B \cap A)+P(B \cap \bar{A}) \\\\ &=P(A)P(B \mid A)+P(\bar{A})P(B \mid \bar{A}) \end{aligned} $$

If in addition $P(B) > 0$, Bayes' formula follows:

$$ \mathrm{P}\left(A_{n} \mid B\right)=\frac{\mathrm{P}\left(B \mid A_{n}\right) \mathrm{P}\left(A_{n}\right)}{\sum_{k=1}^{N} \mathrm{P}\left(B \mid A_{k}\right) \mathrm{P}\left(A_{k}\right)} $$
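A small numeric sketch of both formulas. The scenario is hypothetical and not from the notes: two machines $A_1$, $A_2$ produce 60% and 40% of all parts, with defect rates of 1% and 5%, and $B$ is the event "a part is defective".

```python
from fractions import Fraction

# Hypothetical numbers, chosen only for illustration
prior = {"A1": Fraction(3, 5), "A2": Fraction(2, 5)}            # P(A_n)
likelihood = {"A1": Fraction(1, 100), "A2": Fraction(1, 20)}    # P(B | A_n)

# Law of total probability: P(B) = sum_n P(B | A_n) P(A_n)
p_b = sum(likelihood[n] * prior[n] for n in prior)

# Bayes' formula: P(A_n | B) = P(B | A_n) P(A_n) / P(B)
posterior = {n: likelihood[n] * prior[n] / p_b for n in prior}

print(p_b, posterior)
```

The posteriors sum to 1 because the $A_n$ form a complete system of disjoint events.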

In general, $\mathrm{P}(A) \neq \mathrm{P}(A \mid B)$. If, however, for $A, B \in \mathfrak{B}$

$$ \mathrm{P}(A \mid B)=\mathrm{P}(A), $$

then $A$ is called independent of $B$.

For independent events it follows that

$$ \begin{array}{c} \mathrm{P}(A \cap B)=\mathrm{P}(A \mid B) \mathrm{P}(B)=\mathrm{P}(A) \mathrm{P}(B) \\ \mathrm{P}(B \mid A)=\frac{\mathrm{P}(A \cap B)}{\mathrm{P}(A)}=\mathrm{P}(B) \end{array} $$

Glossary

Mon, 01 Mar 2021 00:00:00 +0000

Router

Acronym	Full Name
ISP	Internet Service Provider
IXP	Internet Exchange Point
FIB	Forwarding Information Base
CAM	Content-Addressable Memory

Internet Routing

Acronym	Full Name
AS	Autonomous System
IGP	Interior Gateway Protocol
EGP	Exterior Gateway Protocol
ASN	Autonomous System Number
CDN	Content Delivery Network
OSPF	Open Shortest Path First
LSA	Link State Advertisement
ABR	Area Border Router
BGP	Border Gateway Protocol
RIB	Routing Information Base

Label Switching

Acronym	Full Name
MPLS	Multiprotocol Label Switching
LSR	Label-switching router
LER	Label edge router
FEC	Forwarding equivalence class
RSVP	Resource ReSerVation Protocol
VPN	Virtual Private Network

Software Defined Network (SDN)

Network Function Virtualization (NFV)

Acronym	Full Name
NAT	Network Address Translation
NFVI	Network Function Virtualization Infrastructure
VNF	Virtualized Network Function
MANO	Management and Orchestration
SFC	Service Function Chaining

Congestion Control

Acronym	Full Name
RTT	Round Trip Time
EWMA	Exponentially Weighted Moving Average
RTO	Retransmission TimeOut
AIMD	Additive Increase Multiplicative Decrease
AQM	Active Queue Management
ECN	Explicit Congestion Notification
RED	Random Early Detection

Ethernet

Acronym	Full Name
CSMA	Carrier Sense Multiple Access
CD	Collision Detection
CA	Collision Avoidance
IFS	Inter Frame Space
BPDU	Bridge Protocol Data Units
STP	Spanning Tree Protocol
RSTP	Rapid Spanning Tree Protocol

Data Center

Acronym	Full Name
PFC	Priority-based Flow Control
PCP	Priority Code Point
ETS	Enhanced Transmission Selection
PG	Priority Groups
QCN	Quantized Congestion Notification
SPB	Shortest Path Bridging
TRILL	Transparent Interconnection of Lots of Links
IS-IS	Intermediate-System-to-Intermediate-System
DCTCP	Data Center TCP
ECN	Explicit Congestion Notification

TCP Evolution

Acronym	Full Name
TLV	Type-Length-Value
TFO	TCP Fast Open

Access Networks

Acronym	Full Name
ISDN	Integrated Services Digital Network
NT	Network Termination
DSL	Digital Subscriber Line
ADSL	Asymmetric DSL
SDSL	Symmetric DSL
BRAS	Broadband Remote Access Server

(Dirac) Delta Distribution / Delta Function

Sat, 04 Jun 2022 00:00:00 +0000

Definition

The delta distribution (a.k.a. Dirac function, Dirac measure, impulse function) is a special irregular (singular) distribution with compact support.

$$ \begin{array}{c} \delta(x)=0, \quad x \neq 0 \\\\ \displaystyle \int_{a}^{b} \delta(x) \mathrm{d} x=1, \quad a < 0 < b \end{array} $$

Illustration: The delta function at the origin is drawn as an arrow at $x=0$ and represents a point charge (Source: Dirac’sche Delta-Funktion und ihre Eigenschaften).

Delta function at the origin

Consider an integral of the delta function together with a test function $f(x)$
$$ \int_{a}^{b} f(x) \delta(x) \mathrm{d} x $$
$\delta(x)$ is $0$ everywhere except at $x=0$.

$\Rightarrow$ $f(x)\delta(x)$ is $0$ everywhere except at $x=0$.

$\Rightarrow$ Only the function value $f(0)$, which does not depend on $x$, survives in the integral.

Therefore:
$$ \int_{a}^{b} f(x) \delta(x) \mathrm{d} x= \int_{a}^{b} f(0)\delta(x) \mathrm{d} x=f(0) \underbrace{\int_{a}^{b} \delta(x)\mathrm{d} x}_{=1} = f(0) $$
Properties

When computing, applying or verifying the properties of the Dirac function, the substitution rule is the key tool.

Shifted delta function

Move the charge to another position on the $x$-axis (e.g. to $x=x_0$). The delta function then becomes $\delta(x-x_0)$.

The shifted delta function multiplied by another function $f(x)$ under the integral:
$$ \int_{a}^{b} f(x) \delta\left(x-x_{0}\right) \mathrm{d} x=f\left(x_{0}\right) $$

Proof
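One way to see this is the substitution $u = x - x_0$, so that $\mathrm{d}u = \mathrm{d}x$. The sketch assumes $a < x_0 < b$, so that the shifted limits enclose the origin; the symbol $u$ is introduced only for this derivation.

$$ \begin{aligned} \int_{a}^{b} f(x) \delta\left(x-x_{0}\right) \mathrm{d} x &= \int_{a-x_{0}}^{b-x_{0}} f\left(u+x_{0}\right) \delta(u) \mathrm{d} u \\\\ &= f\left(0+x_{0}\right) \underbrace{\int_{a-x_{0}}^{b-x_{0}} \delta(u) \mathrm{d} u}_{=1} = f\left(x_{0}\right) \end{aligned} $$

The second step uses the result for the delta function at the origin, applied to the function $u \mapsto f(u + x_0)$.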

The delta function shifted to the right picks out the value $f(x_0)$ of the function at $x=x_0$.

Example

Example

A delta function outside the integration limits

Symmetry

The delta function is symmetric (even)
$$ \delta(x) = \delta(-x) $$

Proof

Scaling

Scaled argument of the delta function
$$ \int_{a}^{b} f(x) \delta(|k| x) \mathrm{d} x=\frac{1}{|k|} f(0) $$

Proof
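A possible derivation, again via substitution. It assumes $k \neq 0$ and $a < 0 < b$; with $u = |k| x$ we get $\mathrm{d}x = \mathrm{d}u / |k|$, and the new limits $|k| a < 0 < |k| b$ still enclose the origin.

$$ \begin{aligned} \int_{a}^{b} f(x) \delta(|k| x) \mathrm{d} x &= \int_{|k| a}^{|k| b} f\left(\frac{u}{|k|}\right) \delta(u) \frac{1}{|k|} \mathrm{d} u \\\\ &= \frac{1}{|k|} f\left(\frac{0}{|k|}\right) = \frac{1}{|k|} f(0) \end{aligned} $$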

Composition
$$ \int_{-\infty}^{\infty} f(x) \delta(g(x)) \mathrm{d} x=\sum_{i=1}^{n} \frac{f\left(x_{i}\right)}{\left|g^{\prime}\left(x_{i}\right)\right|} $$
where the $x_i$ are the zeros of $g$, i.e. $g(x_i) = 0$, and $g^\prime(x_i) \neq 0$.

Proof

Substitute
$$ u := g(x) $$
Then:
$$ \begin{aligned} x &= g^{-1}(u) \\\\ \frac{du}{dx} &= g^\prime(x) = g^\prime(g^{-1}(u)) \end{aligned} $$
Since $\delta(u) \neq 0$ only at $u = 0$, we can restrict the integration to small intervals around each zero $x_i$ of $g(x)$, on which $g(x)$ is monotonic and therefore invertible.
$$ \begin{aligned} \int f(x) \delta(g(x)) d x &=\sum_{i} \int_{x_{i}-\varepsilon_{i}}^{x_{i}+\varepsilon_{i}} f(x) \delta(g(x)) d x \\\\ &=\sum_{i} \int_{g\left(x_{i}-\varepsilon_{i}\right)}^{g\left(x_{i}+\varepsilon_{i}\right)} f\left(g^{-1}(u)\right) \delta(u) \frac{1}{g^{\prime}\left(g^{-1}(u)\right)} d u \\\\ &=\sum_{i} \int_{g\left(x_{i}-\varepsilon_{i}\right)}^{g\left(x_{i}+\varepsilon_{i}\right)} \frac{f\left(g^{-1}(u)\right)}{g^{\prime}\left(g^{-1}(u)\right)} \delta(u) d u \\\\ &=\sum_{i} \int_{g\left(x_{i}-\varepsilon_{i}\right)}^{g\left(x_{i}+\varepsilon_{i}\right)} \frac{f\left(x_{i}\right)}{g^{\prime}\left(x_{i}\right)} \delta(u) d u \quad(\ast) \end{aligned} $$
$g^\prime (x_i) > 0$ :
$$ \begin{aligned} (\ast) &=\sum\_{i} \frac{f\left(x\_{i}\right)}{g^{\prime}\left(x\_{i}\right)} \underbrace{\int\_{g\left(x\_{i}-\varepsilon\_{i}\right)}^{g\left(x\_{i}+\varepsilon\_{i}\right)} \delta(u) d u}\_{=1} \\\\ &=\sum\_{i} \frac{f\left(x\_{i}\right)}{g^{\prime}\left(x\_{i}\right)} \\\\ &=\sum\_{i} \frac{f\left(x\_{i}\right)}{|g^{\prime}\left(x\_{i}\right)|} \end{aligned} $$
$g^\prime (x_i) < 0$ :

Then
$$ g(x_i + \varepsilon_i) < g(x_i - \varepsilon_i) $$
so the upper limit in $(\ast)$ lies below the lower limit. Swapping the limits flips the sign:
$$ \begin{aligned} (\ast) &=\sum\_{i} \frac{f\left(x\_{i}\right)}{g^{\prime}\left(x\_{i}\right)} \int\_{g\left(x\_{i}-\varepsilon\_{i}\right)}^{g\left(x\_{i}+\varepsilon\_{i}\right)} \delta(u) d u \\\\ &=\sum\_{i} -\frac{f\left(x\_{i}\right)}{g^{\prime}\left(x\_{i}\right)} \underbrace{\int\_{g\left(x\_{i}+\varepsilon\_{i}\right)}^{g\left(x\_{i}-\varepsilon\_{i}\right)} \delta(u) d u}\_{=1} \\\\ &=\sum\_{i} \frac{f\left(x\_{i}\right)}{\left|g^{\prime}\left(x\_{i}\right)\right|} \end{aligned} $$
Hence
$$ \int_{-\infty}^{\infty} f(x) \delta(g(x)) \mathrm{d} x=\sum_{i=1}^{n} \frac{f\left(x_{i}\right)}{\left|g^{\prime}\left(x_{i}\right)\right|} \qquad (\square) $$
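The composition rule can be sanity-checked numerically. In the sketch below the delta function is replaced by a narrow Gaussian of width `eps`, which is an approximation and not part of the derivation above; the test functions $f(x) = x + 2$ and $g(x) = x^2 - 1$ are arbitrary choices for illustration.

```python
import math


def delta_eps(u, eps=0.01):
    """Narrow Gaussian used as a stand-in for the delta function."""
    return math.exp(-u * u / (2 * eps * eps)) / (eps * math.sqrt(2 * math.pi))


def f(x):
    return x + 2


def g(x):
    return x * x - 1          # zeros at x = -1 and x = +1


def g_prime(x):
    return 2 * x


# Left-hand side: Riemann sum of f(x) * delta(g(x)) over [-3, 3]
h = 1e-4
steps = 60_000
lhs = sum(f(-3 + i * h) * delta_eps(g(-3 + i * h)) for i in range(steps)) * h

# Right-hand side: sum of f(x_i) / |g'(x_i)| over the zeros of g
rhs = sum(f(x_i) / abs(g_prime(x_i)) for x_i in (-1.0, 1.0))

print(lhs, rhs)
```

Both sides agree up to the error introduced by the finite width `eps`.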

Ref: Dirac Delta Function of a Function

Reference

Dirac’sche Delta-Funktion und ihre Eigenschaften 👍👍👍

Router

Mon, 01 Mar 2021 00:00:00 +0000

Schematic view and generic architecture of router

Basic Functionalities

Intermediate Systems

Forward data from input port(s) to output port(s)

Forwarding is a task of the data path

May operate on different layers

Hubs operate on layer 1

Bridges operate on layer 2

Routers operate on layer 3

Routing

Determines the path that the packets follow

Routing is part of the control path $\rightarrow$ Requires routing algorithms and routing protocols

Forwarding within a Router

Main task

Lookup in forwarding table

Forward data from input port to output port(s)

🎯 Goals

Forwarding in line speed

Short queues

Small tables

Schematic View of an IP-Router:

Forwarding Functionality

Basic functions

Check the headers of an IP packet

Version number

Valid header length

Checksum

Check time to live

Decrement of TTL field

Recalculate checksum

Lookup

Determine output port for a packet

Fragmentation

Handle IP options

Possibly: differentiated treatment of packets

Classification

Prioritization

Challenge: Line Speed

Bandwidth demand increases

Link capacity has to increase as well to keep up

Types of Routers

Core router

Used by service providers

Need to handle large amounts of aggregated traffic

High speed and reliability essential

Fast lookup and forwarding needed

Redundancy to increase reliability (dual power supply …)

Cost secondary issue

Enterprise router

Connect end systems in companies, universities …

Provide connectivity to large number of end systems

Support of VLANs, firewalls …

Low cost per port, large number of ports, ease of maintenance

Edge router (access router)

At edge of service provider

Provide connectivity to customer from home, small businesses

Support for PPTP, IPsec, VPNs …

Forwarding Table Lookup

Example of a forwarding table

Prefix

Identifies a block of addresses

Contiguous blocks of addresses per output port are beneficial

Does not require a separate entry for each IP address $\rightarrow$ Scalability 👏

Longest Prefix Matching

Consider a typical problem: What to do if there are multiple prefixes in the forwarding table that match on a given destination address?

🔧 Solution: Select most specific prefix

most specific prefix = the longest prefix

Example

Efficient Prefix Search

Different approaches for fast prefix search (in software)

Binary trie

Path-compressed trie

Multibit-Tries

Hash tables

Efficient data structures

Requirements

Fast lookup

Low memory

Fast updates

Naïve approach: Simple Array

Variables

$N$ = number of prefixes

$W$ = length of a prefix (e.g., $W=32$ for full IPv4 addresses)

$k$ = length of a stride (only for multibit tries)

How does it work?

Store prefixes in a simple array (unordered)

Linear search

Remember best match while walking through array

Evaluation

Worst case lookup speed: $O(N)$ $\rightarrow$ pretty bad 🤪

Memory requirement: $O(N \cdot W)$ $\rightarrow$ pretty bad 🤪

Updates: $O(1)$
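The naïve approach can be sketched in a few lines with the standard `ipaddress` module. The table contents and the function name `lookup_linear` are hypothetical; the point is the linear scan that remembers the best (longest) match seen so far.

```python
import ipaddress

# Hypothetical forwarding table: unordered (prefix, output port) pairs
table = [
    (ipaddress.ip_network("0.0.0.0/0"), 0),
    (ipaddress.ip_network("10.0.0.0/8"), 1),
    (ipaddress.ip_network("10.1.0.0/16"), 2),
    (ipaddress.ip_network("10.1.2.0/24"), 3),
]


def lookup_linear(dst):
    """Longest prefix match by scanning all N entries: O(N) per lookup."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, port in table:
        # Keep the entry if it matches and is longer than the best so far
        if addr in prefix and (best is None or prefix.prefixlen > best[0].prefixlen):
            best = (prefix, port)
    return None if best is None else best[1]


print(lookup_linear("10.1.2.7"), lookup_linear("10.200.0.1"))
```

An update is just an append to `table`, which is why updates cost $O(1)$.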

Binary Trie

Tries $\rightarrow$ tree-based data structures to store and search prefix information

From „retrieval“ (find something)

💡 Idea: Bits in the prefix tell the algorithms what branch to take

Example

Evaluation

Worst case lookup speed: $O(W)$

Maximum of one node per bit in the prefix

But much better than naïve approach ($W \ll N$)

Memory requirement: $O(N \cdot W)$

Assumption: prefixes stored as linked list starting from root node

Every prefix (out of $N$) can have up to $W$ nodes $\rightarrow$ Maximum of $N \cdot W$ entries

No improvement (compared with naïve approach) 🤪

Updates: $O(W)$

A maximum of $W$ nodes has to be inserted or deleted (similar to lookup procedure)
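A minimal binary trie, assuming prefixes and addresses are given as strings of `0`/`1` characters. The class and method names and the toy prefixes are made up for this sketch.

```python
class Node:
    def __init__(self):
        self.children = [None, None]   # child for bit 0 and for bit 1
        self.port = None               # set if a prefix ends at this node


class BinaryTrie:
    def __init__(self):
        self.root = Node()

    def insert(self, prefix_bits, port):
        """Walk/create one node per prefix bit: O(W)."""
        node = self.root
        for bit in prefix_bits:
            b = int(bit)
            if node.children[b] is None:
                node.children[b] = Node()
            node = node.children[b]
        node.port = port

    def lookup(self, addr_bits):
        """Follow the address bits and remember the last prefix seen: O(W)."""
        node, best = self.root, self.root.port
        for bit in addr_bits:
            node = node.children[int(bit)]
            if node is None:
                break
            if node.port is not None:
                best = node.port
        return best


trie = BinaryTrie()
for bits, port in [("0", "A"), ("01", "B"), ("0111", "C"), ("10", "D")]:
    trie.insert(bits, port)

print(trie.lookup("01110000"), trie.lookup("01000000"))
```

Each lookup step corresponds to one memory access, which is the assumption behind the $O(W)$ bound.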

Performance

Can find prefix in $W$ steps $\rightarrow$ address space = $2^W$

$W = $ number of bits in address ($W = 32$ for IPv4, $W = 128$ for IPv6)

Assumption: separate memory access required for each step

Memory access time $t\_{\text{access}} = 10 ns = 10 ^{-8}s$

Maximum lookups $L$ per second:
$$ t\_{\text {lookup }}=32 * t\_{\text {access }}=320 n s \rightarrow L=\frac{1}{t\_{\text {lookup }}}=3,125,000 \text { lookups} / s $$
For 100 byte packets, this results in only $2.5$ Gbit/s
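The same arithmetic as a short script, using the values stated above (10 ns per memory access, one access per address bit, 100-byte packets). The function name is an assumption of this sketch.

```python
ACCESS_TIME_NS = 10          # one memory access
PACKET_SIZE_BYTES = 100


def lookup_rate(address_bits):
    """Worst case: one memory access per address bit."""
    lookup_time_ns = address_bits * ACCESS_TIME_NS
    lookups_per_second = 10**9 // lookup_time_ns
    throughput_gbit = lookups_per_second * PACKET_SIZE_BYTES * 8 / 10**9
    return lookup_time_ns, lookups_per_second, throughput_gbit


ipv4 = lookup_rate(32)
ipv6 = lookup_rate(128)
print(ipv4, ipv6)
```

Under the same assumptions, IPv6 ($W = 128$) quarters the lookup rate.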

Example

Construct binary trie

Optimization

Path compression

Multibit-Tries

Path Compression

Long sequences of one-child nodes waste memory

E.g., the highlighted (red) search paths in the following trie are not required for a branching decision

💡 Idea: Eliminate those sequences from trie

Lookup operation

Additional information required

Store bit index that has to be examined next

Evaluation

Worst case lookup speed: $O(W)$

If there are no one-child nodes on a path, number of nodes to search is equal to length of prefix

Memory requirement: $O(N)$

Maximum of $N$ leaf nodes, $N-1$ for the internal nodes

$\rightarrow$ Maximum of $2N-1$ entries

Improvement against binary trie 👏

Updates: $O(W)$

Example

Construct binary trie with path compression

Multibit Trie

Example: Homework 03

Hash Tables

🎯 Objectives

Improve lookup speed

Hash tables can perform lookup in $O(1)$

However: a hash table alone cannot perform longest prefix matching, since it only supports exact matches 🤪

Instead: use an additional hash table

Stores results of trie lookups

E.g., destination IP address 109.21.33.9 $\rightarrow$ output port 2

Significant improvement for large forwarding tables 👏

For each received IP packet

Does an entry for destination IP address exist in hash table?

Yes $\rightarrow$ no trie lookup

No $\rightarrow$ trie lookup

Works well if addresses show „locality“ characteristics

I.e., most IP packets are covered by a small set of prefixes

Not applicable in the Internet backbone
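The per-packet procedure can be sketched with a Python `dict` as the hash table. The trie is replaced here by a simple stand-in function (`slow_lpm`), and the two prefixes are hypothetical; only the mapping 109.21.33.9 to output port 2 is taken from the example above.

```python
import ipaddress

# Hypothetical prefixes; slow_lpm stands in for the trie lookup
PREFIXES = [
    (ipaddress.ip_network("109.21.0.0/16"), 2),
    (ipaddress.ip_network("0.0.0.0/0"), 1),
]
slow_lookups = 0


def slow_lpm(dst):
    """Stand-in for the (slow) trie-based longest prefix match."""
    global slow_lookups
    slow_lookups += 1
    addr = ipaddress.ip_address(dst)
    matches = [(p.prefixlen, port) for p, port in PREFIXES if addr in p]
    return max(matches)[1]          # entry with the longest prefix


cache = {}                          # hash table: exact destination -> port


def forward(dst):
    if dst not in cache:            # miss: fall back to the trie
        cache[dst] = slow_lpm(dst)
    return cache[dst]               # hit: O(1), no trie lookup


ports = [forward(d) for d in ["109.21.33.9", "109.21.33.9", "8.8.8.8", "109.21.33.9"]]
print(ports, slow_lookups)
```

Four packets trigger only two slow lookups because the destinations repeat, which is the locality the scheme depends on. A real cache would also need a size limit and invalidation when the forwarding table changes; both are omitted here.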

Comparison between Binary Trie, Path Compression, and Multibit Trie

$N$ = number of prefixes

$W$ = length of a prefix (e.g., $W=32$ for full IPv4 addresses)

$N \gg W$

$k$ = length of a stride (only for multibit tries)

| | Lookup speed | Memory requirement | Update |
|---|---|---|---|
| Binary trie | $O(W)$ | $O(NW)$ | $O(W)$ |
| Path compression | $O(W)$ | $O(N)$ | $O(W)$ |
| Multibit trie | | | |

Longest Prefix Matching in Hardware

RAM-based Access

💡Basic idea

Read information with a single memory access

Use destination IP address as RAM address

🔴 Problem

Independent of number of prefixes in use

IPv4 addresses with length of 32 bit $\rightarrow$ requires 4 GByte

IPv6 addresses with length of 128 bit $\rightarrow$ requires ~$3.4 × 10^{29}$ GByte

Waste of memory

Required memory size grows exponentially with size of address!

Content-Addressable Memory (CAM)

CAM: takes data and returns address (opposite to RAM)

CAM can search all stored entries in a single clock cycle (very fast!)

Application for networking: use addresses as search input to perform very fast address lookups (IP $\rightarrow$ output port)

Structure of CAM

How does CAM work?

Example

Source: [Content-Addressable Memory Introduction](https://www.pagiamtzis.com/cam/camintro/)

Ternary CAM (TCAM)

An extension that supports a „Don‘t Care“ State x (matching both a 0 and a 1 in that position)

Allows longest prefix matching

Prefixes are stored in the CAM sorted by prefix length (from long to short)

👍 Advantage: Very fast lookups (1 clock cycle)

🔴 Problems: Severe scalability limitations

High energy demand

All search words are looked up in parallel

Every core cell is required for every lookup

High cost / low density

TCAM requires 2-3 times the transistors compared to SRAM

Longest matching prefix requires strict ordering of prefixes in the TCAM

New entries can require the TCAM to be „re-ordered“ $\rightarrow$ This can take a significant amount of time!

Example: Homework 04

💡 Idea:

Sort prefixes according to their length (longest to shortest)

CAM part: (prefix, index) pair

RAM part: (index, egress port) pair
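A toy software model of this CAM/RAM split, assuming 8-bit addresses written as bit strings and `x` for the "don't care" state. All names and prefixes are illustrative. A real TCAM compares all rows in parallel, and its priority encoder returns the first matching row; the loop below only imitates that result.

```python
def ternary_match(pattern, key):
    """A pattern position matches if it is 'x' or equals the key bit."""
    return all(p in ("x", k) for p, k in zip(pattern, key))


def build_tcam(prefixes, width=8):
    """prefixes: list of (bit-string prefix, egress port)."""
    # Strict ordering: longest prefix first
    ordered = sorted(prefixes, key=lambda e: len(e[0]), reverse=True)
    # CAM part: (ternary pattern, index); RAM part: index -> egress port
    cam = [(p + "x" * (width - len(p)), i) for i, (p, _) in enumerate(ordered)]
    ram = {i: port for i, (_, port) in enumerate(ordered)}
    return cam, ram


def tcam_lookup(cam, ram, key):
    for pattern, index in cam:      # first match = longest matching prefix
        if ternary_match(pattern, key):
            return ram[index]
    return None


cam, ram = build_tcam([("10", "P1"), ("1011", "P2"), ("", "P0"), ("101", "P3")])
print(cam[0], tcam_lookup(cam, ram, "10110001"))
```

Inserting a new prefix means calling `build_tcam` again, which mirrors the re-ordering cost mentioned above.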

Router Architecture

Basic components

Network interfaces

Realize access to one of the attached networks

Functionalities of layers 1 and 2

Basic functions of IP

Including forwarding table lookup

Routing processor

Routing protocol

Management functionality

Switch fabric

„Backplane“

Realizes internal forwarding of packets from the input to the output port

Generic Router Architecture

Conflicting design goals

High efficiency

Line speed

Low delay

Vs. low cost

Type and amount of required storage

Type of switch fabric

Blocking

E.g., packets arriving at the same time at different input ports that need the same output port

Measures that can help prevent blocking

Overprovisioning

Internal circuits in switch fabric operate at a higher speed than the individual input ports

Buffering

Queue packets at appropriate locations until resources are available

At network interfaces

In switch fabric

Backpressure

Signal the overload back towards the input ports

Input ports can then reduce load

Parallel switch fabrics

Allows parallel transport of multiple packets to output ports

Requires higher access speed at output ports

Buffers

Problem: Simultaneous arrival of multiple packets for an output port

Sequential processing required, since packets can not be sent in parallel

Packets have to be buffered

Example

Packets arrive at input ports E1 and E2 at the same time, both must be forwarded to output A1

One out of the two packets requires buffering

Where to place the memory elements for buffering?

Input buffer

Output buffer

Distributed buffer

Central buffer

Evaluation of Alternatives

Parameters of switch fabric

$N$: Number of input and output ports

$M$: Total storage capacity

$S$: Speedup factor of the switch fabric

According to the speed of the input and output ports

$Z$: Cycle time of memory accesses

According to the transmission time of a packet at input and output ports

Delay and jitter (= variance of the delay)

Important

Additional mechanisms are required, e.g. flow control

Organization of memories, e.g. FIFO or RAM

In the following: simplifying assumptions

All ports operate at same data rate

All packets have same length

Input buffer

💡 Idea: conflict resolution at input of switch fabric

FIFO buffer per input port

Scheduling of inputs, e.g.

Round robin, priority controlled, depending on buffer levels, …

Jitter varies

Switch fabric internally non-blocking, i.e., no internal conflicts 👏

Requirements

Internal exchange with speed of connections ($S=1$)

Cycle time $Z = \frac{1}{2}$ (One packet in, one packet out)

Characteristics

🔴 Problem: Head-of-Line blocking

Waiting packet at head of the buffer blocks packet behind it that could be serviced

Suppose that in the buffer of $I1$ the 1st packet is destined for $O1$ and the 2nd packet for $O2$, and that the 1st packet is currently blocked. Then the 2nd packet cannot be processed either, although $O2$ is idle. In other words, the 1st packet **blocks** the 2nd packet.

Maximum throughput is 75% for $N = 2$ and 58.58% for $N \to \infty$
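The 75% figure for $N = 2$ can be reproduced with a small simulation. The model is an assumption of this sketch: all input queues are always full (saturation), destinations are uniformly random, and each output serves one randomly chosen contender per cycle.

```python
import random


def hol_throughput(n_ports, slots=100_000, seed=1):
    """Fraction of output capacity used by a saturated input-buffered switch."""
    rng = random.Random(seed)
    # Destination of the head-of-line packet at each input
    hol = [rng.randrange(n_ports) for _ in range(n_ports)]
    served = 0
    for _ in range(slots):
        # Group inputs by the output their head-of-line packet wants
        contenders = {}
        for inp, out in enumerate(hol):
            contenders.setdefault(out, []).append(inp)
        for out, inputs in contenders.items():
            winner = rng.choice(inputs)               # one packet per output and cycle
            hol[winner] = rng.randrange(n_ports)      # next packet moves to the head
            served += 1
            # The losers keep their head-of-line packet and block the queue behind it
    return served / (n_ports * slots)


throughput_2 = hol_throughput(2)
print(round(throughput_2, 3))
```

For larger `n_ports` the simulated value should drift towards the limit $2 - \sqrt{2} \approx 0.586$ quoted above.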

Output buffer

💡 Idea: conflict resolution at output of switch fabric

FIFO buffer per output port

Switch fabric internally non-blocking, i.e., no internal conflicts

Requirements

Internal switching of packets at $N$ times the speed of the input ports:
$$ S = N $$

Switch fabric internally non-blocking

$\rightarrow$ $N$ inputs must be processed at the same time (simultaneously)

Switching of $N$ packets during one cycle possible $\Rightarrow$
$$ Z = \frac{1}{N + 1} $$

In worst case, a buffer must take $N$ packets in and send one packet out.

Output buffer must be able to accept packets at $N$ times the speed

Input buffer necessary to accept a packet

Characteristics

Maximum throughput near 100%, usually at approx. 80-85%

Good behavior with respect to delay and jitter

Distributed buffer

💡 Idea: conflict resolution inside switch fabric

Switch fabric as matrix

FIFO buffer per crosspoint

Requirements

Matrix structure

Internal exchange with speed of connections: $S = 1$

Cycle time: $Z = \frac{1}{2}$

Characteristics

No Head-of-Line blocking 👏

Higher memory requirement $M$ than input or output buffering 🤪

Central buffer

💡 Idea: conflict resolution with shared buffer

All input and output ports are connected to a shared buffer (organization: RAM)

Requirements

Cycle time $Z = \frac{1}{2N}$

Address and control memory for address information of packets and control of parallel memory accesses

Characteristics

Significantly lower memory requirements

But: requirements with respect to memory access time are higher 🤪

Buffer placement summary

Switch fabric

Four typical basic structures

Shared memory

Bus / ring structure

Crossbar

Multi-level switching networks

Evaluation

The internal blocking behavior (Blocking / non-blocking)

The presence of buffers (Buffered / unbuffered)

Topology and number of levels of the switching network and number of possible routes

The control principle for packet routing (Self-controlling / table-controlled)

The internal connection concept (Connection oriented / connectionless)

Bus or ring structure

💡 Idea

Conflict-free access through time-division multiplexing

Transmission capacity bus / ring

At least the sum of the transmission capacities of all input ports

Characteristics

Easy support for multicast and broadcast

Spatial extension of a bus system is limited. Usually low number of connections (up to approx. 16)

Crossbar

💡 Idea: Each input connected to each output via crossbar

$N$ inputs, $N$ outputs $\Rightarrow$ $N^2$ crosspoints

Characteristics

Partial parallel switching of packets possible

Multiple packets for the same output $\rightarrow$ Blocking $\to$ Buffering required

High wiring costs with a large number of inputs and outputs

Mostly limited to 2x2 or 16x16 matrices

Especially efficient with packets of the same size

Multi-level Switching Networks

Multilevel connection networks can be built from the switching states of an elementary switching matrix, e.g.,

Characteristics

Less wiring effort than crossbar

Each input can be connected to each output

Not all connections possible at the same time

internal blocking possible

Self-test

What are important responsibilities of the network layer?

Which basic operations are usually performed by an IP router in order to forward a packet to its destination?

Why are high link-speeds such a big problem for modern forwarding hardware?

How does longest prefix matching work in general?

What are efficient (software) data structures for handling longest prefix matching and how do they work?

In what way can hash tables support a trie-based address lookup?

What is a TCAM?

What are the main benefits and problems of the TCAM technology?

What does the introduced generic router architecture look like?

Where can buffer elements be placed inside a switch? What are the associated benefits and drawbacks?

Random Variable

Sat, 04 Jun 2022 00:00:00 +0000

Random Variables

On the SI exercise sheets, random variables are denoted by small bold letters, e.g. $X$.

This notation is not used in the handwritten notes, so random variables and “ordinary” variables usually have to be told apart from context. 🤪

A random variable is a function that assigns exactly one number $x$ to every outcome $\omega$ of a random experiment.

It thus maps the outcomes of a random experiment to real numbers

It describes, so to speak, the result of a random experiment that has not been carried out yet

It is called a variable because the number you end up with is variable.

‼️Important: distinguish between $X$ and $x$.

$X$: the actual random variable, which has no fixed value. It represents the as yet unknown result of a random experiment

$x$: the result after the experiment, and therefore a concrete number.

Example: rolling 2 dice

Random variable $X$ = sum of the pips

$P(X = 6)$: “The probability that the sum of two dice is six” (here $x=6$)
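The two-dice example can be enumerated directly. This sketch assumes two fair, distinguishable dice, so that all 36 ordered outcomes are equally likely; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered pairs (first die, second die)
outcomes = list(product(range(1, 7), repeat=2))


def X(omega):
    """Random variable: maps an outcome to the sum of the pips."""
    return omega[0] + omega[1]


def prob_X_equals(x):
    """P(X = x) under the uniform distribution on the 36 outcomes."""
    favourable = [w for w in outcomes if X(w) == x]
    return Fraction(len(favourable), len(outcomes))


p6 = prob_X_equals(6)
print(p6)
```

Here `X` is the function (the random variable), while the argument `6` plays the role of the concrete value $x$.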

Discrete Random Variable

A random variable is called discrete if it takes only finitely many or countably infinitely many values.

Scale types: nominal or ordinal scale

“Countably infinite” means that the set of values can be enumerated.

Example: The result of a die roll is $x \in \Omega = \\{1, 2, 3, 4, 5, 6\\}$, so $|\Omega| = 6$.

Probability Mass Function

For discrete random variables one determines the probability mass function (PMF, German: Wahrscheinlichkeitsfunktion), which gives the probability of one specific result.
$$ f(x): \Omega \rightarrow[0,1], x \in \mathbb{N}_{0} $$
The function value
$$ f(x) = P(X=x) $$
is the probability that $X$ takes the value $x$. Therefore
$$ \sum_{x \in \Omega} f(x)=1 $$

For the “density” of a discrete random variable with individual probabilities $p_n = P(X = x_n)$ one also writes
$$ f_{X}(x)=\sum_{n=1}^{\infty} \mathrm{P}\left(X=x_{n}\right) \delta\left(x-x_{n}\right)=\sum_{n=1}^{\infty} p_{n} \delta\left(x-x_{n}\right) $$

$\delta(\cdot)$: delta distribution

Distribution Function

The distribution function (cumulative distribution function, CDF; German: Verteilungsfunktion) gives the probability that the result of the random experiment is less than or equal to a given value.

To this end, all results up to this value are aggregated, i.e. “added up”. This is why it is often called the cumulative distribution function.

To obtain the discrete distribution function, all probability values are accumulated step by step. In other words, one sums the probability mass function up to $x$ (the discrete counterpart of integrating a density).
$$ F(x): \Omega \rightarrow[0, 1], x \in \mathbb{N}_{0} $$ $$ F(x)= P(X \leq x) = \sum_{x_{i} \leq x} f\left(x_{i}\right) $$
Eigenschaften

$\lim _{x \rightarrow-\infty} F_{X}(x)=0 ; \lim _{x \rightarrow \infty} F_{X}(x)=1$

$F(X)$ ist monoton steigend und rechtseitig stetig

Beispiel

Würfelwurf:

Wahrscheinlichkeitsfunktion:
$$ f(X=k) = \frac{1}{6} \quad k \in \\{1, 2, 3, 4, 5, 6\\} $$
Verteilungsfunktion:
$$ F(3) = P(X \leq 3) = \sum_{i\leq 3}f(X=i) = \frac{1}{3} + \frac{1}{3} + \frac{1}{3} $$

In der SI Vorlesung sowie Übung wird die Verteilungsfunktion der Zufallsvariable $X$ als $F_{X}(x)$ schreiben.

Differenz zwischen kumulativer Wahrscheinlichkeiten:
$$ F(b) - F(a) = P(a < x \leq b) = P(x\leq b) - P(x \leq a) $$
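A minimal sketch of the relation between PMF and CDF for a fair die, using exact fractions. The names `pmf` and `cdf` are chosen for this illustration only.

```python
from fractions import Fraction

# PMF of a fair die: f(k) = 1/6 for k in {1, ..., 6}.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def cdf(x):
    """F(x) = P(X <= x): sum of f(x_i) over all x_i <= x."""
    return sum((p for k, p in pmf.items() if k <= x), Fraction(0))

total = sum(pmf.values())   # the PMF sums to 1
f3 = cdf(3)                 # 1/6 + 1/6 + 1/6 = 1/2
between = cdf(5) - cdf(2)   # P(2 < X <= 5) = P(X in {3, 4, 5}) = 1/2
```

Because the CDF is a step function, `cdf(3.5)` returns the same value as `cdf(3)`: nothing is added between two possible values of $X$.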
Continuous Random Variable

A continuous random variable

takes uncountably many values, i.e. infinitely many values that cannot be enumerated.

mostly the case for measurements (e.g. time, length, or temperature)

Scale types: interval or ratio scale

For continuous random variables we can determine probabilities only for intervals and NOT for exact values.

After all, there are infinitely many values, so it is impossible to pin down one exact outcome.

e.g.

“What is the probability that a randomly chosen student is between 165 cm and 170 cm tall?”

In the continuous case, the cumulative distribution function is used to compute probabilities.

Probability Density Function

The probability density function (PDF, German: Dichtefunktion), or density for short, describes how densely the values under consideration lie around any given point.
$$ f(x): \mathbf{\Omega} \rightarrow \mathbb{R}^{+} $$

Properties of $f$:

$$ \begin{array}{l} f \text{ is integrable}\\ f(x) \geq 0 \quad \forall x \in \mathbb{R} \\ \displaystyle \int_{-\infty}^{+\infty} f(x) \mathrm{d} x=1 \end{array} $$

Difference from the probability mass function

The density function does not yield a probability, but ONLY the “probability density”.

For a continuous random variable, which has uncountably many possible values, the probability of each specific value is 0
$$ P(X=x) = 0 \quad \forall x \in \mathbb{R} $$

The probability that $X$ takes a value $x \in [a, b]$ equals the area $S$
$$ P(a \leq X \leq b)=\int_{a}^{b} f(x) \mathrm{d} x=S $$
In the SI lecture and exercises, the density function of the random variable $X$ is written as $f_{X}(x)$.

Cumulative Distribution Function
$$ F(x): \Omega \rightarrow[0,1], x \in \mathbb{R} $$ $$ F(x)=\int_{-\infty}^{x} f(t) \mathrm{d} t, \quad f(x)=\frac{\mathrm{d} F(x)}{\mathrm{d} x} $$
The distribution function is the area under the density function up to a given point $c$:
$$ F(c)=P(X \leq c)=\int_{-\infty}^{c} f(x) \mathrm{d} x $$
The difference between two values of the distribution function is therefore:
$$ F(b)-F(a)=P(a \leq X \leq b)=\int_{a}^{b} f(x) \mathrm{d} x $$
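The height question from above can be sketched numerically. The model here is purely an assumption for illustration: heights are taken to be normally distributed with mean 168 cm and standard deviation 6 cm (made-up parameters). `NormalDist` is from Python's standard `statistics` module (Python 3.8+).

```python
from statistics import NormalDist

# Hypothetical model: heights in cm, normal with mean 168 and std. dev. 6.
height = NormalDist(mu=168.0, sigma=6.0)
a, b = 165.0, 170.0

# P(a <= X <= b) = F(b) - F(a)
p_from_cdf = height.cdf(b) - height.cdf(a)

# The same probability as the area under the density (midpoint rule).
n = 10_000
width = (b - a) / n
p_from_pdf = sum(height.pdf(a + (i + 0.5) * width) for i in range(n)) * width
```

Both values agree up to the numerical integration error, which is the statement $F(b)-F(a)=\int_a^b f(x)\,\mathrm{d}x$ in code. Note that `height.pdf(168.0)` is a density value, not a probability.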
Density Function vs. Distribution Function

The density function describes how the probability is distributed over the individual values.

Distribution function

Accumulates the probabilities $\rightarrow$ determines the probability for an interval

Yields the probability that an outcome $\leq$ a given value occurs

Discrete vs. Continuous Random Variable

| Random variable | Discrete | Continuous |
| --- | --- | --- |
| Example | Die throw | Time, temperature |
| Probability of a specific point | $P(X=x) \in [0, 1]$ | ONLY for intervals ($P(X=x) = 0$) |
| Probability mass function / density function | Probability mass function: $f(x): \Omega \rightarrow[0,1], x \in \mathbb{N}_{0}$; $f(x) = P(X=x)$; $\sum_{x \in \Omega} f(x)=1$ | Density function: $f(x): \Omega \rightarrow \mathbb{R}^{+}$; $f$ is integrable; $f(x) \geq 0 \quad \forall x \in \mathbb{R}$; $\displaystyle \int_{-\infty}^{+\infty} f(x) \mathrm{d} x=1$ |
| Cumulative distribution function | $F(x): \Omega \rightarrow[0,1], x \in \mathbb{N}_{0}$; $F(x)= P(X \leq x) = \sum_{x_{i} \leq x} f\left(x_{i}\right)$ | $F(x): \Omega \rightarrow[0,1], x \in \mathbb{R}$; $F(x)=\int_{-\infty}^{x} f(t) \mathrm{d} t, \quad f(x)=\frac{\mathrm{d} F(x)}{\mathrm{d} x}$ |

Note: The *“density”* of a discrete random variable with individual probabilities $p_n = P(\boldsymbol{x} = x_n)$ can also be written as $$ f_{\boldsymbol{x}}(x)=\sum_{n=1}^{\infty} \mathrm{P}\left(\boldsymbol{x}=x_{n}\right) \delta\left(x-x_{n}\right)=\sum_{n=1}^{\infty} p_{n} \delta\left(x-x_{n}\right), $$
where $\delta(\cdot)$ is the delta distribution. With this, the following relation holds for continuous as well as discrete random variables:
$$ \frac{\mathrm{d}}{\mathrm{d} x} F_{\boldsymbol{x}}(x) = f_{\boldsymbol{x}}(x). $$
Characteristic Values of Random Variables

Expected Value

Expected value (also mean): the average obtained when an experiment is repeated infinitely often
$$ E_{f_X}\{X\} = \hat{X} = \mu_{X} = \int_{-\infty}^{\infty} x f_{X}(x) d x $$

Notation: $\mu$, $E(X)$, $E\[X\]$, $E\\{X\\}$

Calculation rules
$\mathrm{E}_{f_{X}}\{aX + b\}=a \mathrm{E}_{f_{X}}\{X\}+b$

Proof

$$ \begin{array}{ll} &\mathrm{E}\_{f\_{X}}\\{a X+b\\} \\\\ =&\int\_{-\infty}^{\infty}(a x+b) f\_{X}(x) \mathrm{d} x \\\\ =&a \int\_{-\infty}^{\infty} x f\_{X}(x) \mathrm{d} x+b \int\_{-\infty}^{\infty} f\_{X}(x) \mathrm{d} x \\\\ =&a \cdot \mathrm{E}\_{f_{X}}\\{X\\}+b \cdot 1 \end{array} $$

More rules:

Basic expectation rules. (Source: kalmanfilter.net)

$k$-th Moment

The expected value
$$ \mathrm{E}_{f_X}\left\{X^{k}\right\}=\int_{-\infty}^{\infty} x^{k} f_{X}(x) \mathrm{d} x $$
is the $k$-th moment of the random variable $X$.

The expected value
$$ \mathrm{E}_{f_X}\left\{\left(X-\mathrm{E}\{X\}\right)^{k}\right\}=\int_{-\infty}^{\infty}\left(x-\mu_{X}\right)^{k} f_{X}(x) \mathrm{d} x $$
is the $k$-th central moment of the random variable $X$.

Variance

Variance := the expected squared deviation from the expected value
$$ E_{f_X}\{(X - \mu_X)^2\} = \operatorname{Var}(X) = \sigma_X^2 $$

the second central moment

The larger the variance, the more widely the values are spread around $E(X)$

Notation: $\sigma^2$, $\operatorname{Var}(X)$, $\operatorname{Var}\[X\]$

Calculation rules
$\operatorname{Var}_{f_X}\{aX+b\} = a^2 \operatorname{Var}_{f_X}\{X\}$

Proof

$$ \begin{array}{ll} &\operatorname{Var}\_{f\_{X}}\\{a X+b\\} \\\\ =&\mathrm{E}\_{f\_{X}}\left\\{\left(a X+b-\mathrm{E}\_{f\_{X}}\\{a X+b\\}\right)^{2}\right\\} \\\\ =&\mathrm{E}\_{f\_{X}}\left\\{\left(a X+b-\left(a \mu\_{X}+b\right)\right)^{2}\right\\}\\\\ =&\mathrm{E}\_{f\_{X}}\left\\{\left(a\left(X-\mu\_{X}\right)\right)^{2}\right\\} \\\\ =&\int\_{-\infty}^{\infty}\left(a\left(x-\mu\_{X}\right)\right)^{2} f\_{X}(x) \mathrm{d} x \\\\ =&a^{2} \int\_{-\infty}^{\infty}\left(x-\mu\_{X}\right)^{2} f\_{X}(x) \mathrm{d} x \\\\ =&a^{2} \mathrm{E}\_{f\_{X}}\left\\{\left(X-\mu\_{X}\right)^{2}\right\\} \\\\ =&a^{2} \operatorname{Var}\_{f\_{X}}\\{X\\} \end{array} $$

$\operatorname{Var}_{f_{X}}\{X\}=\mathrm{E}_{f_{X}}\left\{X^{2}\right\}-\left(\mathrm{E}_{f_{X}}\{X\}\right)^{2}$

Proof

$$ \begin{aligned} \operatorname{Var}\_{f\_{X}}\\{X\\}=& \int\_{-\infty}^{\infty}\left(x-\mathrm{E}\_{f\_{X}}\\{X\\}\right)^{2} f\_{X}(x) \mathrm{d} x \\\\ =& \int\_{-\infty}^{\infty}\left(x-\mu\_{X}\right)^{2} f\_{X}(x) \mathrm{d} x \\\\ =& \int\_{-\infty}^{\infty}\left(x^{2}-2 x \mu\_{X}+\mu\_{X}^{2}\right) f\_{X}(x) \mathrm{d} x \\\\ =& \int\_{-\infty}^{\infty} x^{2} f\_{X}(x) \mathrm{d} x-2 \mu\_{X} \int\_{-\infty}^{\infty} x f\_{X}(x) \mathrm{d} x+\mu\_{X}^{2} \int\_{-\infty}^{\infty} f\_{X}(x) \mathrm{d} x \\\\ =& \mathrm{E}\_{f\_{X}}\left\\{X^{2}\right\\}-2 \mu\_{X} \mathrm{E}\_{f\_{X}}\\{X\\}+\mu\_{X}^{2} \cdot 1 \\\\ =& \mathrm{E}\_{f\_{X}}\left\\{X^{2}\right\\}-2 \mu\_{X} \mu\_{X}+\mu\_{X}^{2} \cdot 1 \\\\ =& \mathrm{E}\_{f\_{X}}\left\\{X^{2}\right\\}-\mu\_{X}^{2} \end{aligned} $$

More rules:

Basic variance and covariance rules. (Source: kalmanfilter.net)

Proof of rule 10

Proof of rule 11

Proof of rule 13

Proof of rule 14

Standard Deviation

Standard deviation: a measure of spread that has the same unit as $X$
$$ \sigma=\sqrt{\operatorname{Var}(X)} $$
Large $\sigma$ $\rightarrow$ large spread

| Random variable | Discrete | Continuous |
| --- | --- | --- |
| Expected value ($\mu$, $E(X)$) | $\sum_{i \in \Omega} x_{i} \cdot p_{i}$ | $\int_{-\infty}^{+\infty} x \cdot f(x) \mathrm{d} x$ |
| Variance ($\sigma^2$, $\operatorname{Var}(X)$) | $\sum_{i \in \Omega}\left(x_{i}-\mu\right)^{2} \cdot p_{i}$ | $\int_{-\infty}^{+\infty}(x-\mu)^{2} \cdot f(x) \mathrm{d} x$ |
| Standard deviation ($\sigma$) | $\sqrt{\operatorname{Var}(X)}$ | $\sqrt{\operatorname{Var}(X)}$ |
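The discrete formulas can be checked on a fair die with exact fractions. This is a small sketch; the helper `expectation` and the constants `a = 2`, `b = 3` are made up for the illustration.

```python
from fractions import Fraction

# Fair die: values x_i with probabilities p_i = 1/6.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def expectation(pmf, g=lambda x: x):
    """E{g(X)} = sum of g(x_i) * p_i over all possible values x_i."""
    return sum(g(x) * p for x, p in pmf.items())

mu = expectation(pmf)                                # 7/2
var = expectation(pmf, lambda x: (x - mu) ** 2)      # 35/12
var_shortcut = expectation(pmf, lambda x: x * x) - mu ** 2

# Affine transformation Y = aX + b with hypothetical a = 2, b = 3.
a, b = 2, 3
pmf_y = {a * x + b: p for x, p in pmf.items()}
mu_y = expectation(pmf_y)
var_y = expectation(pmf_y, lambda y: (y - mu_y) ** 2)
```

The results reproduce the calculation rules: `mu_y == a * mu + b`, `var_y == a**2 * var`, and `var_shortcut == var`, i.e. $\operatorname{Var}\{X\} = \mathrm{E}\{X^2\} - (\mathrm{E}\{X\})^2$.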

Normally Distributed Random Variable

A normally distributed random variable $X$ has the density
$$ f_{X}(x)=\mathcal{N}\left(x-\mu, \sigma^{2}\right)=\frac{1}{\sqrt{2 \pi} \sigma} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}} $$
Its $k$-th central moment is in general
$$ \mathrm{E}_{f_{X}}\left\{(X-\mu)^{k}\right\}=\left\{\begin{array}{ll} 1 \cdot 3 \cdot 5 \cdots(k-1) \sigma^{k} & \text { if } k \text { is even } \\ 0 & \text { if } k \text { is odd } \end{array}\right. $$
The normal distribution is thus completely characterized by $\mu$ and $\sigma$.

Standardized Random Variable

A random variable $X$ with expected value $\mu_X = E_{f_X}\{X\}$ and variance $\sigma_X^2$ is transformed by
$$ Y = \frac{X - \mu_X}{\sigma_X} $$
into a standardized random variable $Y$, which has expected value 0 and variance 1.
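Both properties follow from the calculation rules $\mathrm{E}_{f_X}\{aX+b\} = a\,\mathrm{E}_{f_X}\{X\} + b$ and $\operatorname{Var}_{f_X}\{aX+b\} = a^2 \operatorname{Var}_{f_X}\{X\}$ with $a = \frac{1}{\sigma_X}$ and $b = -\frac{\mu_X}{\sigma_X}$ (assuming $\sigma_X > 0$):

$$ \begin{aligned} \mathrm{E}_{f_X}\{Y\} &= \frac{1}{\sigma_X} \mathrm{E}_{f_X}\{X\} - \frac{\mu_X}{\sigma_X} = \frac{\mu_X - \mu_X}{\sigma_X} = 0 \\ \operatorname{Var}_{f_X}\{Y\} &= \frac{1}{\sigma_X^{2}} \operatorname{Var}_{f_X}\{X\} = \frac{\sigma_X^{2}}{\sigma_X^{2}} = 1 \end{aligned} $$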

Mode, Quantile, Median

A value at which the density function $f_X(x)$ attains a local maximum is called a mode of the continuous random variable $X$.

A value $x_p$ that satisfies the inequalities
$$ P(X < x_p) \leq p, \quad P(X > x_p) \leq 1 - p \quad (0 < p < 1) $$
is called a $p$-th quantile.

For a continuous random variable $X$, a $p$-th quantile $x_p$ is given by $F_X(x_p) = p$

A quantile of order $p=\frac{1}{2}$ is called a median of the random variable $X$

For normally distributed random variables, expected value, mode, and median coincide.
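For a continuous random variable, finding a quantile means inverting the distribution function. A short sketch using `NormalDist.inv_cdf` from Python's standard `statistics` module; the parameters $\mu = 2$, $\sigma = 3$ and the helper `quantile` are assumptions made for this illustration.

```python
from statistics import NormalDist

# Hypothetical parameters for illustration.
X = NormalDist(mu=2.0, sigma=3.0)

def quantile(dist, p):
    """p-th quantile x_p of a continuous variable: solves F(x_p) = p."""
    return dist.inv_cdf(p)

median = quantile(X, 0.5)   # coincides with the expected value mu
x_90 = quantile(X, 0.9)     # 90 % of the probability mass lies below x_90
```

Evaluating `X.cdf(x_90)` returns 0.9 again (up to rounding), and the density `X.pdf` is largest at `X.mean`, so median, mode, and expected value are the same point in this case.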


Internet Routing

Thu, 04 Mar 2021 00:00:00 +0000

Summary

Basics

Internet: network of networks

High-level View on an IP Router

Control Plane

Routing protocols

Exchange of routing messages for calculation of routes

Data Plane

Lookup

Forwarding of packets at layer 3

Routing table

Generated by routing protocol

Entries: Mapping of destination IP prefixes to next hop (IP address)

Optimized for the particular routing algorithm

Performance is not critical

Implemented in software

Forwarding table

Used for packet forwarding

Entries: Mapping of IP prefixes to outgoing ports (interface ID and MAC address)

Optimized for longest prefix matching

Performance is critical (lookup in line speed)!

Partially uses dedicated hardware
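The lookup in the forwarding table can be sketched as follows. This is a deliberately naive linear scan to show what "longest prefix matching" means; real routers use specialized data structures or hardware for line-speed lookups. All prefixes, interface names, and MAC addresses are made up.

```python
import ipaddress

# Hypothetical forwarding table: prefix -> (interface ID, next-hop MAC address).
forwarding_table = {
    ipaddress.ip_network("0.0.0.0/0"): ("eth0", "aa:aa:aa:aa:aa:01"),
    ipaddress.ip_network("10.0.0.0/8"): ("eth1", "aa:aa:aa:aa:aa:02"),
    ipaddress.ip_network("10.1.0.0/16"): ("eth2", "aa:aa:aa:aa:aa:03"),
}

def lookup(destination):
    """Return the entry of the most specific (longest) matching prefix."""
    address = ipaddress.ip_address(destination)
    matches = [net for net in forwarding_table if address in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)
    return forwarding_table[best]

print(lookup("10.1.2.3"))   # matches /0, /8 and /16; the /16 entry wins
```

An address such as `192.0.2.1` matches only the default route `0.0.0.0/0` and leaves via `eth0`.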

Routing metric (also named cost, weight)

Metric used by a router to make routing decision

Can be applied to an individual link or to the overall path

Examples

Utilization, latency, data rate

Number of hops

Routing policy

Policy-based routing decisions

Policies are defined by network operator / owner

Distributed Adaptive Routing

Currently most commonly used in the Internet

An instance of the routing protocol in each router

Exchange of routing information via routing messages

Adaptation of the paths to the current situation in the network

Path computation

Network is modeled as graph
$$ G = (N, E) $$

$N$: nodes (routers)

$E$: edges

Links between routers are edges

Edges are associated with metric

Example

Autonomous Systems

Structuring into autonomous systems

Internet routing can be divided into Autonomous Systems (AS)

Routing inside an autonomous system using Interior Gateway Protocol (IGP)

Routing between autonomous systems using Exterior Gateway Protocol (EGP)

Autonomous Systems

Identification: Unique number called Autonomous Systems Number (ASN)

earlier 16 bit; now 32 bit

Properties

Appears as a single entity to the outside

Uniform routing policy

Typically uniform interior routing protocol

Different ASes can use different interior routing protocols

👍 Advantages

Separated administrative domains

Scalability by using two logical levels

Routing protocol inside an AS (not global)

Routing protocol between ASes

Important Properties

Scalability of routing protocols

Overhead increases with size of the network 📈

Space for storing routing information

Number of routing messages to exchange

Computation overhead

Operator autonomy

Choice of interior routing protocol

Hiding of internal network structure

Allocation

IANA (Internet Assigned Numbers Authority) delegates allocation to Regional Internet Registries (RIR), e.g.,

ARIN (North America)

RIPE NCC (Europe, Middle East and Central Asia)

APNIC (Asia-Pacific)

LACNIC (Latin America, Caribbean)

AfriNIC (Africa)

Subdivision into ASes

Classification of ASes

Classification based on role

Stub AS

Small organizations and enterprises (Mostly operate only regionally)

Connected to exactly one provider

No transit traffic

Multihomed AS

Large enterprises

Connected to several providers (reliability)

No transit traffic

Transit AS

Provider (Often global scope)

Classification based on “economic position/influence”

Tier 1, tier 2, tier 3 …

Different Roles

End customer

Uses Internet application

Examples

Universities

Enterprises

Customers of Internet Service Providers (ISP)

Content delivery provider

Requested by end customers / Internet application

Provide content

Examples: Google, Akamai, Yahoo, YouTube, Facebook…

Reachability across autonomous systems

Reachability

Main problem

How to ensure mutual reachability?

Cooperation among autonomous systems?

Basic concepts

Transit: Purchased connectivity 💸

Peering: Direct connection, typically between ASes of the same tier

Connectivity and Transit

Establish connectivity

Establish paths to all other ASes in the Internet

AS operator purchases connectivity from one or more ASes

Transit

Internet transit is the service of allowing network traffic to cross or “transit” a computer network, usually used to connect a smaller Internet service provider (ISP) to the larger Internet. (wiki)

Purchased connectivity 💸

Upstream: provider (seller) of transit

Downstream: customer (buyer)

Traffic exchange

In both directions

Only the downstream AS must pay; usually charged by traffic volume

Transit AS: Provider AS that offers transit

Options for connecting a stub AS

Stub AS

Dualhomed stub AS

Multihomed stub AS

Peering

Private peering

Direct connection between two ASes, usually of same tier

No cost for traffic exchange; costs for network infrastructure apply

However

Mostly only data traffic between privately peered ASes

NO transit traffic of other ASes

Video explanation

Example: peering and transit combination

👍 Advantages

Benefits both ASes: save transit costs, that otherwise would apply

Shorter data paths: fewer AS hops between source and destination

🔴 Problems

Direct connection of ASes complicated (Different geographical locations)

Full mesh of $n$ ASes ($\frac{(n-1)n}{2}$ separate connections!)

Public peering

Through Internet exchange points (IXPs)

Central public authority for interconnection

Neutral traffic forwarding on layer 2

No differentiation regardless of customer, content, or type of service

Examples

DE-CIX (the world’s biggest IXP)

Members / customers: Monthly fixed charges per network port

Necessary for operation and maintenance of the IXP’s switching platform

Different peering policies

Open: AS is open for peering with all other ASes

Selective: Peering only under given terms and conditions

Restrictive: AS does not engage in new peering relationships

No Peering: AS does not do any peering

Autonomous Systems and Transit/Peering

Tier 1

Large global ASes with access to (all) other ASes

Do not buy any transit. Sell transit

Peering with other tier 1 ASes

Examples: Deutsche Telekom, AT&T…

Tier 2

Big national and inter-regional ASes

Connection to providers of Internet applications

Downstream of tier 1 ASes

Sell transit to other ASes

Usually employ peering

Examples: Vodafone, Comcast, Tele2

Tier 3

Small mostly regional ASes

Connections with small providers of Internet applications

Downstream of tier 2 providers

Usually do not sell transit to other ASes

Sell transit mostly to end customers/users

Usually employ peering

Examples: KabelBW, NETHINKS, Alice

Content Delivery Provider

🎯 Goal: FAST delivery of content (i.e. low latencies)

$\rightarrow$ Locations close to tier 1 peering points are preferred

Two basic alternatives

Web servers are hosted directly in tier 1 ASes (Does not require an own AS number)

Web servers are connected over own routers

Content delivery network (CDN)

Own AS number required

Peering with essential providers at important peering points

Examples: Google, Yahoo, Akamai

Content Delivery Network

World wide network with own AS number

Thousands of Points of Presence (PoP) spread across the world

Point of Presence

Consists of access routers and core routers

Access router at the edge of a CDN

Core router inside a CDN

Customers are connecting through access routers

Objectives

Load balancing at access routers

Be close to customers $\rightarrow$ low latencies

Routing in and between Autonomous Systems

Classification

Interior gateway protocols (IGPs) INSIDE one AS

A.k.a intra-domain routing protocols

Are encapsulated inside an AS, i.e., not visible to the outside

Different IGPs in different ASes possible

Metric-based

Exterior gateway protocols (EGPs) BETWEEN ASes

Also named inter-domain routing protocols

Single protocol between all ASes

Policy-based

RIP: Routing Information Protocol

Interior gateway protocol

Very simple protocol that requires very little configuration

RIP in the Protocol Stack

The application process “routed” (a daemon) implements RIP and manages the forwarding table

RIP routing messages are sent over UDP $\rightarrow$ NOT reliable

Routing Metric

Distance between source and destination = number of hops on the path (hop count)

Hop count

Refer to the number of intermediate devices through which data must pass between source and destination.

Each time that a packet of data moves from one router (or device) to another, that is considered one HOP.

An illustration of hops in a wired network. The hop count between the computers in this case is 2.

Limited range of values: 1 - 15

Value of 16 corresponds to “infinity”

RIP Routing Messages

RIP protocol entities exchange routing messages

UDP is used as transport protocol

Types of routing messages

Request message

Requests the complete routing table or part of it

Response message for different reasons

Response to specific query

Regular update

Triggered update

Routing Updates

Outgoing

Regular routing update

Periodically, every 30 seconds

Sends entire routing table to all its neighbors

Entries in the routing table are periodically refreshed

No refresh for at least 180 seconds? $\rightarrow$ Hop-Count is set to 16 („infinite“), corresponding route is invalidated

Metric of a route changes (triggered update)

Only changes since the last update are communicated, not the complete routing table

Rate limitation in order to reduce load on the network

Incoming

Entry for a destination address does not exist in routing table and received metric is not „infinite“ $\rightarrow$ Insert new entry in routing table

Current entry for a destination address in routing table has larger metric or routing update was sent by the “next router” for this destination $\rightarrow$ Modify entry

Otherwise $\rightarrow$ Ignore routing update

Example

Scenario

Connecting lines represent either direct links or LANs between routers

Ovals represent routers

We have the routing table of router D

30 seconds later D receives new routing update from Router A

A tells D: “Hey, now I can reach Z through 4 hops”.

I.e., now D can reach Z through $4+1=5$ hops

As 5 < 7 (the old number of hops to reach Z), D updates its routing table:
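The rules for incoming updates and the example above can be sketched in a few lines. This is a simplified illustration, not a complete RIP implementation: timers and triggered updates are left out, and the old next hop "B" of router D is an assumption, since the routing table is only shown in the figure.

```python
INFINITY = 16  # RIP encodes "unreachable" as a hop count of 16

def process_update(table, neighbor, advertised):
    """Apply a neighbor's update to `table` (destination -> (next hop, hops))."""
    for dest, hops in advertised.items():
        new_hops = min(hops + 1, INFINITY)  # one extra hop to reach the neighbor
        if dest not in table:
            if new_hops < INFINITY:
                table[dest] = (neighbor, new_hops)   # insert new entry
        else:
            next_hop, old_hops = table[dest]
            if new_hops < old_hops or next_hop == neighbor:
                table[dest] = (neighbor, new_hops)   # modify entry
            # otherwise: ignore the update for this destination
    return table

# Router D currently reaches Z in 7 hops; the old next hop "B" is assumed.
table_d = {"Z": ("B", 7)}
process_update(table_d, "A", {"Z": 4})
print(table_d)  # {'Z': ('A', 5)}
```

Note the second condition: an update from the current next hop is always applied, even if the metric gets worse, because that router is the one the traffic actually passes through.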

OSPF: Open Shortest Path First

OSPF Basics

Interior gateway protocol

Link state protocol

Each router in the network needs to learn complete topology of the network (Otherwise, calculated paths are inconsistent)

Topology = Nodes and links with their costs (weights)

Each router separately computes shortest paths based on network topology

Dijkstra shortest path algorithm
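The path computation each router performs on its copy of the topology can be sketched with a textbook Dijkstra implementation. The topology and its link costs below are invented for illustration, and the function returns only the path costs, not the next hops that a real routing table would also need.

```python
import heapq

def shortest_paths(graph, source):
    """Dijkstra: least total link cost from `source` to every reachable node."""
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > dist.get(node, float("inf")):
            continue  # stale queue entry, a cheaper path was already found
        for neighbor, link_cost in graph[node].items():
            new_cost = cost + link_cost
            if new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor))
    return dist

# Hypothetical topology: node -> {neighbor: link cost}.
topology = {
    "R1": {"R2": 1, "R3": 4},
    "R2": {"R1": 1, "R3": 2, "R4": 7},
    "R3": {"R1": 4, "R2": 2, "R4": 3},
    "R4": {"R2": 7, "R3": 3},
}
distances = shortest_paths(topology, "R1")
print(distances)  # {'R1': 0, 'R2': 1, 'R3': 3, 'R4': 6}
```

Since every router runs the same deterministic algorithm on the same topology, the resulting paths are consistent across the network, which is why identical topology knowledge matters.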

OSPF in the Protocol Stack

OSPF is located on top of IP $\rightarrow$ OSPF uses an unreliable communication service

Know the Neighbors

Each router

learns its neighbors and

monitors the state of the links to them

Link States of a Router

Router ID of neighbors: dynamically discovered by hello protocol

Availability: dynamically discovered by hello protocol or physical layer

Everything else is configured

Pre-Configuration of OSPF Router

Each router is pre-configured with the following parameters

Router ID: unique ID of a router in the network

Per-interface parameters

Interface IP address (and mask)

Interface output cost – metric

Typically, inversely proportional to link data rate

Routing Metric

Each link is associated with link costs

Example: prefer links with higher data rate
$$ \text { Cost }=\frac{\text { Reference Data Rate }}{\text { Interface Data Rate }} $$

$\text{Reference Data Rate}$ can be configured

E.g., to 1 $𝐺𝑏𝑖𝑡/𝑠$ or 10 $𝐺𝑏𝑖𝑡/𝑠$

Should be consistent across all routers in network
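The cost formula is easy to try out. In this sketch, truncating to an integer and enforcing a minimum cost of 1 are assumptions (the notes only give the ratio), and the function name `ospf_cost` is made up.

```python
def ospf_cost(interface_rate_bps, reference_rate_bps=10**9):
    """Cost = reference data rate / interface data rate.

    Assumption: the result is truncated to an integer and never drops below 1,
    so links faster than the reference rate still get a valid cost.
    """
    return max(1, reference_rate_bps // interface_rate_bps)

# With a reference rate of 1 Gbit/s:
print(ospf_cost(100 * 10**6))   # 100 Mbit/s link -> 10
print(ospf_cost(10**9))         # 1 Gbit/s link   -> 1
print(ospf_cost(10 * 10**9))    # 10 Gbit/s link  -> 1 (indistinguishable)
```

Under these assumptions, a reference rate of 1 Gbit/s cannot distinguish a 1 Gbit/s link from a 10 Gbit/s link. Raising the reference rate to 10 Gbit/s on all routers gives the two links costs 10 and 1.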

Link State Advertisement (LSA)

Each router constructs router link state advertisements (LSAs)

Router LSAs consist of information about its neighbors and links

Example

Router floods its LSA on all its interfaces $\rightarrow$ All routers in the network must receive an identical copy of this LSA

Link State Database

Each router maintains a link state database

Stores most recent LSAs from all other routers in the network

Link state database is used to

Construct topology graph of the network

Calculate routing table

Routers have identical knowledge of network topology iff their link state databases are synchronized, i.e., they have identical content at all routers.

Initial Synchronization of link state database

(Re-)start of a router

New router has an empty link state database

Initial database synchronization

Router asks neighboring router to share its database

Performed immediately after a “handshake” of the hello protocol

Routers exchange LSA headers with each other

If an LSA is missing it is requested from the neighbor router

$\rightarrow$ the routers are now considered as adjacent to each other

Link State Advertisement

Example

Each LSA is associated with a lifetime (LS Age)

Set to “0” by advertising router

When flooded, incremented by transmission delay (estimated value)

As LSA is stored in database, Age is incremented over time

When LSA’s age reaches MaxAge, LSA is considered out-of-date

MaxAge is set to 1 hour

Consequence: routers must refresh their LSAs every LSRefreshTime

LSRefreshTime is set to 30 minutes

Minimum interval between two generations of any particular LSA: 5 seconds

Hello Protocol

🎯 Goals

Ensure bi-directional communication between neighboring OSPF routers

Establish and maintain logical adjacencies

Determines identity and liveness of neighboring routers

Hello Message

Contains own router ID

Contains router ID of neighboring router, if known

If not yet known $\rightarrow$ router ID is set to 0.0.0.0

If own router ID is contained in neighbor’s hello message $\rightarrow$ Communication is considered to be bi-directional

Destination IP address of hello message: 224.0.0.5 (multicast address, “AllSPFRouters”)

$\rightarrow$ hello message is received and processed only by OSPF routers

Simplified Workflow

A router periodically sends a hello message on all its links

“Hello, I am R1, I am still here”

If known, hello message contains router ID of neighboring router

“my neighbor on this link is R2”

If no hello message is received for a pre-defined period of time $\rightarrow$ the link is considered to be down 🤪

Standard value for periodic hello messages: every 10-30 seconds

Fast hello extension: < 1 second
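The hello logic can be sketched as a small per-neighbor state object. This is a simplification for illustration: the class `Neighbor`, the representation of the hello message as a list of router IDs its sender has heard from, and the dead interval of 40 seconds are all assumptions, not taken from the notes.

```python
class Neighbor:
    """State kept for one neighboring router on a link (simplified)."""

    def __init__(self, router_id):
        self.router_id = router_id
        self.last_hello = None
        self.bidirectional = False

    def on_hello(self, own_id, seen_neighbors, now):
        """Process a hello listing the router IDs its sender has heard from."""
        self.last_hello = now
        # Our own ID inside the neighbor's hello proves it can hear us too.
        self.bidirectional = own_id in seen_neighbors

    def is_down(self, now, dead_interval=40.0):
        """Link counts as down if no hello arrived within `dead_interval` seconds."""
        return self.last_hello is None or now - self.last_hello > dead_interval

r2 = Neighbor("R2")
r2.on_hello("R1", [], now=0.0)        # R2 does not know R1 yet
first = r2.bidirectional              # False
r2.on_hello("R1", ["R1"], now=10.0)   # "my neighbor on this link is R1"
second = r2.bidirectional             # True
```

After the second hello, R1 may treat the communication as bi-directional. If no further hello arrives, `r2.is_down(now=60.0)` becomes true.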

OSPF Message

Header of OSPF messages

Version: OSPF Version, currently 2 for IPv4 and 3 for IPv6

Type

Hello

database description

link state request

link state update

link state acknowledgement

Router ID: ID of originating router

Area ID: OSPF area

Checksum: Internet checksum over entire OSPF message

AUType and Authentication: Optional authentication of originating router

Link State Update Message

Structure of a Link State Advertisement

Consists of a header and a body

LSA header: contains information used to uniquely identify the LSA

Advertising router

Sequence number of LSA at advertising router

…

LSA body: contains information of all operational links of the router

Associated cost

Type of link

Reachability information

…

LSA Header

LS Age: Time in seconds since LSA was originated

Options: Optional capabilities supported by OSPF domain

LS Type: Router LSA, network LSA …

Link State ID: Uniquely identifies an LSA

Advertising Router: OSPF router ID of originating router

LS Sequence Number: Incremented each time a new LSA is generated

Checksum: Over entire LSA except the age field

Length: Number of bytes of the entire LSA including header

Coping with Dynamic Changes

Issuing LSAs

If nothing changes (link, router), nothing needs to be reported with respect to routing $\rightarrow$ keep quiet

LSAs are refreshed every 30 minutes

Besides periodic refreshes, communication is only needed in case of changes

Interface changed to up or down

Neighboring router on link is unreachable

Configuration changes

Minimum time between two consecutive LSAs of a router is set to 5 seconds (Due to stability reasons)

Synchronized Link State Databases

🎯 Goal: link state databases of all routers need to have identical content (need to be synchronized)

Following actions are needed

Ensure that each LSA is received by every router in the network (reliable flooding)

Ensure that all routers consistently either store or discard each LSA $\rightarrow$ fully deterministic comparison rules

Ensure that expired LSAs are deleted from link state databases of every router

Reliable Flooding

Reception of each LSA is acknowledged by neighboring router

Hop-by-hop acknowledgements

Router R stores received LSA

If R does not have an LSA from the advertising router

If the received LSA is newer than the one in the database

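    def is_new_LSA(received_LSA, cur_stored_LSA, MAX_AGE=3600):
        """Return True if the received LSA should replace the stored copy.

        MAX_AGE is given in seconds (MaxAge = 1 hour).
        """
        # A higher sequence number always wins.
        if received_LSA.sequence_Nr > cur_stored_LSA.sequence_Nr:
            return True
        # Same sequence number: checksum and age act as tie-breakers.
        if received_LSA.sequence_Nr == cur_stored_LSA.sequence_Nr:
            return (received_LSA.checksum > cur_stored_LSA.checksum
                    or cur_stored_LSA.age == MAX_AGE
                    or received_LSA.age < cur_stored_LSA.age)
        # Lower sequence number: the stored copy is newer.
        return False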

If R stores the LSA, it forwards it to its neighbors

Uses multicast address 224.0.0.5 with hop limit of 1

LSA Flooding Example

Router R receives LSA from advertising router R1

If the age of the received LSA equals MaxAge and no LSA from R1 is known

$\rightarrow$ Send ACK and discard

If there is no LSA from R1 in database or received LSA is newer $\rightarrow$

Store/replace LSA

Send ACK

Update Age and flood LSA to neighbors

If already stored copy is newer

$\rightarrow$ Send stored copy back to advertising router R1

If LSA and stored copy are the same

$\rightarrow$ Discard LSA

Re-compute routes if content of link state database changed
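The case distinction of the flooding example can be sketched as a function that returns the actions router R takes. This is a strongly simplified illustration: "newer" is decided by the sequence number alone, the database holds one LSA per advertising router, and the names `LSA` and `handle_lsa` are hypothetical.

```python
from dataclasses import dataclass

MAX_AGE = 3600  # seconds (1 hour)

@dataclass
class LSA:
    advertising_router: str
    sequence_nr: int
    age: int = 0

def handle_lsa(database, received):
    """Return the list of actions router R takes for a received LSA."""
    stored = database.get(received.advertising_router)
    if stored is None and received.age >= MAX_AGE:
        return ["send ACK", "discard"]
    if stored is None or received.sequence_nr > stored.sequence_nr:
        database[received.advertising_router] = received
        return ["store", "send ACK", "flood to neighbors", "recompute routes"]
    if stored.sequence_nr > received.sequence_nr:
        return ["send stored copy back"]
    return ["discard"]  # the same instance is already stored

db = {}
print(handle_lsa(db, LSA("R1", sequence_nr=5)))  # stored, acknowledged, flooded
print(handle_lsa(db, LSA("R1", sequence_nr=4)))  # ['send stored copy back']
print(handle_lsa(db, LSA("R1", sequence_nr=5)))  # ['discard']
```

The "discard" branch for an identical copy is what terminates flooding: an LSA that returns over a different path is not forwarded again.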

OSPF Areas

Basic situation: Autonomous systems can grow rather large

Scalability problem

LSA flooding and

Route computation overhead

$\rightarrow$ do NOT scale 🤪

🔧 Solution: Divide an AS into areas (i.e., introduce additional level of hierarchy)

Apply routing only within an area

LSA flooding and route computation limited to an area

Only routers within the same area have identical link state databases.

Areas exchange summary information with each other

Addresses reachable from these areas

Typical size of an area: <100 routers

OSPF Areas structure

Two levels of hierarchy

Area 0 – backbone of the autonomous system

Backbone must always be connected

All other areas are directly connected to backbone

Area border routers (ABRs) interconnect areas

They belong to both: their area and the backbone

They run an instance of OSPF for each area they are connected to

They generate summary LSAs

Contain ABR’s routing table for corresponding area

List of destinations reachable within the area

Associated with path cost from the ABR to destination

ABR’s routing table is constructed after intra-area path computation

Handle summary LSAs: Same way as “regular” LSAs

ABRs forward summary LSAs of an area into the backbone

ABRs forward summary LSAs from the backbone into the area

Inter-Area Forwarding

Data between areas are forwarded through backbone (area 0)

End-to-end path consists of path segments

Segment between source and ABR of originating area

Segment between two ABRs in area 0, and

Segment between ABR of target area and destination

Routers within an area select ABRs so that resulting end-to-end path is a shortest path

Based on path costs of ABRs
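The choice of the ABR can be sketched as a simple minimization. The reading assumed here is that a router adds its own intra-area cost to each ABR and the cost that ABR advertises for the destination; the router names and cost values are invented.

```python
def select_abr(cost_to_abr, advertised_cost):
    """Pick the ABR that minimizes (cost to ABR) + (ABR's advertised cost)."""
    return min(cost_to_abr, key=lambda abr: cost_to_abr[abr] + advertised_cost[abr])

# Hypothetical values as seen by one router inside an area.
cost_to_abr = {"ABR1": 2, "ABR2": 5}       # from intra-area path computation
advertised_cost = {"ABR1": 10, "ABR2": 4}  # from summary LSAs

best = select_abr(cost_to_abr, advertised_cost)
print(best)  # ABR2: total cost 9 instead of 12 via ABR1
```

The nearest ABR is therefore not necessarily the one that is chosen; the total end-to-end cost decides.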

Example

RIP vs. OSPF

RIP: based on distance vector

🔴 Problems

Limited in metric selection and size

Only one metric (hop count)

Maximum path length of 15 hops

Periodic updates every 30 seconds, even without changes

Slow convergence, count-to-infinity $\rightarrow$ Not suitable for large networks 🤪

👍 Advantage: easier and requires less resources than OSPF

Still sometimes used in small networks

OSPF: based on link-state

Addresses shortcomings of RIP

Faster convergence, no count-to-infinity, lower signaling overhead … 👏

Large networks can be divided into areas

Standard in large ASes (together with IS-IS)

BGP: Border Gateway Protocol

Good explanation:

BGP for Humans: Making Sense of Border Gateway Protocol

Exterior Gateway Protocols

In the previous section, we divided a large network into different autonomous systems (ASes). For autonomous systems to be able to communicate with each other, each AS needs at least one special intermediate system that serves as an interface to other ASes.

👍 Advantages:

Scalability

Size of routing tables depends on size of AS

Changes in routing tables are only propagated within an AS

Autonomy

Internet = network of networks

Routing can be controlled in the own network

Uniform interior routing protocol within the AS

Interior routing protocols of different ASes do not have to be identical

Border Gateway Protocol

The most important exterior gateway protocol

Path vector protocol

Extension of distance vector approach

BGP distributes paths, not metrics like costs etc.

With paths it is easy to guarantee that no loops exist

Based on policies of network operator

BGP in a Nutshell

What exactly is being distributed?

Paths (also called routes) that consist of

Target: prefixes (also called: network, network prefixes, IP address ranges)

Attributes: path, next hop

Each traversed AS adds its own AS number to the path

Traffic “follows” UPDATE messages in opposite direction

Example: HW07

AS 100 announces prefix 1.6.17.0/24. Describe how the routing information is distributed in the network.

The other two UPDATE messages (sent from R1 to R31 and R21) are handled in a similar way.
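How the AS path grows and why it prevents loops can be sketched as follows. This is a toy model of the path vector idea, not of real BGP UPDATE processing: policies and the next-hop attribute are left out, the AS numbers 200 and 300 are hypothetical, and the function names are made up.

```python
def receive_update(own_asn, prefix, as_path):
    """Return the route to store, or None if the path already contains own_asn."""
    if own_asn in as_path:
        return None  # loop detected: the announcement already crossed this AS
    return {"prefix": prefix, "as_path": list(as_path)}

def announce(own_asn, route):
    """Prepend the own AS number before passing the route to a neighboring AS."""
    return route["prefix"], [own_asn] + route["as_path"]

# AS 100 originates the prefix; AS 200 and AS 300 are hypothetical neighbors.
prefix, path = announce(100, {"prefix": "1.6.17.0/24", "as_path": []})
route_200 = receive_update(200, prefix, path)
prefix, path = announce(200, route_200)
route_300 = receive_update(300, prefix, path)
print(route_300["as_path"])  # [200, 100]

# If AS 300 announced the route back to AS 100, AS 100 would reject it.
looped = receive_update(100, *announce(300, route_300))
print(looped)  # None
```

Read from left to right, the stored path `[200, 100]` is the sequence of ASes that traffic from AS 300 towards the prefix traverses, i.e. the opposite direction of the announcements.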

BGP Structure

External BGP (EBGP)

Spoken between BGP routers of neighboring ASes

Announcement and forwarding of path information

Internal details of AS are NOT exchanged

Internal BGP (IBGP)

Between BGP routers within an AS

Synchronization of BGP routers of an AS

Establishment of transit routes

Categorization of Routing Protocols

Interplay of the Routing Approaches

Routing with BGP and IGPs

Assume Alice wants to send a packet to an external target (not part of the local IGP domain, e.g., 2.3.4.5).

How does an IGP router know what to do with this packet?

Is not strictly prescribed by BGP

Network operators can configure this freely

Different approaches possible

Approach 1: IGP distributes “default” routes

Unknown address/prefix packets are routed to default BGP router via shortest path

Good option for stub ASes

Not practicable for transit ASes

Example

Approach 2: Publication of external routes via IGP

Allows more fine-grained control such as „all Google traffic goes this way“

Cannot be done with all external routes (scalability!)

Usually combined with default route

Example

Approach 3: IGP router also speaks BGP

Forwarding table is built from two routing tables (BGP + IGP)

Often the case with big backbone providers

Example

BGP-Sessions

Point-to-point

Usually only between directly connected routers

Neighbors are called “peers”

BGP uses TCP connections between these routers

How to establish TCP connection without working IP routing?

IBGP: IGP of AS can be used

EBGP

Usually direct physical connection $\rightarrow$ no routing required

Manual configuration at both ends of connection

IBGP Connections

Simplest case: all BGP routers are fully meshed and connected directly to each other

BGP sessions must be kept active all the time

Bad scalability 👎

Alternative: Concentrate IBGP traffic in a single router

Called route reflector

Only route reflector has to maintain sessions with everyone else

More than one reflector used in practice for reliability reasons

Alternative: Form hierarchies of sub ASes

Called AS confederations

Can also be used to implement more complex policies

Confederation appears to outside as single AS

BGP Messages

OPEN

Establishment of BGP connection to peer BGP router

Important: TCP connection must already exist!

Authentication

UPDATE

Announcement of new or withdrawal of outdated path

Attention: Only sent if new, better paths available

KEEPALIVE

Keeps connection alive in absence of UPDATE messages

Acknowledgment for an OPEN request

Recommended KeepAliveTimer: 30 s

NOTIFICATION

Error message and tear down of BGP connection

Routing Information Base

BGP provides mechanisms for distributing path information

Does NOT dictate how routes should be chosen

No predefined routing metric

BGP uses policies

BGP instance of a router collects received and dispatched routing information in various internal tables

Routing Information Base (RIB)

Mainly for logical structuring

Structure

Adj-RIB-In (Adjacency RIB Incoming)

Exists per peer

Stores information received from this peer

Loc-RIB (Local RIB, Routing Information Base)

„Actual routing table“

Only preferred (= best=shortest) routes to destination networks are included here

Forwarding Information Base (FIB) is built based on Loc-RIB

Adj-RIB-Out (Adjacency RIB Outgoing)

Exists per peer

Contains routes published to this peer
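The interplay of the three tables can be sketched as follows. This is a simplified model under stated assumptions: the peer names, the AS numbers and the own AS number 500 are made up, the selection policy is "shortest AS path" (BGP itself does not prescribe one), and not announcing a route back to the peer it was learned from is a simplification chosen for this sketch.

```python
from collections import defaultdict

# Adj-RIB-In: one table per peer, mapping prefix -> AS path received from that peer.
adj_rib_in = defaultdict(dict)
adj_rib_in["peer_A"]["1.6.17.0/24"] = (200, 100)
adj_rib_in["peer_B"]["1.6.17.0/24"] = (400, 300, 100)


def build_loc_rib(adj_rib_in):
    """Keep one preferred route per prefix (assumed policy: shortest AS path)."""
    loc_rib = {}
    for peer, routes in adj_rib_in.items():
        for prefix, as_path in routes.items():
            best = loc_rib.get(prefix)
            if best is None or len(as_path) < len(best[1]):
                loc_rib[prefix] = (peer, as_path)
    return loc_rib


def build_adj_rib_out(loc_rib, own_as, peers):
    """Per peer, the routes published to it, with the own AS number prepended."""
    out = {peer: {} for peer in peers}
    for prefix, (learned_from, as_path) in loc_rib.items():
        for peer in peers:
            if peer != learned_from:  # simplification: do not announce back
                out[peer][prefix] = (own_as,) + as_path
    return out


loc_rib = build_loc_rib(adj_rib_in)
adj_rib_out = build_adj_rib_out(loc_rib, own_as=500, peers=["peer_A", "peer_B"])
print(loc_rib)      # {'1.6.17.0/24': ('peer_A', (200, 100))}
print(adj_rib_out)  # {'peer_A': {}, 'peer_B': {'1.6.17.0/24': (500, 200, 100)}}
```

Replacing the comparison in `build_loc_rib` is where an operator's policy would enter in this model.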

Routing Table example: HW08

AS 100 announces prefix 1.6.17.0/24. Fill out the simplified routing table of R5

🔴 Challenges

BGP “struggles” with many challenges and problems, e.g.,

Maintaining scalability

Security problems

Two-Dimensional Random Variables

Sun, 05 Jun 2022 00:00:00 +0000

Distribution Function and Density

A vector-valued function
$$ \underline{X}=\underline{X}(\omega): \Omega \rightarrow \mathbb{R}^{2} $$
that assigns a vector $\underline{x}=\left[\begin{array}{l}x_{1} \\ x_{2}\end{array}\right]$ to every outcome $\omega \in \Omega$ is called a multidimensional (here: two-dimensional) random variable if the preimage of every interval $I_{\underline{a}}=\left(-\infty, a_{1}\right] \times\left(-\infty, a_{2}\right] \subset \mathbb{R}^{2}$ is an event:
$$ \underline{X}^{-1}\left(I_{\underline{a}}\right) \in \mathfrak{B}, \quad \forall \underline{a} \in \mathbb{R}^{2}. $$
Distribution function

The function
$$ \begin{aligned} F_{\underline{X}}(\underline{x}) &=F_{X_{1}, X_{2}}\left(x_{1}, x_{2}\right) \\ &=\mathrm{P}\left(X_{1} \leq x_{1}, X_{2} \leq x_{2}\right) \end{aligned} $$
of the two-dimensional random variable $\underline{X}$ is called the distribution function of $\underline{X}$.

Density

The density of the two-dimensional random variable $\underline{X}$ is the mixed second partial derivative of the distribution function $F_{\underline{X}}(\underline{x})$:
$$ f_{\underline{X}}(\underline{x})=f_{X_{1}, X_{2}}\left(x_{1}, x_{2}\right)=\frac{\partial^{2}}{\partial x_{1} \partial x_{2}} F_{X_{1}, X_{2}}\left(x_{1}, x_{2}\right) $$
If both components are discrete, their “density” is written as
$$ f_{\underline{X}}(\underline{x})=\sum_{n=1}^{\infty} \sum_{k=1}^{\infty} \mathrm{P}\left(X_{1}=x_{1, n}, X_{2}=x_{2, k}\right) \cdot \delta\left(x_{1}-x_{1, n}, x_{2}-x_{2, k}\right) $$
with the two-dimensional $\delta$ distribution $\delta(x_1, x_2)$ and the individual probabilities $\mathrm{P}\left(X_{1}=x_{1, n}, X_{2}=x_{2, k}\right)$.

Marginal Densities and Conditional Densities

Let $\underline{X}$ be a two-dimensional random variable with density $f_{\underline{X}}(\underline{x})=f_{\underline{X}}\left(x_{1}, x_{2}\right)$. Then
$$ \begin{array}{l} f_{X_{1}}\left(x_{1}\right)=\int_{-\infty}^{\infty} f_{\underline{X}}\left(x_{1}, x_{2}\right) \mathrm{d} x_{2} \\ f_{X_{2}}\left(x_{2}\right)=\int_{-\infty}^{\infty} f_{\underline{X}}\left(x_{1}, x_{2}\right) \mathrm{d} x_{1} \end{array} $$
are called the marginal densities of $\underline{X}$.

Let $\underline{X}$ be a two-dimensional random variable with density $f_{\underline{X}}(x_1, x_2)$, and let $f_{X_1}(x_1) > 0$ and $f_{X_2}(x_2) > 0$. Then
$$ f_{X_{1}}\left(x_{1} \mid X_{2}=x_{2}\right)=\frac{f_{\underline{X}}\left(x_{1}, x_{2}\right)}{f_{X_{2}}\left(x_{2}\right)} $$
is called the conditional density of $X_1$ given $X_2 = x_2$, and
$$ f_{X_{2}}\left(x_{2} \mid X_{1}=x_{1}\right)=\frac{f_{\underline{X}}\left(x_{1}, x_{2}\right)}{f_{X_{1}}\left(x_{1}\right)} $$
is the conditional density of $X_2$ given $X_1 = x_1$.

Law of total probability for densities
$$ f_{X_{1}}\left(x_{1}\right)=\int_{-\infty}^{\infty} f_{X_{1}}\left(x_{1} \mid X_{2}=x_{2}\right) f_{X_{2}}\left(x_{2}\right) \mathrm{d} x_{2} $$

Bayes' theorem for densities
$$ f_{X_{2}}\left(x_{2} \mid X_{1}=x_{1}\right)=\frac{f_{X_{1}}\left(x_{1} \mid X_{2}=x_{2}\right) f_{X_{2}}\left(x_{2}\right)}{\int_{-\infty}^{\infty} f_{X_{1}}\left(x_{1} \mid X_{2}=x_{2}\right) f_{X_{2}}\left(x_{2}\right) \mathrm{d} x_{2}} $$
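A small discrete analogue may make these formulas concrete: sums replace the integrals and probabilities replace the densities. The joint table below is made up for illustration, and the helper names `marginal`, `cond_x1_given_x2` and `cond_x2_given_x1_bayes` are hypothetical.

```python
from fractions import Fraction as F

# Hypothetical joint probabilities P(X1 = x1, X2 = x2) of two discrete variables.
joint = {
    (0, 0): F(1, 8), (0, 1): F(3, 8),
    (1, 0): F(1, 4), (1, 1): F(1, 4),
}


def marginal(joint, axis):
    """Sum the joint table over the other component."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], F(0)) + p
    return out


p_x1 = marginal(joint, 0)  # P(X1 = 0) = 1/2, P(X1 = 1) = 1/2
p_x2 = marginal(joint, 1)  # P(X2 = 0) = 3/8, P(X2 = 1) = 5/8


def cond_x1_given_x2(x1, x2):
    """Conditional probability P(X1 = x1 | X2 = x2)."""
    return joint[(x1, x2)] / p_x2[x2]


def cond_x2_given_x1_bayes(x2, x1):
    """Bayes: P(x2 | x1) = P(x1 | x2) P(x2) / sum over v of P(x1 | v) P(v)."""
    numerator = cond_x1_given_x2(x1, x2) * p_x2[x2]
    denominator = sum(cond_x1_given_x2(x1, v) * p_x2[v] for v in p_x2)
    return numerator / denominator


# Law of total probability: the Bayes denominator is the marginal of X1.
total = sum(cond_x1_given_x2(0, v) * p_x2[v] for v in p_x2)
print(total == p_x1[0])              # True
print(cond_x2_given_x1_bayes(1, 0))  # 3/4
```

Exact fractions are used so that the identities hold without rounding error.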

The conditional expected value of a random variable $X_1$ given $X_2 = x_2$ is
$$ \mathrm{E}_{f_{\underline{X}}}\left\{X_{1} \mid X_{2}=x_{2}\right\}=\int_{-\infty}^{\infty} x_{1} f_{X_{1}}\left(x_{1} \mid X_{2}=x_{2}\right) \mathrm{d} x_{1} $$
Independence of Random Variables

Two random variables $X, Y$ are called independent if
$$ f_{X, Y}(x, y)=f_{X}(x) \cdot f_{Y}(y) $$
It then also holds that
$$ f_{X}(x \mid Y=y)=f_{X}(x) $$
Expected value for two-dimensional random variables:
$$ \mathrm{E}_{f_{X, Y}}\{g(X, Y)\}=\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) f_{X, Y}(x, y) \mathrm{d} x \mathrm{~d} y $$
The covariance $\sigma_{X, Y}=\operatorname{Cov}_{f_{X, Y}}\{X, Y\}$ of two random variables $X$ and $Y$ is
$$ \sigma_{X, Y}=\operatorname{Cov}_{f_{X, Y}}\{X, Y\}=\mathrm{E}\{(X-\mathrm{E}\{X\}) \cdot(Y-\mathrm{E}\{Y\})\}=\mathrm{E}\left\{\left(X-\mu_{x}\right) \cdot\left(Y-\mu_{y}\right)\right\} $$
The correlation coefficient of $X$ and $Y$:
$$ \rho_{X, Y}=\frac{\operatorname{Cov}_{f_{X, Y}}\{X, Y\}}{\sqrt{\operatorname{Var}_{f_{X}}\{X\} \operatorname{Var}_{f_{Y}}\{Y\}}}=\frac{\sigma_{X, Y}}{\sigma_{X} \cdot \sigma_{Y}} \in [-1, 1] $$

It is a measure of the similarity (more precisely, the linear dependence) of the random variables $X$ and $Y$.

$\left|\rho_{X, Y}\right|=1$: $X$ and $Y$ are maximally similar

$\left|\rho_{X, Y}\right|=0$: $X$ and $Y$ are completely dissimilar (i.e., $X$ and $Y$ are uncorrelated)

Independent random variables are uncorrelated. (The converse does NOT hold in general!)

If $X$ and $Y$ are normally distributed and $[X, Y]^\top$ has a two-dimensional normal distribution, then uncorrelatedness ($\rho_{X, Y} = 0$) also implies the independence of $X$ and $Y$.
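A standard counterexample shows why uncorrelated does not imply independent. The sketch below assumes $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$, a choice made here for illustration, so $Y$ is completely determined by $X$ and yet the covariance vanishes.

```python
from fractions import Fraction as F

# X is uniform on {-1, 0, 1} and Y = X**2, so Y is a function of X.
joint = {(-1, 1): F(1, 3), (0, 0): F(1, 3), (1, 1): F(1, 3)}


def expect(g):
    """Expected value of g(X, Y) under the joint distribution."""
    return sum(p * g(x, y) for (x, y), p in joint.items())


mu_x = expect(lambda x, y: x)                       # 0
mu_y = expect(lambda x, y: y)                       # 2/3
cov = expect(lambda x, y: (x - mu_x) * (y - mu_y))  # 0, hence rho = 0

p_x0 = sum(p for (x, _), p in joint.items() if x == 0)  # P(X = 0) = 1/3
p_y0 = sum(p for (_, y), p in joint.items() if y == 0)  # P(Y = 0) = 1/3
p_x0_y0 = joint.get((0, 0), F(0))                       # P(X = 0, Y = 0) = 1/3

print(cov)                     # 0
print(p_x0_y0 == p_x0 * p_y0)  # False: the joint does not factorize
```

The covariance only captures linear dependence, so the purely quadratic relation between $X$ and $Y$ goes unnoticed.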

If $\underline{X}=\left[X_{1}, X_{2}, \ldots, X_{N}\right]^{\top}$ is an $N$-dimensional random vector, its covariance matrix is
$$ \begin{array}{l} \operatorname{Cov}_{f_{\underline{X}}}\{\underline{X}\}=\mathrm{E}_{f_{\underline{X}}}\left\{(\underline{X}-\underline{\mu})(\underline{X}-\underline{\mu})^{\top}\right\}\\ =\left[\begin{array}{cccc} \operatorname{Var}_{X_{1}}\left\{X_{1}\right\} & \operatorname{Cov}_{X_{1}, X_{2}}\left\{X_{1}, X_{2}\right\} & \cdots & \operatorname{Cov}_{X_{1}, X_{N}}\left\{X_{1}, X_{N}\right\} \\ \operatorname{Cov}_{X_{2}, X_{1}}\left\{X_{2}, X_{1}\right\} & \operatorname{Var}_{X_{2}}\left\{X_{2}\right\} & \cdots & \operatorname{Cov}_{X_{2}, X_{N}}\left\{X_{2}, X_{N}\right\} \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{Cov}_{X_{N}, X_{1}}\left\{X_{N}, X_{1}\right\} & \operatorname{Cov}_{X_{N}, X_{2}}\left\{X_{N}, X_{2}\right\} & \cdots & \operatorname{Var}_{X_{N}}\left\{X_{N}\right\} \end{array}\right]\\ =\left[\begin{array}{cccc} \sigma_{X_{1}}^{2} & \rho_{X_{1}, X_{2}} \sigma_{X_{1}} \sigma_{X_{2}} & \cdots & \rho_{X_{1}, X_{N}} \sigma_{X_{1}} \sigma_{X_{N}} \\ \rho_{X_{2}, X_{1}} \sigma_{X_{2}} \sigma_{X_{1}} & \sigma_{X_{2}}^{2} & \cdots & \rho_{X_{2}, X_{N}} \sigma_{X_{2}} \sigma_{X_{N}} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{X_{N}, X_{1}} \sigma_{X_{N}} \sigma_{X_{1}} & \rho_{X_{N}, X_{2}} \sigma_{X_{N}} \sigma_{X_{2}} & \cdots & \sigma_{X_{N}}^{2} \end{array}\right] \end{array} $$

Detail

A covariance matrix is always symmetric and positive semidefinite; if it is nonsingular, it is positive definite.

Label Switching

Wed, 10 Mar 2021 00:00:00 +0000

Summary of Label Switching

Motivation

Issues related to IP based routing

Lookup is rather complex

Longest matching prefix $\rightarrow$ high performance forwarding needed

Shortest path routing selects shortest path to destination

Multiple paths to destination cannot be utilized concurrently $\rightarrow$ traffic engineering desirable

Strictly packet based

Each IP datagram is handled individually – no support for data streams (flows) 🤪

Flows

What is a flow?

A flow is a sequence of packets traversing a network that share a set of header field values.

Different levels of granularity possible, e.g.,

All packets belonging to a particular TCP connection

HTTPS traffic

VoIP traffic

Of a particular sender

Within a network

Example

Flow Based Forwarding

Fundamental concept, independent of certain layers

Can span multiple layers

Incorporates classic routing/forwarding concepts

Goes beyond classic concepts

Aggregation

Micro-flows

Consider a single “connection” e.g., a TCP connection

Fine grained control

High number of flows possible

Macro-flows

Higher level of aggregation

Aggregation of several “connections”

e.g., IP destination address in specific subnet

Lower number of flows

Label Switching

Classification of Communication Networks

Label Switching

Combination of

Packet switching

Packets are forwarded individually (data path is NOT fixed)

Packets include metadata needed for forwarding decision

Circuit switching

Paths established for flows through the network (data path is fixed)

Simple forwarding decision

Differentiation of flows possible

Load balancing

Quality of service (QoS)

Implementation

Switching at layer 2 instead of routing at layer 3

Labels: Identification which is only locally valid

Virtual circuits: Sequence of labels

Label

Short unstructured identification of fixed length

Does NOT carry any layer-3-information

Unique: only locally at the corresponding switch

Label swapping: Mapping from input label to output label

Virtual circuit: Identified through sequence of labels at the path

Transport of Label

Label must be transported within the packet

Additional „header“ in the packet, between headers of layer 2 and layer 3 $\rightarrow$ layer 2.5

Alternative: In specialized fields within existing packet headers

IPv6: flow label (20 bit field in IPv6 header, to identify micro flows more easily)

Label Switching Domain

Basic architecture

Border of the domain (edge devices)

Add / remove label

Map flow to forwarding class

Access control

…

Within the domain (switching device)

Forward packets based on label information

Label swapping

Label Forwarding Information Base

Forwarding table in case of label switching: Efficient access through label (NO longest prefix matching needed).
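A minimal sketch of label swapping with a label forwarding information base, assuming one dictionary per switch. The switch names `S1` to `S3` and all label values are hypothetical; the point is only that each hop performs an exact-match lookup on a locally valid label.

```python
# One LFIB per switch: incoming label -> (outgoing label, next hop).
# Switch names and label values are made up for illustration.
lfib = {
    "S1": {5: (2, "S2")},
    "S2": {2: (7, "S3")},
    "S3": {7: (1, "egress")},
}


def forward_along_lsp(first_switch, first_label):
    """Follow a packet through the domain, swapping the label at every hop."""
    switch, label = first_switch, first_label
    hops = []
    while switch in lfib:
        out_label, next_hop = lfib[switch][label]  # exact match, no prefix search
        hops.append((switch, label, out_label))
        switch, label = next_hop, out_label
    return hops


print(forward_along_lsp("S1", 5))
# [('S1', 5, 2), ('S2', 2, 7), ('S3', 7, 1)]
```

The sequence of labels collected in `hops` is what the notes call a virtual circuit.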

Example:

Multiprotocol Label Switching (MPLS)

General Aspects

MPLS

Based on label switching

Originally: data plane optimization

Standardized within the IETF

Increasingly applied in larger autonomous systems

Main Features

Fast forwarding (due to reduced amount of packet processing)

QoS support

Guarantees on latency and capacity, e.g., for voice traffic

Traffic engineering

Supports load balancing in order to optimize network utilization …

Virtual private networks

Isolate traffic from other packets on the Internet

Multiple networks support

Usable on different network technologies, e.g., IP, ATM …

👍 Advantages

Clear separation of forwarding (label switching) and control (manipulation of label binding)

Not limited to IP

Support of metrics

Versatile concept

Scales

Architecture, Components and Basic Operation

Architecture

Components

Label-switching router (LSR)

MPLS-capable IP router

Can forward packets based on both IP prefixes and MPLS labels

Typically: IP for control plane and MPLS for data plane

Architecture:

Label edge router (LER)

Router at the edge of an MPLS domain

Each LSR with a non-MPLS capable neighbor is an LER

Also called: label ingress router resp. label egress router

Classifies packets that enter the MPLS domain

Forwarding equivalence class (FEC)

MPLS-Node: General term for MPLS-capable intermediate systems, like LSRs

Forwarding Equivalence Class

Class of packets that should be treated equally

Same path through the network

Same QoS properties

Basis for label assignment

MPLS-specific term, roughly comparable to „flow“

Example

Same address prefix and same type-of-service field

Same IP addresses and same port numbers

VoIP traffic with destination address in subnet X

Granularity

Coarse-grained: Important for quick forwarding and scalability

Fine-grained: Important for differentiated treatment of packets or flows

Example 1: Very fine granular FEC (“micro flow”)

A single TCP connection, identified by 5-tuple

Example 2: data streams differentiation

Traffic engineering

Usage of different paths

Goals

Load balancing

Utilization of all available resources

Prioritization of individual data streams

(realized through separate virtual connections)

Support of quality of service

Different quality of service for different data streams

Label Switched Path

Virtual connection: Sequence of labels on a path through MPLS domain.

Example:

MPLS-Label

Encapsulation: Between headers of layer 2 (Data Link layer) and layer 3 (Network layer)

Label: the label itself

Exp: Bits for experimental usage

S: Stack-bit

TTL: Time-to-live
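The four fields can be packed into one 32-bit entry as sketched below. The field widths (label 20 bits, Exp 3 bits, S 1 bit, TTL 8 bits) are not stated in the notes; they follow the standard MPLS label stack encoding (RFC 3032). The function names `pack_mpls` and `unpack_mpls` are hypothetical.

```python
import struct


def pack_mpls(label, exp, bottom_of_stack, ttl):
    """Build one 4-byte MPLS label stack entry in network byte order."""
    if not (0 <= label < 2**20 and 0 <= exp < 8 and 0 <= ttl < 256):
        raise ValueError("field out of range")
    word = (label << 12) | (exp << 9) | (int(bottom_of_stack) << 8) | ttl
    return struct.pack("!I", word)


def unpack_mpls(data):
    """Split the first 4 bytes of data back into the four fields."""
    (word,) = struct.unpack("!I", data[:4])
    return {
        "label": word >> 12,
        "exp": (word >> 9) & 0x7,
        "s": bool((word >> 8) & 0x1),
        "ttl": word & 0xFF,
    }


entry = pack_mpls(label=5, exp=0, bottom_of_stack=True, ttl=64)
print(entry.hex())         # 00005140
print(unpack_mpls(entry))  # {'label': 5, 'exp': 0, 's': True, 'ttl': 64}
```

With a label stack, only the innermost entry carries S = 1; all entries above it carry S = 0.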

Label Distribution

Label Binding

Associate specific label to FEC

Stored in label forwarding information base

Used as incoming label

Label distribution

Label binding is distributed to neighboring routers

Stored in label forwarding information base

Used as outgoing label

Types of Label Distribution

“Roles” of a label-switching router

Downstream LSR: In direction of data flow

Upstream LSR: Against direction of data flow

Unsolicited downstream

Router generates label bindings as soon as it is ready to forward MPLS packets of the respective FEC

Upstream neighbors (according to IP routing): update forwarding tables

Label used as outgoing label

Non-upstream neighbors can store label for later use

Quicker reactions on route changes

Downstream on demand

Downstream router generates label binding on demand

Upstream router has to request label binding for FEC

Label Distribution Protocol

RSVP (Resource ReserVation Protocol)

🎯 Goal: bandwidth reservation for end-to-end data streams

Soft state principle

Establish a session and periodically signal that session is still alive

In case of failure state is automatically removed after some time

Signaling

Path message

From sender to receiver

Find path to receiver

Each hop is recorded in the message

Resv message

From receiver to sender

Bandwidth reservation on return path

RSVP-TE (Traffic Engineering)

Extension to RSVP to support label distribution

Many additional fields and functionality, e.g., fast reroute

Signaling

Path message

From upstream LER to downstream LER

Label request

Source route (“explicit route”) [optional]

Resv message

In response to path message

From downstream LER to upstream LER

Label binding (hop-by-hop)

Virtual Private Networks

MPLS is useful for virtual private networks (VPNs)

Use case: VPN traffic engineering

Customer with sites at different locations (e.g., different cities) wants to lease seamless “network” service

Requirements

Connect physically remote locations

Carry IP-based intranet traffic

Each customer has obtained an IP address block

Guaranteed bandwidth / SLAs

Options

“Dark fibre” provider

VPN backbone provider

Example: Private Networks over “Dark Fibre”

Suppose that three companies have sites at remote locations

Company A: Karlsruhe, Paris, Zürich

Company B: Karlsruhe, Paris

Company C: Karlsruhe, Paris

Each company runs a private network

Different subnet for each site from the customer's IP address space

Router connects site to other site(s)

Data is transported over leased fiber optic cables (“dark fibre”)

Capacity 155 Mbit/s, utilization marked in graph

A provider uses MPLS to offer virtual private networks

Has „points of presence (PoP)“ in all three cities

Offers bandwidth at arbitrary rates

Is cheaper than leasing fiber optic cables

Question: Can the provider serve the need of all three companies?

The answer is: YES! By utilizing non-shortest paths!

We can achieve that using VPNs implemented by Label Switching

Outer label: identifies path to LER

Inner label: identifies VPN instance / customer

For company A:

Inner label $5$: Indicates that this packet belongs to company A

Outer labels $2, 7, 1$: Label switching/swapping

For company B:

For company C:

Label Distribution

Recall VPN example from above

LSP for customer B (Karlsruhe $\rightarrow$ Paris) should take a “detour” over Zürich to match bandwidth requirements

Setup of LSPs over explicitly given route with RSVP-TE

Example: LSP “Karlsruhe to Paris over Zürich”

RSVP-TE signaling initiated at upstream LER (LER-KA)

Note: LSPs are unidirectional!

How are the labels distributed?

LER-KA1 (upstream) sends Path Message to LER-P (downstream).

LER-P receives the Path Message and sends a Resv Message back.

Notice that label $2$ is used in the 5th step and again in the 8th step. This is valid because labels are only locally valid at the corresponding switch.

Resource

MPLS - Multiprotocol Label Switching (2.5 layer protocol)

Differentiation Rules for Matrices

Fri, 17 Jun 2022 00:00:00 +0000

For a matrix $\mathbf{C}$,
$$ \frac{\partial}{\partial \mathbf{C}}\left(\underline{a}^{\top} \cdot \mathbf{C} \cdot \underline{b}\right)=\underline{a} \cdot \underline{b}^{\top} $$

Example
$$ Q=\underbrace{\left[\begin{array}{ll} a_{1} & a_{2} \end{array}\right]}_{\boldsymbol{a}^\top}\left[\begin{array}{ll} c_{11} & c_{12} \\ c_{21} & c_{22} \end{array}\right]\underbrace{\left[\begin{array}{l} b_{1} \\ b_{2} \end{array}\right]}_{\boldsymbol{b}}=a_{1} b_{1} c_{11}+a_{2} b_{1} c_{21}+a_{1} b_{2} c_{12}+a_{2} b_{2} c_{22} $$
$$ \frac{\partial Q}{\partial \mathbf{C}}=\left[\begin{array}{ll} \frac{\partial Q}{\partial c_{11}} & \frac{\partial Q}{\partial c_{12}} \\ \frac{\partial Q}{\partial c_{21}} & \frac{\partial Q}{\partial c_{22}} \end{array}\right]=\left[\begin{array}{ll} a_{1} b_{1} & a_{1} b_{2} \\ a_{2} b_{1} & a_{2} b_{2} \end{array}\right]=\left[\begin{array}{l} a_{1} \\ a_{2} \end{array}\right]\left[\begin{array}{ll} b_{1} & b_{2} \end{array}\right] = \boldsymbol{a} \cdot \boldsymbol{b}^\top $$
For a symmetric matrix $\mathbf{C}$:

With $\underline{a}=\underline{e}$ and $\underline{b} = D \cdot \underline{e}$:
$$ \frac{\partial}{\partial \mathbf{C}} (\underline{e}^\top \mathbf{C} D \underline{e}) = \underline{e} \cdot \underline{e}^\top \cdot D^\top $$

With $\underline{a}=D \cdot \underline{e}$ and $\underline{b} = \underline{e}$:
$$ \frac{\partial}{\partial \mathbf{C}} (\underline{e}^\top D^\top \mathbf{C} \underline{e}) = D\cdot \underline{e}\cdot \underline{e}^\top $$

$$ \frac{\partial}{\partial \mathbf{K}}\left(\boldsymbol{a}^{\top} \cdot \mathbf{K} \cdot \mathbf{C} \cdot \mathbf{K}^{\top} \boldsymbol{b} \right)=\boldsymbol{a} \boldsymbol{b}^{\top} \mathbf{K} \mathbf{C}^{\top}+\boldsymbol{b} \boldsymbol{a}^{\top} \mathbf{K} \mathbf{C} $$

Let $\boldsymbol{a} = \boldsymbol{e}$, $\boldsymbol{b} = \boldsymbol{e}$, and let $\mathbf{C}$ be symmetric. Then
$$ \frac{\partial}{\partial \mathbf{K}}\left(\boldsymbol{e}^{\top} \cdot \mathbf{K} \cdot \mathbf{C} \cdot \mathbf{K}^{\top} \boldsymbol{e} \right)=\boldsymbol{e} \boldsymbol{e}^{\top} \mathbf{K} \mathbf{C}^{\top}+\boldsymbol{e} \boldsymbol{e}^{\top} \mathbf{K} \mathbf{C} = 2\boldsymbol{e} \boldsymbol{e}^{\top} \mathbf{K} \mathbf{C} $$
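The rule for $\frac{\partial}{\partial \mathbf{K}}\left(\boldsymbol{a}^{\top} \mathbf{K} \mathbf{C} \mathbf{K}^{\top} \boldsymbol{b}\right)$ can be checked numerically with central finite differences. The sketch below uses plain Python lists and arbitrarily chosen $2 \times 2$ values; all helper names are hypothetical, and $\mathbf{C}$ is deliberately not symmetric so that both terms of the formula matter.

```python
def matmul(A, B):
    """Matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


def outer(u, v):
    """Outer product u v^T."""
    return [[ui * vj for vj in v] for ui in u]


def add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]


def quad(a, K, C, b):
    """Scalar a^T K C K^T b."""
    M = matmul(matmul(K, C), transpose(K))
    return sum(a[i] * M[i][j] * b[j] for i in range(len(a)) for j in range(len(b)))


def numeric_grad(f, K, h=1e-6):
    """Central finite differences of the scalar f with respect to each entry of K."""
    G = [[0.0] * len(K[0]) for _ in K]
    for i in range(len(K)):
        for j in range(len(K[0])):
            K_plus = [row[:] for row in K]
            K_minus = [row[:] for row in K]
            K_plus[i][j] += h
            K_minus[i][j] -= h
            G[i][j] = (f(K_plus) - f(K_minus)) / (2 * h)
    return G


a, b = [1.0, 2.0], [3.0, -1.0]
K = [[0.5, -1.0], [2.0, 1.5]]
C = [[1.0, 4.0], [-2.0, 3.0]]  # deliberately not symmetric

# Analytic gradient: a b^T K C^T + b a^T K C
analytic = add(matmul(matmul(outer(a, b), K), transpose(C)),
               matmul(matmul(outer(b, a), K), C))
numeric = numeric_grad(lambda Kx: quad(a, Kx, C, b), K)
max_error = max(abs(x - y) for ra, rn in zip(analytic, numeric) for x, y in zip(ra, rn))
print(max_error < 1e-6)
```

Since the expression is quadratic in $\mathbf{K}$, central differences are exact up to rounding, so the remaining error reflects floating-point noise only.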

Software Defined Networks (SDNs)

Thu, 11 Mar 2021 00:00:00 +0000

SDN summary

Basics and Architecture

High Level View on Traditional IP Networks

Abstract view on an IP router

Control plane

Exchange of routing messages for calculation of routes …

Additional tasks, such as load balancing, access control, …

Data plane: Forwarding of packets at layer 3

Every router has control and data plane functions

Control plane: software running on the router

Data plane

usually application-specific integrated circuits

Can also be realized in software (virtual switches)

Control is decentralized.

🔴 Limitations: Limited flexibility for network operators

Manufacturer-specific management interfaces

Difficult (and often impossible) to introduce new functions

Complex, highly qualified operators required

Expensive (at least for core routers)

Current Trend: Software-Defined Networks (SDN)

👍 Advantages

Increase flexibility

Decrease dependencies on hardware and manufacturers

Commercial off-the-shelf switches (cheaper)

Characteristics

Separation of control plane and data plane

Control functionality resides on a logically centralized SDN controller

Controller is executed on commodity hardware $\rightarrow$ Reduces need for specialized routing hardware

Data plane consists of simple packet processors (SDN switches)

Control plane has global network view

Knows all switches and their configurations

Knows network topology

Network is software-programmable

Functionality is provided by network applications (network apps)

Different apps can realize different functionality

SDN controller can execute multiple apps in parallel

Processing is based on flows

Basic Operation

Control functionality is placed on the SDN controller

E.g., routing including routing table

Forwarding table is placed on SDN switch

Called flow table in the context of SDN

SDN controller programs entries in flow table according to its control functionality

Requires a protocol between controller and switch

For every incoming packet in the SDN switch

A suitable entry in the flow table needs to be determined

Flows and Flow Table

Flows: sequence of packets traversing a network that share a set of header field values

Here: Identified through match fields, e.g., IP address, port number

Flow table contains, among others, match fields and actions

Matches select appropriate flow table entry

Actions are applied to all packets that satisfy a match

E.g.

Flow rule

Decision of controller

Described in form of match fields, actions, switches

Flow Rule and Flow Table Entries

Controller (more precisely: app executed by controller) makes a high level decision, for example

a) Traffic for destination X has to be dropped

b) Connection between end system A and B has to go through switch S4

c) …

High level decision is represented in a certain format, i.e., as a set of flow rules in the form of match fields, actions and switches

Flow rules are transmitted (“installed”) to switches with the help of a communication protocol. They are stored as flow table entries in flow tables

Flow Programming

SDN provides two different modes

Proactive flow programming

Flow rules are programmed before first packet of flow arrives

Reactive flow programming

Flow rules are programmed in reaction to receipt of first packet of a flow

Three Important Interactions

Example: Proactive Flow Programming

Scenario

Details

Example: Reactive Flow Programming

Same scenario as above

Details

Proactive vs. Reactive Flow Programming

| Flow programming | Characteristics | Delay? | Loss of controller connectivity |
| --- | --- | --- | --- |
| Proactive | Coarse grained, pre-defined | No | Does not disrupt traffic |
| Reactive | Fine grained, on demand | Yes | New flows cannot be installed |

Proactive

Flow table entries have to be programmed before actual traffic arrives

Usually coarse grained pre-defined decisions

Not always applicable 🤪

No additional delays for new connections

Loss of controller connectivity does not disrupt traffic

Reactive

Allows fine grained on-demand control

Increased visibility of flows that are active in the network

Setup time for each flow $\rightarrow$ High overhead for short lived flows

New flows cannot be installed if controller connectivity is lost

SDN Architecture

Application Plane

Network apps perform network control and management tasks

Interacts via northbound API with control plane

Control Plane

Control tasks are „outsourced“ from data plane to logically centralized control plane

E.g., standard tasks such as topology detection, ARP …

More complex tasks can be delegated to application plane

E.g., routing decisions, load balancing …

Data Plane

Responsible for packet forwarding / processing

SDN switches are relatively simple devices

Efficient implementations in hardware (ASIC) or in software (virtual switches)

Supports basic operations such as match, forward, drop

Interacts via southbound API with control plane

Interfaces

Northbound API: between controller and network apps

Exposes control plane functions to apps

Abstract from details, apps can operate on consistent network view

Southbound API: between controller and switches

Exposes data plane functions to controller

Abstracts from hardware details

Westbound API: between controllers

Synchronization of network state information

E.g., coordinated flow setup, exchange of reachability information

Eastbound API: interface to legacy infrastructures

Usually proprietary

SDN Workflow in Practice

Workflow and Primitives

High level view:

In practice:

We need a piece of software (app) that realizes the new behavior

control_my_network.java

We need primitives to assist with creating the app

import OFMatch, OFAction, ...

We need a runtime environment that can execute our app

$ ./myController --runApp control_my_network.java

We need hardware support for SDN in the switches

Flow table(s)

Primitives for SDN Programming

🎯 Goal: From intended behavior to lower level flow rules

$\rightarrow$ This requires SDN programming primitives

Three important areas to cover

(1) Create and install flow rules

Sufficient for proactive use cases.

Example: Traffic with IP destination address 1.2.3.4 has to be forwarded to network B by switch S1

Needed: App that implements the corresponding logic

Represent the decision as flow rules

Program appropriate flow table entries into the switch

Suppose that we have static_forwarding.java

Creates a new flow rule

Sends the flow rule to S1

Details

Here we use a simple pseudo programming language

Language used in practice depends on controller

Different controllers support different languages: Java, Python, C, C++, …

Overview: Matches

Overview: Actions

Priorities

Priorities come into play if there are overlapping flow rules

No overlap = all potential packets can only be matched by at most one rule

Overlap = at least one packet could be matched by more than one rule

Example

Assume that all rules are created with same default priority (=1)

If two rules can overlap, priority has to be changed explicitly

Higher values = higher priority
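How priorities resolve overlapping rules can be sketched with a small flow table lookup. This is a simplified model, not an OpenFlow implementation: the `FlowEntry` class, the `lookup` function, the field name `TCP_DST` and the action strings are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FlowEntry:
    match: dict        # header field -> required value; an empty dict matches everything
    action: str
    priority: int = 1  # same default priority as assumed in the notes


def lookup(flow_table, packet):
    """Return the action of the highest-priority entry whose match fields all agree."""
    candidates = [e for e in flow_table
                  if all(packet.get(k) == v for k, v in e.match.items())]
    if not candidates:
        return None
    return max(candidates, key=lambda e: e.priority).action


flow_table = [
    FlowEntry({"IP_DST": "1.2.3.4"}, "OUTPUT:1"),
    FlowEntry({"IP_DST": "1.2.3.4", "TCP_DST": 80}, "DROP", priority=2),
    FlowEntry({}, "CONTROLLER", priority=0),
]

print(lookup(flow_table, {"IP_DST": "1.2.3.4", "TCP_DST": 80}))  # DROP
print(lookup(flow_table, {"IP_DST": "1.2.3.4", "TCP_DST": 22}))  # OUTPUT:1
print(lookup(flow_table, {"IP_DST": "9.9.9.9"}))                 # CONTROLLER
```

The first packet is matched by all three entries, so without the explicit priority 2 on the more specific rule the outcome would be ambiguous.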

Multiple Flow Tables

SDN switches can support more than one flow table

Using multiple tables has several benefits

Can be used to isolate flow rules from different apps

Logical separation between different tasks (one table for monitoring, one table for security, …)

In some situations: fewer flow table entries overall

Similar to single table case

r.TABLE(x): specify the table for this rule

r.ACTION('GOTO', y): specify processing continues in another table

Avoid cycles: Can NOT go to lower flow table number

GOTO from table x to table y $\Rightarrow$ y > x
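The forward-only constraint on GOTO can be checked mechanically. The sketch below assumes each rule is reduced to a pair of table numbers; the function name `validate_gotos` is hypothetical.

```python
def validate_gotos(rules):
    """rules: iterable of (table, goto_table or None).

    Raises ValueError if any GOTO does not point to a strictly higher table number.
    """
    for table, goto in rules:
        if goto is not None and goto <= table:
            raise ValueError(f"GOTO from table {table} to table {goto} is not allowed")
    return True


# Table 0 continues in table 1, table 1 continues in table 3, table 3 ends processing.
print(validate_gotos([(0, 1), (1, 3), (3, None)]))  # True
```

Because table numbers strictly increase along every GOTO chain, processing must terminate after at most as many steps as there are tables.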

Example

(2) React to data plane events

onPacketIn(packet, switch, inport)

Called if the controller receives a packet that was forwarded via r.ACTION('CONTROLLER')

Parameters

packet: contains packet that was forwarded and grants access to its header fields

packet.IP_SRC

packet.IP_DST

packet.MAC_SRC

packet.MAC_DST

packet.TTL

…

switch: the switch the packet was received at (e.g., S1)

inport: the interface the packet was received at (e.g., port 1)

Example

Sketch

Create a low priority flow rule that sends „all unknown packets“ to the controller

r.MATCH('*') // match on everything

r.ACTION('CONTROLLER') // send packet to controller

r.PRIORITY(0) // use lowest priority for this flow rule

Use onPacketIn() to create and install flow rules on demand

Details

(3) Inject individual packets

Handle individual packets from within the app

Forward a packet that was sent to the controller

Perform topology detection

Active monitoring („probe packets“)

Answer ARP requests

send_packet(packet, switch, rule)

Injects a single packet into a switch

Parameters

packet: contains the packet that should be injected

switch: the switch where the packet is injected

rule: a flow rule that is applied to this packet instead of default flow table processing (optional)

Only rule.ACTION() is allowed here

No matches, no priorities

Different from installing flow rules

Used for a single packet only

The flow table is not changed

Even if the rule parameter is present, this does NOT create a new flow table entry

Inject and process injected packet with a custom rule

Directly attaches the actions to the injected packet

Rule is only used for a single packet

Flow table remains unchanged

Advantages

Efficient

Consistent

Example

newPacket = createNewPacket()

customRule = Rule()

customRule.ACTION('OUTPUT', 1)

send_packet(newPacket, switch, customRule)

Summary on Primitives

Entry point primitives: Callbacks to implement custom logic

onConnect(switch)

Called if a new control connection to switch is established

onPacketIn(packet, switch, inport)

Called if a packet was forwarded to the controller

Flow rule creation primitives: Used to define flow rules

Rule.MATCH()

Select packets based on certain header fields

Rule.ACTION()

Specify what happens to a packet in the switch

Rule.PRIORITY()

Specify the priority of the created flow rule

Rule.TABLE()

Specify the flow table the rule should be applied to

Switch interaction primitives: Used to handle flow rule installation and packet injection

send_rule(rule, switch): Installs a flow rule and creates the associated flow table entry in the switch

send_packet(packet, switch): Injects a single packet into a switch; the packet is processed with the existing flow table entries

send_packet(packet, switch, rule): Injects a single packet into a switch; the packet is processed with the custom rule

Learning Switch Example

Goal: learn port-address association of end systems

Switch receives packet and does not know destination address

Floods packets on all active ports

Learns “location” of the end system with this destination address

- Remembers that end system is accessible via this port
- Entry in table `<MAC address, port, lifetime>`

Switch receives packet and knows destination address

Forwards packet via corresponding port
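The behaviour described above can be sketched in a few lines of Python. This is a toy model, not a real switch API: the class name `LearningSwitch`, the `FLOOD` pseudo-port and the default lifetime of 300 seconds are assumptions made for illustration. As in classic learning bridges, the table entry is created from the source address of each received packet.

```python
import time

FLOOD = "FLOOD"  # pseudo-port: send on all active ports except the ingress port


class LearningSwitch:
    """Toy model of a learning switch (hypothetical, for illustration only)."""

    def __init__(self, lifetime_s=300.0, clock=time.monotonic):
        self.lifetime_s = lifetime_s  # assumed entry lifetime, chosen arbitrarily
        self.clock = clock
        self.table = {}               # MAC address -> (port, expiry time)

    def receive(self, src_mac, dst_mac, in_port):
        now = self.clock()
        # Learn: the sender of this packet is reachable via the ingress port.
        self.table[src_mac] = (in_port, now + self.lifetime_s)
        # Forward: use the learned port if an unexpired entry exists.
        entry = self.table.get(dst_mac)
        if entry is not None and entry[1] > now:
            return entry[0]
        self.table.pop(dst_mac, None)  # remove a stale entry, if any
        return FLOOD
```

The first packet towards an unknown destination is flooded; once the destination has sent a packet itself, later packets are forwarded on the learned port only.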

We can do the same with SDN: Learning switch app

Observe packets by controller

Derive locations of end systems

Program forwarding rules to allow connectivity between end systems based on MAC addresses and port numbers

Naïve Approach

Send all packets to controller

Controller looks at INPORT and source MAC address

Controller creates rules based on these two pieces of information

Packets with unknown destination addresses are flooded to all ports

Implementation

🔴 Problem

Version 2

Delay rule installation until the destination address was learned (not the source address)

Avoids installing rules „too early“

Implementation

Consider the example above:

🔴 Problem

Version 3

Only matching on destination address is not specific enough

Use more specific matches

Makes sure that all end systems can be learned by controller

Implementation

Consider the example in Version 2:

🔴 Problem: flow table resources

Needs N*N flow entries for N end systems

May exceed table capacity! 🤪

The amount of flow table entries required is an important factor for usability and scalability.

Version 4

Separate flow tables for learning and forwarding

Flow table FT1 matches on source address and forwards to controller, if address was not yet learned

Flow table FT2 matches on destination address and forwards packet to destination (if learned) or floods packet (if not learned)

Only 2*N rules for N end systems

🔴 Problem: Hardware often does not support multiple flow tables due to cost, energy or space constraints
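To make the difference between Version 3 and Version 4 concrete, the following sketch simply evaluates the two entry counts stated in the notes (N*N versus 2*N). The function names are hypothetical.

```python
def entries_single_table(n):
    """Version 3: one rule per (source, destination) pair, N*N as stated in the notes."""
    return n * n


def entries_two_tables(n):
    """Version 4: N rules in FT1 (sources) plus N rules in FT2 (destinations)."""
    return 2 * n


for n in (10, 100, 1000):
    print(n, entries_single_table(n), entries_two_tables(n))
```

For 1000 end systems this is one million entries against two thousand, which is why table capacity dominates the design.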

OpenFlow

Rough Overview

A standard for an SDN southbound interface

Defines the interaction between controller and switches

Defines a logical architecture for SDN switches (flow table, …)

Defined by the Open Networking Foundation (ONF)

Supports

All basic structures and primitives discussed in previous section

Matches

Actions

Priorities

Multiple flow tables

Protocol mechanisms for

Creating flow rules

Reacting to data plane events

Injecting individual packets

More sophisticated features

Group table

Rate limiting

Structure

Provides a uniform view on SDN-capable switches

Ports

Represent logical forwarding targets

Can be selected by the output action

Physical ports = hardware interfaces

Reserved ports (special meaning)

ALL

Represents all ports eligible to forward a specific packet (= flooding);

Ingress port is automatically excluded from forwarding

IN_PORT

Always references ingress port of a packet (= send packet back the way it came)

CONTROLLER

Forwarding a packet on this port sends it to the controller

NORMAL

Yields control of the forwarding process to the vendor-specific switch implementation

Logical ports

Provide abstract forwarding targets (vendor-specific)

Link aggregation: Multiple interfaces are combined to a single logical port

Transparent tunneling: Traffic is forwarded via intermediate switches

Flow table

Counters

The number of processed packets (counter)

Timeouts

Maximum lifetime of a flow

Enables automatic removal of flows

Cookie

Marker value set by an SDN controller

Not used during packet processing

Simplifies flow management

Flags

Indicate how a flow is managed

E.g., notify controller when a flow is automatically removed

Pipeline Processing

Multiple flow tables can be chained in a flow table pipeline

Flow tables are numbered in the order they can be traversed by packets

Processing starts at flow table 0

Only “forward” traversal is possible $\rightarrow$ no recursion

Actions are accumulated in an action set during pipeline processing

Divided into ingress and egress processing

Example

Building an action set

Ingress Processing

Starts at flow table 0

Initial action set is empty

Egress Processing

Optionally follows ingress or group table processing

Egress flow tables must have higher table numbers than ingress tables $\rightarrow$ No return to ingress processing

Group Tables

Group entry:

Group tables represent additional forwarding methods (E.g., link selection, fast failover, …)

Group entries can be invoked from other tables via group actions

They are referenced by their unique group identifier

Flow table entries can perform group actions during ingress processing

Effect of group processing depends on the group type and its action buckets

Action buckets

Each group references zero or more action buckets

Not every action bucket of a group has to be executed

A group with no action buckets drops a packet

An action bucket contains a set of actions to execute (just like an action set)

Group types

All: executes all buckets in a group (E.g., for broadcast)

Indirect: executes the single bucket in a group

Indirect groups must reference exactly one action bucket

Useful to avoid changing multiple flow table entries with common actions

Select: selects one of many buckets of a group (E.g., select by round-robin or hashing of packet data)

Fast failover: executes first live bucket in a group

Each bucket is associated with a port that determines its liveness
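The four group types can be summarised as a small dispatch function. This is a sketch under assumptions: buckets are plain dictionaries, the `watch_port` key and the `live_ports` argument are hypothetical names, and the select type uses a CRC32 hash although round-robin would be equally valid.

```python
import zlib


def execute_group(group_type, buckets, packet_key=b"", live_ports=frozenset()):
    """Return the list of buckets that would be executed for one packet."""
    if group_type == "indirect":
        if len(buckets) != 1:
            raise ValueError("indirect groups must reference exactly one bucket")
        return [buckets[0]]
    if not buckets:
        return []  # a group without buckets drops the packet
    if group_type == "all":
        return list(buckets)
    if group_type == "select":
        # Assumption: hash-based selection keeps packets of one flow on one bucket.
        return [buckets[zlib.crc32(packet_key) % len(buckets)]]
    if group_type == "fast_failover":
        for bucket in buckets:
            if bucket.get("watch_port") in live_ports:
                return [bucket]  # first live bucket wins
        return []  # no live bucket: the packet is dropped
    raise ValueError(f"unknown group type: {group_type}")
```

With two buckets watching ports 1 and 2, a fast failover group switches to the second bucket as soon as port 1 is no longer in `live_ports`, without any controller interaction.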

Example

Indirect Group Tables

🎯 Goal: Reroute flows to avoid forwarding via switch S2

Output ports specified in flow tables are subject to change

SDN controller must send multiple modify-state messages to SDN switches

One message for each flow that needs to be updated 🤪

Optimization

Use an indirect group to avoid sending multiple modify-state messages

Redirect flows with identical forwarding behavior to that group

Modify the group's actions when forwarding behavior changes

Advantage: Instead of modifying a great number of entries in flow table, we just need to modify one entry in group table!

Additional material on OpenFlow

Flow Table in OpenFlow

Flow tables contain match/action-associations

Matches select the appropriate flow table entries

Actions are applied to all packets that satisfy a match

Table-miss flows capture all unmatched packets

Enables reactive flow programming

Corresponding flow table entry has lowest priority

Synonym: default flow

Example

Matches in OpenFlow

Matches have priorities

Only the entry with the highest priority is selected

Disambiguation of similar match fields

Wildcard matching can be performed using bitmasks

Empty match fields match all flows
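A minimal sketch of priority-based lookup with bitmask wildcards follows. The table layout (dictionaries with `priority`, `match` and `actions`) and the field name `ipv4_dst` are assumptions for illustration and do not reproduce the OpenFlow wire format.

```python
import ipaddress


def ip(addr):
    """Convert dotted-quad notation to an integer for bitmask matching."""
    return int(ipaddress.IPv4Address(addr))


def matches(entry, packet):
    """A field matches if the masked packet value equals the masked entry value.
    Fields that are absent from the entry act as wildcards."""
    for field, (value, mask) in entry["match"].items():
        if field not in packet or (packet[field] & mask) != (value & mask):
            return False
    return True


def lookup(flow_table, packet):
    """Return the matching entry with the highest priority, or None."""
    candidates = [e for e in flow_table if matches(e, packet)]
    return max(candidates, key=lambda e: e["priority"], default=None)


flow_table = [
    # Empty match: matches every packet (table-miss entry, lowest priority).
    {"priority": 0, "match": {}, "actions": ["CONTROLLER"]},
    {"priority": 1, "match": {"ipv4_dst": (ip("10.0.0.0"), ip("255.0.0.0"))},
     "actions": ["output:1"]},
    {"priority": 2, "match": {"ipv4_dst": (ip("10.1.2.3"), ip("255.255.255.255"))},
     "actions": ["output:2"]},
]
```

A packet to 10.1.2.3 satisfies all three entries, and the priority decides that only the most specific one is applied.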

Actions in OpenFlow

Basic functionality is simple: „determine what happens to a packet“

In reality, OpenFlow makes a distinction between actions, action sets and more general instructions (linked to how the OpenFlow pipeline works)

Action

A concrete command to manipulate packets like „output on port“ or „push MPLS“

OpenFlow supports

Output: forwards a packet

Set-field: modifies a header field of a packet

Push-tag: pushes a new tag onto a packet

Pop-tag: removes a tag from a packet

Drop a packet: Implicitly defined when no output action is specified

Action set

Every packet has its own ActionSet while processed

Changes to the packet can be stored in the set / deleted from the set

Actual changes are applied when processing ends

Set is carried between flow tables (in one switch)

An action set contains at most one action of a specific type

Previous instances are overwritten

An action set may contain multiple set-field actions

Execution proceeds in a well-defined order

Modifications to the action set

write-actions: writing new actions to a set

clear-actions: Removing all actions from the set

(Check out the example in Flow Table)
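The overwrite semantics of the action set can be illustrated with a dictionary keyed by action type. This is a simplified sketch: the class `ActionSet` and the tuple encoding of actions are assumptions, and the well-defined execution order mentioned above is not modelled.

```python
class ActionSet:
    """Toy per-packet action set: at most one action per type,
    except set-field, which is stored once per header field."""

    def __init__(self):
        self._actions = {}

    def write_actions(self, actions):
        for action in actions:
            kind = action[0]
            # set-field is keyed by (kind, field) so several can coexist.
            key = (kind, action[1]) if kind == "set-field" else (kind,)
            self._actions[key] = action  # a previous instance is overwritten

    def clear_actions(self):
        self._actions.clear()

    def actions(self):
        return list(self._actions.values())
```

Writing `("output", 1)` in one table and `("output", 2)` in a later table leaves only the second output action in the set when processing ends.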

Instructions

Control how packets are processed in the switch

Each flow table entry is associated with a set of instructions

Change the packet immediately (apply-action)

Change the action set

Continue processing in another table (goto-table command)

OpenFlow Channel

Connects each switch to a controller

Provides the southbound API functionality of an OpenFlow switch

Management and configuration of switches by controllers

Signaling of events from switches to controllers

Monitoring of liveness, error states, statistics, …

Experimentation

Multiple channels to different controllers can be established

Three message types

Controller-to-Switch messages

Inject controller-generated packets (packet-out message)

Modify port properties or switch table entries (modify-state message)

Collect runtime information (read-state message)

Asynchronous messages

Packet-in message transfers control of packet to the controller

State changes signaled by switches

Symmetric messages

Handle connection setup and ensure correct operation

Hello: exchanged on connection startup (e.g., indicate supported versions)

Echo: verify liveness of controller-switch connections

Error: indicate error states of the controller or switch

Experimenter messages can offer additional functionality

Meter Tables

Meter table entry

Meters measure and control the rate of packets and bytes

They are managed in the meter table

Each meter has a unique meter identifier

Meters are invoked from flow table entries through the meter action

When invoked, each meter keeps track of the measured rate of packets

One of several meter bands is triggered when the measured rate exceeds that band's target rate

Meter bands

Packet processing by a meter band depends on its band type

DSCP remark: implements differentiated services

Drop: implements simple rate-limiting

Rate and burst determine when a band is executed

Band types may have additional type-specific arguments
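How a band is chosen can be sketched as follows. The sketch assumes that, when several bands are exceeded at once, the band with the highest target rate is applied; this tie-breaking rule is an assumption on top of the notes, and the band dictionaries and rate values are hypothetical. Burst handling is left out.

```python
def select_band(bands, measured_rate):
    """Return the band to apply for the measured rate, or None if no band triggers.

    Assumption: among all exceeded bands, the one with the highest target rate wins.
    """
    exceeded = [b for b in bands if measured_rate > b["rate"]]
    return max(exceeded, key=lambda b: b["rate"], default=None)


bands = [
    {"type": "dscp_remark", "rate": 1000},  # kbit/s, hypothetical values
    {"type": "drop", "rate": 2000},
]
```

With these values, traffic between 1000 and 2000 kbit/s is remarked, and traffic above 2000 kbit/s is dropped.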

The Power of Abstraction

Different Abstractions for Different Apps

Controller can provide different abstractions to network apps

Apps should not deal with low level / unnecessary details

Apps only have an abstract view of the network

Global view of controller can be different from abstract view of an app

Examples

“Big Switch Abstraction”

Consider a security application that manages access control lists

Controls the access of end systems E1, … En to services S1, … Sm

Details such as the exact position of an end system / service are not required for the application $\rightarrow$ Can be hidden in the abstraction

Network Slicing

Consider a network that has to be virtualized between multiple customers, e.g., Alice and Bob

Alice is only allowed to utilize S1, S2, and S3

Bob is only allowed to utilize S2, S3, S5, and S6

Both customers get an individual (full-meshed) view of the network

This is often called a network slice

🔴 SDN Challenges

Controller connectivity

SDN requires connectivity between controller and switches

Two different connectivity modes

Out-of-band

Dedicated (physical) control channel for messages between controller and switch

Cost intensive

In-band

Control messages use same channel as “normal” traffic (data)

Multiple applications can configure switch

Scalability

Logically centralized approach requires powerful controllers $\rightarrow$ Size / load of bigger networks can easily overload control plane 🤪

Important parameters with scalability implications

Number of remotely controlled switches

Number of end systems / flows in the network

Number of messages processed by controller

Communication delay between switches and controller

Possible solution: Distributed controllers

Consistency

Network view must remain consistent for applications

Synchronize network state information

Done via the westbound interface

Controller directly applies internal operations (inside partition) and notifies remote controllers of relevant changes of the network

E.g., C1 applies internal operations in Partition 1 and then notifies C2 of the change.

Apps can perform data plane operations on remote switches

Apps operate on a consistent network view

Operations are delegated to responsible SDN controller

Note: Control plane with multiple controllers is a distributed system

Desirable properties

Consistency

System responds identically to a request no matter which node receives the request (or does not respond at all)

Availability

System always responds to a request (although response may not be consistent or correct)

Partition tolerance

System continues to function even when specific messages are lost or parts of the network fail

CAP theorem

It is impossible to provide (atomic) consistency, availability and partition tolerance in a distributed system all at once

Only two of these can be satisfied at the same time

Data plane limitations

Flow Table Capacity

Flow Setup Latency

SDN Use Cases

Google B4

Defense4All

VMWare NSX

Tools

Controller Platforms

Virtual Switches

Core component in modern data centers

Used as “virtual” Top-of-Rack switches

Flow Programming Example

This example is taken from HW09.

Describe the functionality that is implemented by app_1.java

The application has proactive and reactive parts

Proactive: onConnect()

r1: Forward all packets whose IP destination address belongs to 28.0.0.0/8 to port 1 (i.e. network N1)

r2: Default rule, drops everything

r3: Send packets from N1 to controller

Reactive onPacketIn()

If a packet is sent to controller by r3, check whether the MAC address is valid. If valid, then forward to port 4 (i.e., network N2). Otherwise drop.

What port is connected to the Internet in the given example?

A reasonable assumption here is that N1 is the internal network (because the application can check source validity with MAC addresses) and N2 is the Internet (i.e., the answer is port 4)

Why is r2.PRIORITY(0) required?

r2.PRIORITY(0) is required, because r2 is the default rule in this case

Default rules usually have * match

0 is the lowest priority (lower than the default priority = 1)

Why is r1.PRIORITY(2) required?

r1.PRIORITY(2) is required to ensure that there are no rule overlaps

With default priority on r1 , r1 and r3 would overlap if a packet from N1 is sent with destination address in 28.0.0.0/8

Draw a sequence diagram illustrating the processing of the six consecutive packets P1 - P6 shown below. The diagram should contain the two networks (N1, N2), the switch (S) and the controller (C). Mark the arrows with send_rule, packet_in and send_packet

Solution

Network Function Virtualization (NFV)

Sun, 14 Mar 2021 00:00:00 +0000

Network Functions

Middleboxes and Network Functions

Middlebox

Device on the data path between a source and destination end system

Performs functions other than normal, standard functions of an IP router

Network function

Functionality of a middlebox

Executed on the data path

E.g. Network address translation (NAT), firewall, proxy, load balancing, intrusion detection, …

Network Address Translation (NAT)

Connects a realm with private addresses to an external realm with globally unique addresses

Problem: private addresses cannot be used for routing in the Internet

Solution: Exchange globally unique and private addresses when packets traverse network boundaries

$\rightarrow$ Clients in the private address range can share globally unique addresses
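The address exchange at the network boundary can be sketched as a translation table. The sketch assumes a port-translating NAT, which is one common way to let several clients share a single public address; the class name `Nat`, the port pool start and the example addresses are hypothetical.

```python
class Nat:
    """Toy port-translating NAT: many private hosts share one public address."""

    def __init__(self, public_ip, first_port=40000):
        self.public_ip = public_ip
        self.next_port = first_port  # assumed start of the public port pool
        self.out_map = {}            # (private ip, private port) -> public port
        self.in_map = {}             # public port -> (private ip, private port)

    def outbound(self, src_ip, src_port):
        """Rewrite the source of a packet leaving the private realm."""
        key = (src_ip, src_port)
        if key not in self.out_map:
            self.out_map[key] = self.next_port
            self.in_map[self.next_port] = key
            self.next_port += 1
        return self.public_ip, self.out_map[key]

    def inbound(self, dst_port):
        """Look up the private destination of a packet entering the private realm.
        Returns None if no mapping exists."""
        return self.in_map.get(dst_port)
```

Two clients that both use source port 5555 are mapped to different public ports, so replies can be delivered to the right client.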

Example

Firewall

Monitors and controls incoming and outgoing traffic

Establishes barrier between trusted and untrusted networks

Forwards or drops packets based on pre-defined rule set

Variants, e.g.

Shallow vs. deep packet inspection

Shallow: decisions are based on header fields only (e.g., IP and TCP protocol information)

Deep: inspects content of higher layer protocols (e.g., detection of malware traffic in application layer protocols)

Stateful vs. stateless processing

Stateless: every packet is inspected independently of other packets

Stateful: keeps state between packets (e.g., for every TCP connection to detect invalid sequence numbers)

Traditional Middlebox Deployment

Example: Caching

Single content provider

Multiple content providers

Place multiple middleboxes at different locations in the network

🔴 Problems

Middleboxes are often built as proprietary hardware

Fast, but very inflexible

Usually closed source $\rightarrow$ black box for infrastructure operator

Static wiring

Hard to setup / tear down

Hard to move

Hard to upgrade $\rightarrow$ introduce new or bigger boxes

Network operators have to manage many different vendor-specific boxes

Network Function Virtualization (NFV)

💡Mimic ideas of cloud computing

Implement network functions in software

Use virtualization technology to decouple network functions from hardware

Consolidate functionality on high volume servers, switches and storage

Network services combine multiple network functions

End-to-end behavior of a network service is the combination of the individual network functions

👍 Benefits

Resource sharing: Single platform for different applications and users

Agility and flexibility: Services can scale to address changing demands

Rapid deployment and innovation cycles: Providers can easily trial and evolve services

Reduced costs

Consider the caching example above: Networks provide infrastructure for executing software-based network functions (NFV Infrastructure, NFVI)

Main Building Blocks of NFV

Virtualized Network Functions (VNFs)

The actual network functions provided in software

Independent of its deployment (e.g., hardware)

NFV Management and Orchestration (MANO)

Lifecycle management of VNFs and network services

Requests resources for VNFs

NFV Infrastructure (NFVI)

Provides hardware, software and network resources for VNFs

Decouples VNFs from underlying hardware

Can contain multiple Points of Presence (PoP)

Small data centers, located at different points in the infrastructure

SDN is used to transparently reroute flows to PoPs

Could also be done with MPLS or other technologies

SDN and NFV complement each other very well

Simple deployment example

Virtualization

Provides a software abstraction layer between

Hardware and

Operating system and applications running in a virtual machine

$\rightarrow$ Offers a standardized platform for applications

The abstraction layer is referred to as hypervisor

“Resource broker” between hardware and virtual machines

Translates I/O from virtual machines to physical server devices

Allows multiple operating systems to coexist on a single physical host

Allows live migration of virtual machines to other hosts

Type 1 Hypervisor

Runs directly on hardware

High performance

Strong isolation between virtual machines

Synchronizes the access of virtual machines to the hardware

Type 2 Hypervisor

Runs on top of a host operating system

Hypervisor is executed as an application in user space

Virtual machines provide virtual hardware to guest operating systems

Interaction with virtual hardware is directed to physical devices through a virtual machine driver or the host operating system

Container-Based Virtualization

Single kernel provides multiple instances (containers) of same host operating system

No hypervisor involved

Isolation of containers is enforced by host operating system kernel

Each container has its own view of the operating system

Applications in containers are executed by the host operating system

$\rightarrow$ Applications depend on host operating system

Kernel synchronizes access of containers to the hardware

Service Function Chaining (SFC)

Ordered set of network functions

Specifies ordering constraints that must be applied to flows

Enables the creation of composite network services

Transparent to end systems

Examples

Firewall $\rightarrow$ authentication server

Load balancer $\rightarrow$ cache

…

Example: Advanced Caching Scenario

Place additional firewall, authentication and cache on the data path

Sketch

Required VNFs are instantiated at appropriate PoPs

Service function chain is established (flow table entries in the data plane)

$\rightarrow$ Flow table entries enforce correct order of VNF traversal

MPLS-based Service Function Chaining

Service classifiers select appropriate service function chains (step 1)

Select traffic to be processed in the chain

Attach a stack of MPLS labels to packets to determine their path through the chain

Service function forwarders deliver packets to network functions

The service function indicated by the topmost MPLS label is applied

The topmost label is removed from the stack afterwards

(step 2 - 4)

Normal traffic flow resumes when the MPLS stack is empty (step 5)
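The label-stack mechanism can be sketched in a few lines. This is a simplification under assumptions: labels are strings naming the service function (real MPLS labels are numeric), packets are dictionaries, and the function names `classify` and `forward` are hypothetical.

```python
def classify(packet, chains):
    """Service classifier: attach the label stack of the first matching chain (step 1)."""
    for predicate, labels in chains:
        if predicate(packet):
            return dict(packet, labels=list(labels))  # index 0 is the top of the stack
    return dict(packet, labels=[])


def forward(packet, functions):
    """Service function forwarder: apply the function named by the topmost label,
    then pop that label, until the stack is empty (steps 2 to 5)."""
    visited = []
    while packet["labels"]:
        top = packet["labels"].pop(0)
        packet = functions[top](packet)
        visited.append(top)
    return packet, visited


chains = [(lambda p: p["dst_port"] == 80, ["firewall", "cache"])]
functions = {
    "firewall": lambda p: dict(p, checked=True),
    "cache": lambda p: p,
}
pkt, visited = forward(classify({"dst_port": 80}, chains), functions)
```

The order of the labels pushed by the classifier fixes the order in which the network functions are traversed.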

🔴 Challenges

Security

VNF performance

VNF placement

Reliability

Testing and debugging

Carrier grade requirements

Coexistence with legacy networks

…

HMM and Wonham Filter

Wed, 29 Jun 2022 00:00:00 +0000

The Hidden Markov Model (HMM) is a stochastic model in which a system is modeled by a Markov chain with unobserved states.

Modeling the system as a Markov chain means that it moves randomly from one state to another, where the transition probabilities depend only on the current state and not on the states visited before.

An HMM consists of

System model / transition probabilities / transition matrix $\mathbf{A}$

Measurement model / emission probabilities / measurement matrix $\mathbf{B}$

State space; state probabilities $\xi_{k}^{\boldsymbol{x}}$

Measurements; emission probabilities $\xi_{k}^{\boldsymbol{y}}$

Initial state $x_0$ or initial state probability $\xi_{0}^{\boldsymbol{x}}$

Example (exercise sheet 4.2)

State space
$$ \begin{aligned} S &=\{\text { sunny day }\} \\ R &=\{\text { rainy day }\} \\ N &=\{\text { foggy day }\} \end{aligned} $$

(The symbols follow the German terms: $S$ = sonnig, $R$ = regnerisch, $N$ = neblig.)

State vector
$$ \xi_{k}^{\boldsymbol{x}}=\left[\begin{array}{l} \mathrm{P}\left(\boldsymbol{x}_{k}=S\right) \\ \mathrm{P}\left(\boldsymbol{x}_{k}=R\right) \\ \mathrm{P}\left(\boldsymbol{x}_{k}=N\right) \end{array}\right] $$

Transition matrix
$$ \mathbf{A}=\left[\begin{array}{lll} 0.7 & 0.2 & 0.1 \\ 0.2 & 0.6 & 0.2 \\ 0.4 & 0.3 & 0.3 \end{array}\right] $$

Measured values ($d$ = dreckig, $s$ = sauber)
$$ \begin{array}{l} d=\{\text { dirty shoes }\} \\ s=\{\text { clean shoes }\} \end{array} $$

Measurement vector
$$ \underline{\xi}_{k}^{\boldsymbol{y}}=\left[\begin{array}{l} \mathrm{P}\left(\boldsymbol{y}_{k}=d\right) \\ \mathrm{P}\left(\boldsymbol{y}_{k}=s\right) \end{array}\right] $$

Measurement matrix (rows: states $S$, $R$, $N$; columns: measurements $d$, $s$)
$$ \mathbf{B}=\left[\begin{array}{ll} 0.1 & 0.9 \\ 0.8 & 0.2 \\ 0.4 & 0.6 \end{array}\right] $$

Initial state probability $\xi_{0}^{\boldsymbol{x}}$ and initial state $x_0$
$$ \underline{\xi}_{0}^{\boldsymbol{x}}=\left[\begin{array}{l} 1 \\ 0 \\ 0 \end{array}\right] ; \quad x_{0}=S $$

Model as a state diagram with transition probabilities

Wonham Filter

The Wonham filter is a recursive filter for state estimation in discrete-valued systems.

The Wonham filter consists of two phases

Prediction
$$ \underline{\xi}_{k \mid 1: k-1}^{x}=\mathbf{A}_{k}^{\top} \underline{\xi}_{k-1\mid1: k-1}^{x} $$

$\mathbf{A}_k$ : transition matrix

$\underline{\xi}_{k-1\mid1: k-1}^{x}$ : previous state estimate

Filtering

For measurement $y_k = m$:
$$ \underline{\xi}_{k \mid 1: k}^{x} =\frac{\mathbf{B}(:, m) \odot \xi_{k \mid 1: k-1}^{x}}{\mathbb{1}_{N}^{T} \operatorname{diag}(\mathbf{B}(:, m)) \cdot \xi_{k \mid 1: k-1}^{x}} =\frac{\mathbf{B}(:, m) \odot \xi_{k \mid 1: k-1}^{x}}{\mathbf{B}(:, m)^\top \cdot \xi_{k \mid 1: k-1}^{x}} $$

(For more on the Wonham filter, see here)

Example (continued)

Time step $k=1$:
$$ \begin{array}{l} \underline{\xi}_{1}^{p}=\mathbf{A}^{\top} \underline{\xi}_{0}^{\boldsymbol{x}}=\left[\begin{array}{l} 0.7 \\ 0.2 \\ 0.1 \end{array}\right] \\ \underline{\xi}_{1}^{e}=\frac{\mathbf{B}(:, 1) \odot \underline{\xi}_{1}^{p}}{\mathbf{B}(:, 1)^{\top} \underline{\xi}_{1}^{p}}=\frac{\left[\begin{array}{l} 0.1 \\ 0.8 \\ 0.4 \end{array}\right] \odot\left[\begin{array}{l} 0.7 \\ 0.2 \\ 0.1 \end{array}\right]}{\left[\begin{array}{lll} 0.1 & 0.8 & 0.4 \end{array}\right]\left[\begin{array}{l} 0.7 \\ 0.2 \\ 0.1 \end{array}\right]}=\frac{\left[\begin{array}{l} 0.07 \\ 0.16 \\ 0.04 \end{array}\right]}{0.27}=\left[\begin{array}{l} 0.25926 \\ 0.59259 \\ 0.14815 \end{array}\right] \end{array} $$
$P(\boldsymbol{x}_1 = R) = 0.59259$ is the largest entry of $\underline{\xi}_{1}^{e}$. $\Rightarrow$ The estimate points to a rainy day.

Time step $k=2$:
$$ \begin{aligned} \underline{\xi}_{2}^{p} &=\mathbf{A}^{\top} \underline{\xi}_{1}^{e}=\left[\begin{array}{l} 0.35926 \\ 0.45185 \\ 0.18889 \end{array}\right] \\ \underline{\xi}_{2}^{e} &=\frac{\mathbf{B}(:, 1) \odot \xi_{2}^{p}}{\mathbf{B}(:, 1)^{\top} \xi_{2}^{p}}=\left[\begin{array}{l} 0.07596 \\ 0.76429 \\ 0.15975 \end{array}\right] \end{aligned} $$
$\Rightarrow$ The estimate points to a rainy day.

Time step $k=3$:
$$ \underline{\xi}_{3}^{p}=\mathbf{A}^{\top} \underline{\xi}_{2}^{e}=\left[\begin{array}{l} 0.26993 \\ 0.52169 \\ 0.20838 \end{array}\right] $$ $$ \xi_{3}^{e}=\frac{\mathbf{B}(:, 2) \odot \xi_{3}^{p}}{\mathbf{B}(:, 2)^{\top} \xi_{3}^{p}}=\left[\begin{array}{l} 0.51437 \\ 0.22091 \\ 0.26472 \end{array}\right] $$
$\Rightarrow$ The estimate points to a sunny day.

Time step $k=4$:
$$ \begin{array}{l} \underline{\xi}_{4}^{p}=\mathbf{A}^{\top} \underline{\xi}_{3}^{e}=\left[\begin{array}{l} 0.51013 \\ 0.31484 \\ 0.17504 \end{array}\right]\\ \xi_{4}^{e}=\frac{\mathbf{B}(:, 2) \odot \xi_{4}^{p}}{\mathbf{B}(:, 2)^{\top} \xi_{4}^{p}}=\left[\begin{array}{l} 0.73212 \\ 0.10041 \\ 0.16747 \end{array}\right] \end{array} $$
$\Rightarrow$ The estimate points to a sunny day.
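The four time steps can be reproduced with a short, dependency-free Python sketch of the two filter phases. The function names are hypothetical, column indices are zero-based (column 0 of $\mathbf{B}$ is $d$, column 1 is $s$), and the measurement sequence $d, d, s, s$ is read off from the columns $\mathbf{B}(:,1), \mathbf{B}(:,1), \mathbf{B}(:,2), \mathbf{B}(:,2)$ used in the example.

```python
A = [[0.7, 0.2, 0.1],
     [0.2, 0.6, 0.2],
     [0.4, 0.3, 0.3]]
B = [[0.1, 0.9],
     [0.8, 0.2],
     [0.4, 0.6]]


def predict(A, xi):
    """Prediction step: xi_p = A^T xi."""
    n = len(xi)
    return [sum(A[i][j] * xi[i] for i in range(n)) for j in range(n)]


def update(B, xi_p, m):
    """Filter step: elementwise product with column m of B, then normalisation."""
    unnormalised = [B[i][m] * xi_p[i] for i in range(len(xi_p))]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


def wonham(A, B, xi0, measurements):
    """Run prediction and filtering for each measurement; return all estimates."""
    xi, estimates = xi0, []
    for m in measurements:
        xi = update(B, predict(A, xi), m)
        estimates.append(xi)
    return estimates


estimates = wonham(A, B, [1.0, 0.0, 0.0], [0, 0, 1, 1])  # d, d, s, s
for k, xi in enumerate(estimates, start=1):
    print(k, [round(p, 5) for p in xi])
```

The largest entry of each estimate gives the most probable state at that time step, which matches the rainy, rainy, sunny, sunny sequence above.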

Example (continued)

Solution:

Internet Congestion Control

Sun, 14 Mar 2021 00:00:00 +0000

TCP congestion control summary

Focus on

congestion control in the context of the Internet and its transport protocol TCP

implicit window-based congestion control unless explicitly stated differently

Basics

Shared (Network) Resources

General problem: Multiple users use same resource

E.g., multiple video streams use same network link

🎯 High level objective with respect to networks

Provide good utilization of network resources

Provide acceptable performance for users

Provide fairness among users / data streams

Mechanisms that deal with shared resources

Scheduling

Medium access control

Congestion control

…

Congestion Control Problem

Adjusts load introduced to shared resource in order to avoid overload situations

Utilizes feedback information (implicit or explicit)

“Critical” Situations

Example 1

Router concurrently receives two packets from different input interfaces which are directed towards the same output interface. $\rightarrow$ Only one of these packets can be sent at a time.

What to do with the other packet?

Buffer or

Drop

Example 2

Router has interfaces with different data rates

Input interface has high data rate

Output interface has low data rate

Two successive packets from the same or different senders arrive at the input interface.

What to do with the second packet? The output interface is still busy sending the first packet while the second arrives.

Buffer or

Drop

Buffer

The terms buffer and queue are used interchangeably.

Routers need buffers (queues) to cope with temporary traffic bursts

Packets that can NOT be transmitted immediately are placed in the buffer

If buffer is filled up, packets need to be dropped 🤪

Buffers add latency

Typically implemented as FIFO queues

Router can only start sending a queued packet after all packets in front of it have been sent

Five green packets introduce queueing delay for blue packet

End-to-end latency of a packet includes

Propagation delay

Transmission delay

Queueing delay
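The three delay components can be added up for a single hop as in the following sketch. It assumes that all queued packets have the same size and uses a signal speed of 2e8 m/s, a common approximation; the function names and the example numbers are hypothetical.

```python
def transmission_delay(packet_bits, rate_bps):
    """Time to put all bits of one packet onto the link."""
    return packet_bits / rate_bps


def propagation_delay(distance_m, speed_mps=2e8):
    """Time for a signal to travel the link; 2e8 m/s is an assumed typical value."""
    return distance_m / speed_mps


def queueing_delay(packets_ahead, packet_bits, rate_bps):
    """FIFO queue: every packet in front must be fully transmitted first."""
    return packets_ahead * packet_bits / rate_bps


def one_hop_latency(packet_bits, rate_bps, distance_m, packets_ahead):
    return (propagation_delay(distance_m)
            + transmission_delay(packet_bits, rate_bps)
            + queueing_delay(packets_ahead, packet_bits, rate_bps))


# 1500-byte packet, 10 Mbit/s link, 100 km, five packets already queued
latency = one_hop_latency(12_000, 10e6, 100_000, 5)
```

In this example the queueing delay (6 ms) is five times the transmission delay (1.2 ms) and dominates the total of about 7.7 ms.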

General Problem

Sender wants to send data through the network to the receiver

On every network path, the link with the lowest available data rate limits the maximum data rate that can be achieved end-to-end

This link is called bottleneck link

The maximum data rate of a link is called link capacity

🔴 Problem: sender can send more data than bottleneck link can handle

Sender can overload bottleneck link! 🤪

$\rightarrow$ Sender has to adjust its sending rate

How to find the “optimal” sending rate?

Congestion Control vs. Flow Control

Flow control

Bottleneck is located at receiver side

Receiver cannot cope with desired data rate of sender

Congestion control

Bottleneck is located in the network

Bottleneck link does not provide sufficient available data rate

Leads to congested router / intermediate system

Congestion Collapse

Throughput vs. Goodput

Throughput: Amount of network layer data delivered in a time interval

E.g., 1 Gbit/s

Counts everything including retransmissions

$\rightarrow$ the aggregated amount of data that flows through the router/link

Goodput: „Application-level“ throughput

Amount of application data delivered in a time interval

Retransmissions at the transport layer do NOT count

Packets dropped in transmission do NOT count

Observation

Load is small (below network capacity) $\rightarrow$ network keeps up with load

Load reaches network capacity (knee)

Goodput stops increasing, buffers build up, end-to-end latency increases

$\rightarrow$ Network is congested!

Load increases beyond cliff

Packets start to be dropped, goodput drastically decreases $\rightarrow$ Congestion collapse

Load refers to aggregated network layer traffic that is introduced by all active data streams. This includes TCP retransmissions.

Network capacity refers to maximum load that network can handle.

How Could Congestion Collapse Happen?

Congestion due to

Single TCP connection

Exceeds available capacity at bottleneck link

Prerequisite: flow control window is large enough

Multiple TCP connections

Aggregated load exceeds available capacity

Single TCP connection has no knowledge about other TCP connections

Knee and Cliff

Keep traffic load around knee

Good utilization of network capacity

Low latencies

Stable goodput

Prevent traffic from going over the cliff

High latencies

High packet losses

Highly decreased goodput

Challenge of Congestion Control

Challenge: Find “optimal” sending rate

Usually, sender has NO global view of the network

NO trivial answer

Lots of algorithms for congestion control

Types of Congestion Control

Window-based Congestion Control

Congestion Control Window ($\text{CWnd}$)

Determines maximum number of unacknowledged packets allowed per TCP connection

Assumes that packets are acknowledged by receiver

Basic window mechanism is similar to sliding window as applied for flow control purposes

Adjusts sending rate of source to bottleneck capacity $\rightarrow$ self-clocking

Rate-based Congestion Control

Controls sending rate, no congestion control window

Implemented by timers that determine inter packet intervals

High precision required

🔴 Problem: NO comparable cut-off mechanism, such as missing acknowledgements

Sender keeps sending even in case of congestion

Needed in case no acknowledgements are used

E.g., UDP

Implicit vs. Explicit Congestion Signals

Implicit

Without dedicated support of the network

Implicit congestion signals

Timeout of retransmission timer

Receipt of duplicate acknowledgements

Round-Trip Time (RTT) variation

Explicit

Nodes inside the network indicate congestion

On the internet

Usually NO support for explicit congestion signals

Congestion control must work with implicit congestion signals only

End-to-end vs. Hop-by-hop

End-to-end

Congestion control operates on an end system basis

Nodes inside the network are NOT involved

Hop-by-hop

Congestion control operates on a per hop basis

Nodes inside the network are actively involved

Improved Versions of TCP

🎯 Goal

Estimate available network capacity in order to avoid overload situations

Provide feedback (congestion signal)

Limit the traffic introduced into the network accordingly

Apply congestion control

TCP Tahoe

TCP Recap

Connection establishment

3 way handshake $\rightarrow$ Full duplex connection

Connection termination

Separately for each direction of transmission

4 way handshake

Data transfer

Byte-oriented sequence numbers

Go-back-N

Positive cumulative acknowledgements

Timeout

Flow control (sliding window)

TCP Tahoe in a Nutshell

Mechanisms used for congestion control

Slow start

Timeout

Congestion avoidance

Fast retransmit

Congestion signal

Retransmission timeout or

Receipt of duplicate acknowledgements (𝑑𝑢𝑝𝑎𝑐𝑘)

$\rightarrow$ In case of congestion signal: slow start

The following must always be valid
$$ \text { LastByteSent }-\text { LastByteAcked } \leq \text { min\{CWnd, RcvWindow\} } $$

$\text{CWnd}$: Congestion Control Window

$\text{RcvWindow}$: Flow Control Window

Variables

$\text{CWnd}$: Congestion window

$\text{SSThresh}$: Slow Start Threshold

Value of $\text{CWnd}$ at which TCP instance switches from slow start to congestion avoidance

Basic approach: AIMD (additive increase, multiplicative decrease)

Additive increase of $\text{CWnd}$ after receipt of an acknowledgement

Multiplicative decrease of $\text{CWnd}$ if packet loss is assumed (congestion signal)

Initial values

$\text{CWnd}=1 \text{ MSS}$

$\text{MSS}$: Maximum Segment Size

Since RFC 2581: Initial Window $\text{IW} \leq 2 \cdot \text{MSS}$ and $\text{CWnd}=\text{IW}$

$\text{SSThresh}$ initially set to “infinite”

Number of duplicate ACKs (congestion signal): 3

Algorithm

$\text{CWnd} < \text{SSThresh}$ and ACKs are being received: slow start

Exponential increase of congestion window

Upon receipt of an ACK: $\text{CWnd } \text{+= } 1$

$\text{CWnd} \geq \text{SSThresh}$ and ACKs are being received: congestion avoidance

Linear increase of congestion window

Upon receipt of an ACK : $\text{CWnd } \text{+= } 1/\text{CWnd}$

Congestion signal: timeout or 3 duplicate acknowledgements: slow start

Congestion is assumed

Set
$$ \text { SSThresh }=\max (\text { FlightSize } / 2, 2 * M S S) $$

$\text { FlightSize }$: amount of data that has been sent but not yet acknowledged

This amount is currently in transit

Might also be limited due to flow control

Set $\text{CWnd}=1 \text{ MSS}$ or $\text{CWnd}=\text{IW}$

On 3 duplicate ACKs: retransmission of potentially lost TCP segment

Example

Evolution of Congestion Window

Assumptions

No transmission errors, no packet losses

All TCP segments and acknowledgements are transmitted/received within single RTT

Flight-size equals CWnd

Congestion signal occurs during RTT

Initialize $\text{CWnd} = 1 \text{ MSS}$

The $\text{CWnd}$ grows in “slow start” mode. When $\text{CWnd} = 16$, a timeout error occurs.

This is a congestion signal. So we go back to “slow start”

Set $\text { SSThresh }=\max (\text { FlightSize } / 2, 2 * M S S)$

In this case, $\text{FlightSize} = 16$.

So $\text { SSThresh }=\max (16 / 2, 2) \text{ MSS} = 8 \text{ MSS}$

Set $\text{CWnd}=1 \text{ MSS}$ or $\text{CWnd}=\text{IW}$

Slow start runs again until $\text{CWnd}$ reaches $\text{SSThresh} = 8 \text{ MSS}$. Now $\text{CWnd} \geq \text{SSThresh}$ $\rightarrow$ Switch to “congestion avoidance”!

When $\text{CWnd} = 12$, a timeout error occurs.

We just perform the same handling as above.
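The example above can be replayed with a small per-RTT simulation. This is only a sketch under the stated assumptions (FlightSize equals CWnd, one window per RTT). The function name `tahoe_trace` and the idea of triggering a timeout at given CWnd values are hypothetical choices for illustration, and capping slow start at SSThresh is an added simplification.

```python
def tahoe_trace(timeout_values, rounds):
    """Return CWnd (in MSS) at the start of each RTT for a simplified TCP Tahoe.

    timeout_values: CWnd values at which a timeout is assumed to occur,
                    consumed in order (hypothetical trigger for this sketch).
    """
    cwnd = 1
    ssthresh = float("inf")          # "infinite" initial threshold
    pending = list(timeout_values)
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        if pending and cwnd == pending[0]:
            pending.pop(0)
            # Congestion signal: FlightSize is assumed to equal CWnd.
            ssthresh = max(cwnd // 2, 2)
            cwnd = 1
        elif cwnd < ssthresh:
            # Slow start: +1 per ACK doubles the window every RTT.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: +1/CWnd per ACK, i.e. about +1 per RTT.
            cwnd += 1
    return trace


trace = tahoe_trace([16, 12], rounds=16)
```

With timeouts at CWnd = 16 and CWnd = 12, the trace shows slow start up to 16, a reset to 1, slow start up to the new threshold of 8, and linear growth up to 12.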

Fast Retransmit

Assume the following scenario

(Note: Not every segment that is received out of order indicates congestion. E.g., only one segment is dropped, otherwise data transfer is ok)

What would happen? Wait until retransmission timer expires, then retransmission

Waiting time is longer than a round trip time (RTT) $\rightarrow$ It will take a long time!🤪

Our goal is faster reaction

Retransmission after receipt of a pre-defined number of duplicate ACKs

$\rightarrow$ Much faster than waiting for expiration of retransmission timer

Example: suppose the pre-defined number of duplicate ACKs is 3

TCP Reno

Differentiation between

Major congestion signal: Timeout of retransmission timer

Minor congestion signal: Receipt of duplicate ACKs

In case of a major congestion signal

Reset to slow start as in TCP Tahoe

In case of minor congestion signal

No reset to slow start

Receipt of duplicate ACK implies successful delivery of new segments, i.e., packets have left the network

New packets can also be injected in the network

In addition to the mechanisms of TCP Tahoe: fast recovery

Controls sending of new segments until receipt of a non-duplicate ACK

Fast Recovery

Starting condition: Receipt of a specified number of duplicate ACKs

Usually set to 3 duplicate ACKs

💡 Idea: New segments should continue to be sent, even if packet loss is not yet recovered

Self-clocking continues

Reaction

Reduce network load by halving the congestion window

Retransmit first missing segment (fast retransmit)

Consider continuous activity, i.e., further received segments while no new data is acknowledged

Increase congestion window by number of duplicate ACKs (usually 3)

Further increase after receipt of each additional duplicate ACK

Receipt of new ACK (new data is acknowledged)

Set congestion window to its value at the beginning of fast recovery

In Congestion Avoidance

If timeout: slow start

Set $\text { SSThresh }=\max (\text { FlightSize } / 2, 2 * M S S)$

$\text{CWnd}=1$

If 3 duplicate ACKs: fast recovery

Retransmission of oldest unacknowledged segment (fast retransmit)

Set $\text { SSThresh }=\max (\text { FlightSize } / 2, 2 * M S S)$

Set $\text{CWnd} = \text{SSThresh} + 3\text{MSS}$

Receipt of additional duplicate ACK

$\text{CWnd } \text{+= } 1$

Send new, i.e., not yet sent segments (if available)

Receipt of a “new” ACK: congestion avoidance

$\text{CWnd} = \text{SSThresh}$
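The Reno rules above can be summarized as a small event handler. This is a minimal sketch, not a real TCP implementation: the class `RenoSender` and its method names are hypothetical, CWnd is counted in MSS, and the actual (re)transmission of segments is left out.

```python
class RenoSender:
    """Tracks CWnd and SSThresh (both in MSS) for the Reno rules listed above."""

    def __init__(self):
        self.cwnd = 1.0
        self.ssthresh = float("inf")
        self.dupacks = 0
        self.in_fast_recovery = False

    def on_new_ack(self):
        if self.in_fast_recovery:
            # "New" ACK ends fast recovery: continue in congestion avoidance.
            self.cwnd = self.ssthresh
            self.in_fast_recovery = False
        elif self.cwnd < self.ssthresh:
            self.cwnd += 1.0                # slow start
        else:
            self.cwnd += 1.0 / self.cwnd    # congestion avoidance
        self.dupacks = 0

    def on_dup_ack(self, flight_size):
        self.dupacks += 1
        if self.in_fast_recovery:
            self.cwnd += 1.0                # each further duplicate ACK
        elif self.dupacks == 3:
            # Minor congestion signal: fast retransmit would happen here.
            self.ssthresh = max(flight_size / 2, 2.0)
            self.cwnd = self.ssthresh + 3.0
            self.in_fast_recovery = True

    def on_timeout(self, flight_size):
        # Major congestion signal: back to slow start as in Tahoe.
        self.ssthresh = max(flight_size / 2, 2.0)
        self.cwnd = 1.0
        self.dupacks = 0
        self.in_fast_recovery = False
```

Keeping the duplicate-ACK counter separate from the fast-recovery flag makes the difference between the minor signal (3 duplicate ACKs) and the major signal (timeout) explicit.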

Evolution of Congestion Window with TCP Reno

Analysis of Improvements

After observing congestion collapses, the following mechanisms (among others) were introduced to the original TCP (RFC 793)

Slow-Start

Round-trip time variance estimation

Exponential retransmission timer backoff

Dynamic window sizing on congestion

More aggressive receiver acknowledgement policy

🎯 Goal: Enforce packet conservation in order to achieve network stability

Self Clocking

Recap: TCP uses window-based flow control

Basic assumption

Complete flow control window in transit

In TCP: receive window $𝑅𝑐𝑣𝑊𝑖𝑛𝑑𝑜𝑤$

Bottleneck link with low data rate on the path to the receiver

Basic scenario
![截屏2021-03-16 10.33.11](https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/截屏2021-03-16%2010.33.11.png)

Conservation of Packets

🎯 Goal: get TCP connection in equilibrium

Full window of data in transit

“Conservative”: NO new segment is injected into the network before an old segment leaves the network

$\rightarrow$ A system with this property should be robust in the face of congestion

Three ways for packet conservation to fail

Connection does not get to equilibrium

Sender injects new packet before an old packet has exited

Resource limits along the path hinder equilibrium

Slow Start

🎯 Goal: bring TCP connection into equilibrium

Connection has just started or

Restart after assumption of (major) congestion

🔴 Problem: get the „clock“ started (At the beginning of a connection there is no „clock“ available.)

💡 Basic idea (per TCP connection)

Do not send complete receive window (flow control) immediately

Gradually increase number of segments that can be sent without receiving an ACK

Increase the amount of data that can be in transit (“in-flight”)

Approach

Apply congestion window, in addition to receive window

Minimum of congestion and receive window can be sent

Congestion Window: $𝐶𝑊𝑛𝑑$ $[𝑀𝑆𝑆]$

Receive Window: $Rcv𝑊𝑖𝑛𝑑𝑜𝑤$ $[𝐵𝑦𝑡𝑒]$

New connection or congestion assumed

$\rightarrow$ Reset of congestion window: $𝐶𝑊𝑛𝑑 = 1$

Incoming ACK for sent (not retransmitted) segment

Increase congestion window by one: $𝐶𝑊𝑛𝑑 = 𝐶𝑊𝑛𝑑 + 1$

$\rightarrow$ Leads to exponential growth of 𝐶𝑊𝑛𝑑

Sending rate is at most twice as high as the bottleneck capacity!

Retransmission Timer

Assumption: Complete receive window in transit

Alternative 1: ACK received

A segment was delivered and, thus, exited the network $\rightarrow$ conservation of packets is fulfilled

Alternative 2: retransmission timer expired

Segment is dropped in the network: conservation of packets is fulfilled

Segment is delayed but not dropped: conservation of packets NOT fulfilled

$\rightarrow$ Too short retransmission timeout causes connection to leave equilibrium

Good estimation of Round Trip Time (RTT) essential for a good timer value!

Value too small: unnecessary retransmissions

Value too large: slow reaction to packet losses

Estimation of Round Trip Time

Timer-based RTT measurement

Timer resolution varies (up to 500 ms)

Requirements regarding timer resolutions vary

SampleRTT

Time interval between transmission of a segment and reception of corresponding acknowledgement

Single measurement

Retransmissions are ignored

EstimatedRTT

Smoothed value across a number of measurements

Observation: measured values can fluctuate heavily

Apply exponential weighted moving average (EWMA)

Influence of each value becomes gradually less as it ages

Unbiased estimator for average value

$$ EstimatedRTT=(1-\alpha) * EstimatedRTT+\alpha * SampleRTT $$
(Typical value for $\alpha$: 0.125)

Derive value for retransmission timeout (RTO)
$$ 𝑅𝑇𝑂 = \beta ∗ 𝐸𝑠𝑡𝑖𝑚𝑎𝑡𝑒𝑑𝑅𝑇𝑇 $$

Recommended value for $\beta$: 2

Estimation of Deviation

🎯 Goal: Avoid the observed occasional retransmissions

Observation: Variation of RTT can greatly increase in more heavily loaded networks

Consequently, $EstimatedRTT$ requires higher “safety margin”

Estimation error: difference between measured/sampled and estimated RTT

Computation
$$ \begin{array}{l} &Deviation =(1-\gamma) * Deviation+\gamma * \left|SampleRTT- EstimatedRTT \right| \\\\ &RTO =EstimatedRTT +\beta * Deviation \end{array} $$

Recommended values: $\alpha = 0.125, \beta = 4, \gamma = 0.25$
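The two EWMA updates and the resulting RTO can be sketched as follows. The class name `RttEstimator` is hypothetical. Two points are assumptions not stated in the notes: the deviation is updated before the estimate (so it uses the old EstimatedRTT), and the deviation is initialized to half of the first sample.

```python
class RttEstimator:
    """EWMA-based estimate of RTT and its deviation, all times in seconds."""

    def __init__(self, first_sample, alpha=0.125, beta=4.0, gamma=0.25):
        self.alpha = alpha
        self.beta = beta
        self.gamma = gamma
        self.estimated_rtt = first_sample
        self.deviation = first_sample / 2   # assumed initial value

    def update(self, sample_rtt):
        """Feed one SampleRTT (not from a retransmission) and return the new RTO."""
        error = abs(sample_rtt - self.estimated_rtt)
        self.deviation = (1 - self.gamma) * self.deviation + self.gamma * error
        self.estimated_rtt = (
            (1 - self.alpha) * self.estimated_rtt + self.alpha * sample_rtt
        )
        return self.estimated_rtt + self.beta * self.deviation


estimator = RttEstimator(first_sample=0.1)
rto_after_stable_sample = estimator.update(0.1)   # deviation decays
rto_after_spike = estimator.update(0.3)           # deviation grows, RTO rises
```

A single outlier sample raises the RTO mainly through the deviation term, which is the "safety margin" described above.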

Multiple Retransmissions

How large should the time interval be between two subsequent retransmissions of the same segment?

Approach: Exponential backoff

After each new retransmission RTO doubles:
$$ 𝑅𝑇𝑂 = 2 ∗ 𝑅𝑇𝑂 $$

A maximum value should be applied. It should be at least 60 seconds

To which segment does the received ACK belong – to the original segment or to the retransmission?

Approach: Karn‘s Algorithm

ACKs for retransmitted segments are not included into the calculation of $EstimatedRTT$ and $Deviation$

Backoff is calculated as before

Timeout value is set to the value calculated by backoff algorithm until an ACK to a non-retransmitted segment is received

Then original algorithm is reactivated
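Exponential backoff and Karn's rule for RTT samples can be sketched in a few lines. The function names are hypothetical, and using exactly 60 seconds as the upper bound is an assumption made for this illustration.

```python
MAX_RTO = 60.0  # seconds; assumed upper bound for this sketch


def backed_off_rto(base_rto, retransmissions, max_rto=MAX_RTO):
    """RTO after the given number of retransmissions of the same segment."""
    return min(base_rto * 2 ** retransmissions, max_rto)


def usable_rtt_sample(was_retransmitted, measured_rtt):
    """Karn's algorithm: ignore measurements that belong to retransmitted segments."""
    return None if was_retransmitted else measured_rtt
```

For example, a base RTO of 1 s grows to 2 s, 4 s, 8 s over three retransmissions and is clamped once it would exceed the bound.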

Congestion Avoidance

Consider multiple concurrent TCP connections

Assumption: TCP connection operates in equilibrium

Packet loss is with a high probability caused by a newly started TCP connection

New connection requires resources on bottleneck router/link

$\rightarrow$ Load of already existing TCP connection(s) needs to be reduced

Basic components

Implicit congestion signals

Retransmission timeout

Duplicate acknowledgements

Strategy to adjust traffic load: AIMD

Additively increase load if no congestion signal is experienced

On acknowledgement received: $𝐶𝑊𝑛𝑑 += 1/𝐶𝑊𝑛𝑑$

Multiplicatively decrease load in case a congestion signal was experienced

On retransmission timeout
$$ CWnd = \gamma * CWnd, \quad 0< \gamma < 1 $$

In TCP Tahoe: $\gamma = 1/2$

Optimization Criteria

Basic Scenario

$𝑁$ senders that use the same bottleneck link

Data rate of sender $i$: $r\_i(t)$

Capacity of bottleneck link: $C$

Bottleneck link: Link with lowest available data rate on the path to the receiver

Network-limited sender

Assume that the sender always has data to send and data are sent as quickly as possible

Sender can send a full window of data

Congestion control limits the data rate of such a sender to the available capacity at the bottleneck link

Application-limited sender

Data rate of the sender is limited by the application and not by the network

Sender sends less data than allowed by the current window

Efficiency

Closeness of the total load on the bottleneck link to its link capacity

$\sum\_{i=1}^{N} r\_{i}(t)$ should be as close to $C$ as possible, i.e., close to the knee

Overload and underload are not desirable

Fairness

All senders that share the bottleneck link get a fair allocation of the bottleneck link capacity

Examples

Jain's fairness index

Quantify “amount” of unfairness
$$ F\left(r\_{1}, \ldots, r\_{N}\right)=\frac{\left(\sum r\_{i}\right)^{2}}{N\left(\sum r\_{i}^{2}\right)} $$

Fairness index $\in [0, 1]$

Totally fair allocation has fairness index of $1$ (i.e., all $r\_i$ are equal)

Totally unfair allocation has fairness index of $1/N$ (i.e., one user gets entire capacity)
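A direct transcription of Jain's fairness index, as a minimal sketch (the function name `jain_fairness` is chosen here for illustration):

```python
def jain_fairness(rates):
    """Jain's fairness index for a non-empty list of rates (not all zero)."""
    total = sum(rates)
    sum_of_squares = sum(r * r for r in rates)
    return (total * total) / (len(rates) * sum_of_squares)


fair = jain_fairness([5, 5, 5, 5])       # all senders get the same rate
unfair = jain_fairness([10, 0, 0, 0])    # one sender gets everything
```

The two calls reproduce the extreme cases from the list above: index $1$ for equal rates and $1/N$ when a single sender takes the whole capacity.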

Max-min fairness

Situation

Users share resource. Each user has an equal right to the resource

But: some users intrinsically demand fewer resources than others (E.g., in case of application-limited senders)

Intuitive allocation of fair share

Allocates users with a “small” demand what they want

Equally distributes unused resources to “big” users

💡 Max-min fair allocation

Resources are allocated in order of increasing demand

No source gets a resource share larger than its demand

Sources with unsatisfied demands get an equal share of the resource

Implementation

Senders $1, 2, ... 𝑁$ with demanded sending rates $s\_1, s\_2, ..., s\_N$

Without loss of generality: $s\_1 \leq s\_2 \leq ...\leq s\_N$

$C$: capacity

Give $\frac{C}{N}$ to sender with smallest demand

In case this is more than demanded, then $\frac{C}{N}− s\_1$ is still available to others

$\frac{C}{N} − s\_1$ equally distributed to others $\Rightarrow$ each gets $ \frac{C}{N} + \frac{\frac{C}{N} - s\_1}{N- 1}$
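The procedure above generalizes to "progressive filling": serve senders in order of increasing demand and hand whatever a sender does not need to those still waiting. The following is a sketch with a hypothetical function name; demands and capacity are plain numbers in the same unit.

```python
def max_min_allocation(demands, capacity):
    """Max-min fair shares for the given demands on a link of the given capacity."""
    allocation = [0.0] * len(demands)
    remaining = float(capacity)
    # Serve senders in order of increasing demand.
    order = sorted(range(len(demands)), key=lambda i: demands[i])
    for position, sender in enumerate(order):
        equal_share = remaining / (len(order) - position)
        # No sender gets more than it demands; leftovers stay in `remaining`.
        allocation[sender] = float(min(demands[sender], equal_share))
        remaining -= allocation[sender]
    return allocation


shares = max_min_allocation([1, 3, 10], capacity=9)
```

In this example the first two senders are fully satisfied (1 and 3), and the third receives the remaining 5 instead of its demand of 10.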

Example

Convergence

Responsiveness: Speed with which $r\_i$ gets to equilibrium rate at knee after starting from any starting state

May oscillate around goal (= network capacity)

Smoothness: Size of oscillations around network capacity at steady state

(Smaller is better in both cases)

On Fairness

How to divide resources among TCP connections?

$\rightarrow$ Strive for fair allocation 💪

🎯 Goal: all TCP connections receive equal share of bottleneck resource

the share should be non-zero

equal share is not ideal for all applications 🤔

Example: $𝑁$ TCP connections share the same bottleneck; each TCP connection receives $(1/𝑁)$-th of the bottleneck capacity

Observation

“Greedy” user: opens multiple TCP connections concurrently

Example

Link with capacity $𝐷$, two users, one connection per user

$\rightarrow$ Each user gets capacity $\frac{D}{2}$

Link with capacity $𝐷$, two users, user 1 with a single connection, user 2 with nine connections

$\rightarrow$ User 1 can use $\frac{1}{10}D$ , user 2 can use $\frac{9}{10}D$

“Greedy” receiver

Can send several ACKs per received segment

Can send ACKs faster than it receives segments

Additive Increase Multiplicative Decrease

General feedback control algorithm

Applied to congestion control

Additive increase of data rate until congestion

Multiplicative decrease of data rate in case of congestion signal

$$ r_{i}(t+1)= \begin{cases} r_{i}(t)+a & \text { if no congestion is detected } \\\\ r_{i}(t) * b & \text { if congestion is detected } \end{cases} $$

Converges to equal share of capacity at bottleneck link

AIMD: Fairness

Network with two sources that share a bottleneck link with capacity $𝐶$

🎯 Goal: bring system close to optimal point $(\frac{𝐶}{2} , \frac{𝐶}{2})$

Efficiency line

$r\_1 + r\_2 = C$ holds for all points on the line

Points below the line: the link is underloaded $\rightarrow$ Control decision: increase rate

Points above the line: the link is overloaded $\rightarrow$ Control decision: decrease rate

Fairness line

All allocations with fair allocation, i.e. $r\_1 = r\_2$

Multiplying with $𝑏$ does not change fair allocation: $br\_1 = br\_2$

Optimal operating point

Intersection of efficiency line and fairness line: point $(\frac{𝐶}{2} , \frac{𝐶}{2})$

Optimality of AIMD

Additive increase

Resource allocation of both users increased by $a$

In the graph: moving up along a 45-degree line

Multiplicative decrease

Move down along the line that connects to the origin

$\rightarrow$ Point of operation iteratively moves closer to optimal operating point 👏
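The convergence argument can be checked numerically. The sketch below applies the AIMD update rule to two sources with synchronized feedback (both see the same congestion signal in the same step, which is a simplifying assumption). The function name and the parameter values are hypothetical.

```python
def aimd_two_sources(r1, r2, capacity, a=1.0, b=0.5, steps=500):
    """Iterate AIMD for two sources sharing one bottleneck; return final rates."""
    for _ in range(steps):
        if r1 + r2 > capacity:
            # Above the efficiency line: multiplicative decrease for both.
            r1, r2 = r1 * b, r2 * b
        else:
            # On or below the efficiency line: additive increase for both.
            r1, r2 = r1 + a, r2 + a
    return r1, r2


r1, r2 = aimd_two_sources(r1=1.0, r2=40.0, capacity=50.0)
```

Additive increase leaves the gap $r_1 - r_2$ unchanged, while each multiplicative decrease shrinks it by the factor $b$, so the rates end up almost equal even from a very unfair start.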

Periodic Model

Performance metrics of interest

Throughput: How much data can be transferred in which time interval?

Latency: How high is the experienced delay?

Completion time: How long until the transfer of an object/file is finished?

Variables

$X$: Sending rate measured in segments per time interval

$RTT$: Round trip time [seconds]

$p$: Loss probability of a segment

$MSS$: Maximum segment size [bit]

$W$: Value of a congestion window [MSS]

$D$: Data rate measured in bit per second [bit/s]

Periodic Model

Simple model – strong simplifications

🎯 Goals

Model long-term steady state behavior of TCP

Evaluate achievable throughput of a TCP connection under certain network conditions

Basic assumptions

Network has constant loss probability $p$

Observed TCP connection does not influence $p$

Further simplification: periodic losses

For an individual connection segment losses are equally spaced

$\rightarrow$ Link delivers $N = \frac{1}{p}$ segments followed by a segment loss

Additional simplifications / model assumptions

Slow start is ignored

Congestion window increases linearly (congestion avoidance)

RTT is constant

Losses are detected using duplicate ACKs (No timeouts)

Retransmissions are not modelled

Go-Back-N is not modelled

Connection only limited by $CWnd$

Flow control (receive window) is never a limiting factor

Always $MSS$ sized segments are sent

Under given assumptions we have the diagram:

Progress of CWnd: Perfect periodic saw tooth curve $$ \frac{W}{2}*MSS \leq CWnd \leq W * MSS $$ Note: Here $W$ is unitless.

Data rate when segment loss occurs?
$$ D = \frac{W * MSS}{RTT} $$

How long until congestion window reaches 𝑊 again?
$$ \frac{W}{2} * RTT $$

Average data rate of a TCP connection?
$$ D = \frac{0.75W * MSS}{RTT} $$

Step 1: Determine $W$ as a function of $p$

Minimal value of congestion window: $\frac{W}{2}$

Congestion window opens by one segment per RTT

Duration of a period: $$ t = \frac{W}{2} \text{ round trip times } = \frac{W}{2}*RTT \text{ seconds } $$

Number of delivered segments within one period

Corresponds to the area under the saw tooth curve
$$ N=\left(\frac{W}{2}\right)^{2}+\frac{1}{2}\left(\frac{W}{2}\right)^{2}=\frac{3}{8} W^{2} $$

According to the assumptions $N = \frac{1}{p}$

$\Rightarrow W = \sqrt{\frac{8}{3p}}$

Step 2: Determine data rate $D$ as a function of $p$

Average data rate
$$ D=\frac{N * M S S}{t} $$
We have assumption $N = \frac{1}{p}$ and period duration is $\frac{W}{2}*RTT$ [s]
$$ \Rightarrow D=\frac{\frac{1}{p} * M S S}{R T T * \frac{W}{2}} $$
In step 1 we have $W=\sqrt{\frac{8}{3 p}}$
$$ D=\frac{1}{R T T} \sqrt{\frac{3}{2 p}} * M S S $$
This is called “Inverse Square-Root $𝑝$ Law”
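The two steps can be checked numerically. The sketch below evaluates $W(p)$ and $D(p)$ as derived above. The function names and the example values (MSS of 12000 bit, RTT of 100 ms, loss probability 1 %) are assumptions for illustration only.

```python
import math


def window_at_loss(p):
    """Congestion window W (in MSS) at which the periodic loss occurs."""
    return math.sqrt(8.0 / (3.0 * p))


def tcp_data_rate(mss_bits, rtt_s, p):
    """Average data rate in bit/s according to the inverse square-root p law."""
    return (mss_bits / rtt_s) * math.sqrt(3.0 / (2.0 * p))


mss_bits, rtt_s, p = 12000, 0.1, 0.01   # hypothetical example values
rate = tcp_data_rate(mss_bits, rtt_s, p)
# Cross-check against the saw tooth average D = 0.75 * W * MSS / RTT.
rate_from_window = 0.75 * window_at_loss(p) * mss_bits / rtt_s
```

Both expressions give the same value, and quadrupling the loss probability halves the achievable data rate.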

Example

Active Queue Management (AQM)

Simple Queue Management

Buffer in the router is full

Next segment must be dropped $\rightarrow$ Tail drop

TCP detects congestion and backs off

🔴 Problems

Synchronization: Segments of several TCP connections are dropped (almost) at the same time

Nearly full buffer cannot absorb short bursts

Active Queue Management

Basic approach

Detect arising congestion within the network

Give early feedback to senders

Intentionally trigger implicit congestion signal: packet loss

Alternative: Send explicit congestion notification (ECN)

Routers drop (or mark) segments, before queue completely filled up

Randomization: random decision on which segment to be dropped

Observation at the receiver on layer 4: typically only a single segment is missing

AQM algorithms

Random Early Detection (RED)

Newer algorithms: CoDel, FQ-CoDel, PIE …

Random Early Detection

Approach

Average queue occupancy $< q\_{min}$

No drop of segments ($𝑝 = 0$)

$q\_{min} \leq$ average queue occupancy $< q\_{max}$

Probability of dropping an incoming packet is linearly increased with average queue occupancy

Average queue occupancy $ \geq q\_{max}$

Drop all segments ($𝑝 = 1$)
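The three RED regions map directly onto a piecewise function. This is a sketch: the parameter `max_p` (the drop probability reached just below $q_{max}$) is an assumption added here, since the notes only state that the probability increases linearly.

```python
def red_drop_probability(avg_queue, q_min, q_max, max_p=1.0):
    """Drop probability for an incoming packet given the average queue occupancy."""
    if avg_queue < q_min:
        return 0.0                      # no drops
    if avg_queue >= q_max:
        return 1.0                      # drop everything
    # Linear increase between q_min and q_max.
    return max_p * (avg_queue - q_min) / (q_max - q_min)
```

With `q_min=10`, `q_max=30` and `max_p=0.1`, an average occupancy of 20 lies halfway through the linear region.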

Explicit Congestion Notification

🎯 Goal: Send explicit congestion signal, avoid unnecessary packet drops

Approach

Enable AQM to explicitly notify about congestion

AQM does not have to drop packets to create implicit congestion signal

How to notify?

Mark IP datagram, but do not drop it

Marked IP datagram is forwarded to receiver

How to react?

Marked IP datagram is delivered to receiver instance of IP

Information must be passed to corresponding receiver instance of TCP

TCP sender must be notified

Additional Resource

TCP Congestion Control 👍

3.7 - TCP Congestion Control | FHU - Computer Networks 👍

Gaussian Distribution

Sun, 03 Jul 2022 00:00:00 +0000

Scalar Case (1D)
$$ f(x)=\frac{1}{\sqrt{2 \pi} \sigma} \exp \left\{-\frac{1}{2} \frac{(x-\hat{x})^{2}}{\sigma^{2}}\right\} $$

Expected value
$$ E_{f}\{x\}=\hat{x} $$

Variance
$$ E_{f}\left\{(x-\hat{x})^{2}\right\}=\sigma^{2} $$

Given the parameters $\hat{x}$ and $\sigma$ of a Gaussian density, mean and variance are already given. Conversely, assume that we wish to approximate a given density $\tilde{f}_x$ with a simpler density of the same mean and standard deviation. Then, given the mean and the standard deviation of the density $\tilde{f}_x$, an appropriate Gaussian density is immediately obtained. This is a property not generally shared by more complicated densities.

2D Normal Distribution
$$ \begin{aligned} f_{x y}(x, y)&=\frac{1}{2 \pi \sigma_{x} \sigma_{y} \sqrt{1-r^{2}}} \exp \left\{-\frac{1}{2} Q(x, y)\right\} \\ Q(x, y)&=\frac{1}{1-r^{2}}\left\{\frac{(x-\hat{x})^{2}}{\sigma_{x}^{2}}-2 r \frac{x-\hat{x}}{\sigma_{x}} \frac{y-\hat{y}}{\sigma_{y}}+\frac{(y-\hat{y})^{2}}{\sigma_{y}^{2}}\right\} \end{aligned} $$

$r \in [-1, 1]$: correlation coefficient (in some literature also written as $\rho$)

Alternatively
$$ f_{x y}(x, y)=\mathcal{N} \left(\left[\begin{array}{l} x \\ y \end{array}\right],\left[\begin{array}{l} \hat{x} \\ \hat{y} \end{array}\right],\left[\begin{array}{ll} C_{x x} & C_{x y} \\ C_{y x} & C_{y y} \end{array}\right]\right) $$
with
$$ \left[\begin{array}{ll} C_{x x} & C_{x y} \\ C_{y x} & C_{y y} \end{array}\right]=\left[\begin{array}{lc} \sigma_{x}^{2} & r \sigma_{x} \sigma_{y} \\ r \sigma_{x} \sigma_{y} & \sigma_{y}^{2} \end{array}\right] $$
Correlation coefficient

Correlation of bivariate Gaussian distribution ($\rho$ is the correlation coefficient).

Uncorrelated ($r = 0$) (Figure 1 right)

$\Rightarrow \boldsymbol{x}, \boldsymbol{y}$ uncorrelated

$\Rightarrow$ (only for Gaussians) $\boldsymbol{x}, \boldsymbol{y}$ independent ($f_{\boldsymbol{x}, \boldsymbol{y}}(x, y) = f_{\boldsymbol{x}}(x) f_{\boldsymbol{y}}(y)$)

Positively correlated ($r > 0$) (Figure 1 left)

Negatively correlated ($r < 0$) (Figure 1 middle)

$N$-dimensional Normal Distribution
$$ f_{\boldsymbol{x}}(\underline{x})=\frac{1}{\sqrt{(2 \pi)^{N} \cdot|\mathbf{C}|}} \exp \left\{-\frac{1}{2}(\underline{x}-\underline{\hat{x}})^{\top} \mathbf{C}^{-1}(\underline{x}-\underline{\hat{x}})\right\} $$

$\underline{\hat{x}}$ : Mean

$\mathbf{C}$ : Covariance matrix
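The scalar and the 2D density can be evaluated directly to check the statement about the uncorrelated case. The sketch below uses the standard factor $\frac{1}{1-r^{2}}$ in the quadratic form $Q$ and requires $|r| < 1$. The function names are chosen here for illustration.

```python
import math


def gauss_1d(x, mean, sigma):
    """Scalar Gaussian density with the given mean and standard deviation."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)


def gauss_2d(x, y, mean_x, mean_y, sigma_x, sigma_y, r):
    """Bivariate Gaussian density with correlation coefficient r, |r| < 1."""
    dx = (x - mean_x) / sigma_x
    dy = (y - mean_y) / sigma_y
    q = (dx * dx - 2.0 * r * dx * dy + dy * dy) / (1.0 - r * r)
    norm = 2.0 * math.pi * sigma_x * sigma_y * math.sqrt(1.0 - r * r)
    return math.exp(-0.5 * q) / norm


joint = gauss_2d(0.3, -1.2, 0.0, 1.0, 2.0, 0.5, r=0.0)
product = gauss_1d(0.3, 0.0, 2.0) * gauss_1d(-1.2, 1.0, 0.5)
```

For $r = 0$ the joint density equals the product of the two marginal densities, which is the independence property noted above.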

Ethernet

Thu, 18 Mar 2021 00:00:00 +0000

Aloha, Slotted Aloha

Aloha

First MAC protocol for packet-based wireless networks

Media access control (MAC)

Time multiplex, variable, random access

NO previous sensing of medium and no announcement of intended transmission

Asynchronous access

🔴 Problem: Collision possible

Schema

Slotted Aloha

Like Aloha, but

Uses time slots

Synchronized access only at beginning of time slot

On average fewer collisions than with Aloha

Schema

Evaluation

How well can the capacity of the medium be utilized?

Evaluation of Slotted Aloha

Assumptions

Based on the design

All systems start transmissions at beginning of time slot

All systems work synchronized

Simplifications

All packets have same length and fit into one time slot

If a collision arises, all systems notice it before end of the time slot

All systems always want to send data

Every system sends in each time slot with a probability of $𝑝$

If a collision occurs

Packet will be repeated with a probability of $𝑝$ in all following time slots until the transmission is successful

There are $𝑁$ active systems in the network

Probability that a system starts sending: $𝑝$

Probability that $𝑁 − 1$ systems are not sending: $(1 - p)^{N-1}$

Probability that a given system succeeds: $p(1 - p)^{N-1}$

Probability for successful transmission of any one system: $Np(1 - p)^{N-1}$

Seeking the maximum utilization $U\_{max}$

Need $p^*$ s.t. $Np(1 - p)^{N-1}$ reaches its maximum

Solution: $p^\* = \frac{1}{N}$

Therefore:
$$ \begin{array}{l} &N p^{\*}\left(1-p^{\*}\right)^{N-1}=\left(1-\frac{1}{N}\right)^{N-1}\\\\ &\displaystyle{\lim \_{N \rightarrow \infty}}\left(1-\frac{1}{N}\right)^{N-1}=\frac{1}{e}\\\\ &U\_{\max }=\frac{1}{e}\approx 0.37 \end{array} $$
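The result can be checked numerically. A small sketch (function name chosen for illustration) that evaluates the success probability and compares it with $1/e$ for a large number of systems:

```python
import math


def slotted_aloha_success(n, p):
    """Probability that exactly one of n systems sends in a given time slot."""
    return n * p * (1 - p) ** (n - 1)


n = 1000
utilization_at_optimum = slotted_aloha_success(n, 1 / n)
gap_to_limit = abs(utilization_at_optimum - 1 / math.e)
```

Evaluating the function slightly below and above $p = 1/N$ gives smaller values, which confirms the location of the maximum.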

Evaluation of Aloha

Simplifying assumptions

All packets have same length

Immediate notification about collisions

On collision: Packet will be repeated immediately with probability $𝑝$

On successful transmission

Wait for transmission time of packet

Then: continue sending with probability $𝑝$ and continue waiting with probability $1 − 𝑝$

Observation: Collision occurs

a) if previous packet from other system has not been sent completely, or

b) if other system starts sending before ongoing transmission is finished

There are $𝑁$ active systems in the network

Probability that a system starts sending: $𝑝$

Probability that (a) does not occur, and likewise that (b) does not occur: $(1 - p)^{N-1}$ each

Probability for successful transmission of any one system: $Np(1 - p)^{2(N-1)}$

Further analysis as for Slotted Aloha; here the maximum is reached at $p^\* = \frac{1}{2N-1}$
$$ \begin{array}{l} \displaystyle{\lim\_{N \rightarrow \infty}} \frac{N}{2 N-1}\left(1-\frac{1}{2 N-1}\right)^{2(N-1)}=\frac{1}{2 e} \\\\ \Rightarrow U_{\max }=\frac{1}{2 e}=0.18 \end{array} $$
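The same numerical check works for pure Aloha. In the sketch below (function name chosen for illustration) the optimum $p^\* = \frac{1}{2N-1}$ follows from setting the derivative of $p(1-p)^{2(N-1)}$ to zero.

```python
import math


def pure_aloha_success(n, p):
    """Success probability of any one of n systems in pure Aloha."""
    return n * p * (1 - p) ** (2 * (n - 1))


n = 1000
p_opt = 1 / (2 * n - 1)
utilization_at_optimum = pure_aloha_success(n, p_opt)
gap_to_limit = abs(utilization_at_optimum - 1 / (2 * math.e))
```

The maximum is half of the Slotted Aloha value because a packet is vulnerable for two packet durations instead of one slot.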

Comparison of Utilization Between Aloha and Slotted Aloha

CSMA-based Approaches

CSMA = Carrier Sense Multiple Access

CSMA/CD

CD = Collision Detection (“Listen before talk, listen while talk“)

Sending system can detect collisions by listening

Usage example: Ethernet

CSMA/CA

CA = Collision Avoidance

Sending system assumes collisions when acknowledgement is missing

MAC-layer acknowledgements, stop-and-wait

Usage example: WLAN

Ethernet Variants

| Ethernet variant | Data rate | Topology | Medium access | Evaluation | Layers | Flow control |
|---|---|---|---|---|---|---|
| Original | 10 Mbit/s | Bus | CSMA/CD: check medium, 1-persistent sending, collision detection by sender, exponential backoff | Utilization | 1 and 2a | |
| Fast Ethernet | 100 Mbit/s | Star | CSMA/CD | | | Implicit / explicit |
| Gigabit Ethernet | 1 Gbit/s | Star | Carrier extension, frame bursting | | | |

The Original

Standardized as IEEE 802.3

Medium access control

Time multiplex, variable, random access

Asynchronous access

Uses CSMA/CD

Collision detection through listening

Exponential backoff

1-persistent

Network topology

Originally: Bus topology

Data rate

Originally: 10 Mbit/s

Wire based

Originally: Coaxial cable

Standard consists of

Layer 1 and

Layer 2a (MAC-Protocol)

CSMA/CD-based approach

Check medium

Considered free if no activity is detected for 96 bit times

96 bit times = Inter Frame Space (IFS)

Sending: 1-persistent

1-persistent

1-persistent CSMA is an aggressive transmission algorithm. When the transmitting node is ready to transmit, it senses the transmission medium for idle or busy.

If idle, then it transmits immediately.

If busy, then it senses the transmission medium continuously until it becomes idle, then transmits the message (a frame) unconditionally (i.e. with probability=1).

In case of a collision, the sender waits for a random period of time and attempts the same procedure again.

1-persistent CSMA is used in CSMA/CD systems including Ethernet.

Collision detection by sender

Abort sending

Send jamming signal (length of 48 bit, format 1010...)

Ensure collision detection: Minimum length of frame

Exponential backoff for repeated transmissions

Collision Detection

Collision detection by sender

Detection must happen before transmission is finished

$\rightarrow$ A minimum sending duration is needed

Twice the maximum propagation delay $t\_a$ of the medium

$\rightarrow$ Minimum length of a 802.3-MAC frame required

In case of shorter frames

No reliable collision detection 🤪

No CSMA/CD, only CSMA 🤪

How to enforce minimum frame length?

Implemented transparently for the application

I.e., application can transmit small portions of data if desired

Frame is extended by padding field (PAD)
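The relation between propagation delay and minimum frame length can be sketched as follows. The function names are hypothetical, and the example delay of 25.6 µs is an assumed value chosen so that the result for 10 Mbit/s is 512 bit. Integer arithmetic (delay in nanoseconds) avoids rounding problems.

```python
def min_frame_bits(data_rate_bps, max_prop_delay_ns):
    """Smallest frame (in bit) whose transmission lasts at least 2 * t_a."""
    numerator = 2 * max_prop_delay_ns * data_rate_bps
    return -(-numerator // 10**9)    # integer ceiling division


def padding_bits(frame_bits, minimum_bits):
    """Number of PAD bits needed to reach the minimum frame length."""
    return max(0, minimum_bits - frame_bits)


minimum = min_frame_bits(data_rate_bps=10_000_000, max_prop_delay_ns=25_600)
pad = padding_bits(frame_bits=300, minimum_bits=minimum)
```

A frame shorter than the minimum is padded up to it, while longer frames need no padding.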

Ethernet Frame

Structure

Between two frames: IFS

Evaluation Ethernet: Utilization

🎯 Goal: Derive upper bound of utilization $U\_{max}$

Assumption

Perfect protocol

No transmission errors, no overhead, no processing time, …

Achieved throughput
$$ r\_{e}=\frac{X}{t\_{s}+t\_{a}}=\frac{X}{X / r+d / v} $$

$r\_e$: effective data rate

$X$: #bits to transmit

$t\_a$: propagation delay

$t\_s$: transmission delay

$r$: data rate

$d$: medium distance

$v$: signal propagation speed

Parameter $𝑎$ often used for performance evaluation
$$ a= \frac{\text{propagation delay}}{\text{transmission delay}} = \frac{t\_{a}}{t\_{s}}=\frac{d / v}{X / r}=\frac{r d}{X v} $$

Utilization under optimal circumstances
$$ U\_{\max }=\frac{r\_{e}}{r}=\frac{1}{1+a} $$
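A short numerical sketch of the parameter $a$ and the resulting upper bound. The function names and the example values (10 Mbit/s, 2000 m, 1000 bit frame, $2 \cdot 10^{8}$ m/s) are assumptions for illustration only.

```python
import math


def normalized_propagation_delay(rate_bps, distance_m, frame_bits, speed_mps):
    """a = propagation delay / transmission delay = (r * d) / (X * v)."""
    return (rate_bps * distance_m) / (frame_bits * speed_mps)


def max_utilization(a):
    """Upper bound on utilization for a perfect protocol: 1 / (1 + a)."""
    return 1.0 / (1.0 + a)


r, d, X, v = 10e6, 2000.0, 1000.0, 2e8   # hypothetical example values
a = normalized_propagation_delay(r, d, X, v)
u_max = max_utilization(a)
# Cross-check via the effective data rate r_e = X / (X / r + d / v).
r_e = X / (X / r + d / v)
```

Larger frames or shorter distances reduce $a$ and move the bound closer to 1.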

Local network with $𝑁$ active systems

Each system can always send a frame

System sends frames with probability $𝑝$

Maximum normalized propagation delay of $𝑎$

I.e., transmission time $t\_s$ of each frame is normalized to 1

Time is logically partitioned in time slots

Slot length is twice the end-to-end propagation delay (i.e., $2a$)

Observations

Two types of time intervals

Transmission intervals: $\frac{1}{2a}$ time slots

Transmission time $t\_s$ is normalized to 1

Length of each time slot is $2a$

$\Rightarrow$ We need $\frac{1}{2a}$ time slots

Collision intervals: collisions or no transmissions

$$ U\_{\max }=\frac{\text { Transmission interval }}{\text { Transmission interval }+\text { Collision interval }} $$

Evaluation
$$ \lim \_{N \rightarrow \infty} U\_{\max }=\frac{1}{1+3.44 a} $$

Details

Average length $l\_k$ of a collision interval (measured in time slots)

Probability $A$ that exactly one system is sending:
$$ A = Np(1 - p)^{N-1} $$

Function has maximum at $p^\* = \frac{1}{N} \Rightarrow A^\* = (1 - \frac{1}{N})^{N-1}$

Probability that in $i$ following time slots a collision or no transmission occurs,

followed by a time slot with transmission
$$ \left(1-A^{\*}\right)^{i} A^{\*} $$

Average length $l\_k$:
$$ E\left[l\_{k}\right]=\sum\_{i=1}^{\infty} i\left(1-A^{\*}\right)^{i} A^{\*} = \frac{1-A^{\*}}{A^{\*}} $$

Therefore
$$ U\_{\max }=\frac{\text { Transmission interval }}{\text { Transmission interval }+\text { Collision interval }} = \frac{1 /(2 a)}{1 /(2 a)+\left(1-A^{\*}\right) / A^{\*}}=\frac{1}{1+2 a\left(1-A^{\*}\right) / A^{\*}} $$
For increasing number $N$ of systems
$$ \lim \_{N \rightarrow \infty} A^{\*}=\lim \_{N \rightarrow \infty}\left(1-\frac{1}{N}\right)^{N-1}=1 / e $$ $$ \Rightarrow \lim \_{N \rightarrow \infty} U\_{\max }=\frac{1}{1+3.44 a} $$
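The constant 3.44 is $2(e-1)$ rounded to two decimals, which the following sketch makes visible. The function names are chosen for illustration, and `csma_cd_utilization` assumes every system sends with the optimal probability $p^\* = 1/N$.

```python
import math


def csma_cd_utilization(a, n):
    """U_max for n systems, using A* = (1 - 1/n)^(n - 1)."""
    a_star = (1 - 1 / n) ** (n - 1)
    return 1 / (1 + 2 * a * (1 - a_star) / a_star)


def csma_cd_utilization_limit(a):
    """Limit for n -> infinity, where A* -> 1/e and 2(1 - A*)/A* -> 2(e - 1)."""
    return 1 / (1 + 2 * (math.e - 1) * a)


factor = 2 * (math.e - 1)
u_many_systems = csma_cd_utilization(a=0.1, n=10**6)
u_limit = csma_cd_utilization_limit(a=0.1)
```

For a fixed $a$ the utilization decreases with growing $N$ and approaches the limit from above.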

Fast Ethernet

Standardization: 1995 standardized as IEEE 802.3u (100Base-TX)

Important features

Data rate: 100 Mbit/s

Switchable between 10 Mbit/s and 100 Mbit/s

Automatic negotiation

Network topology: Star

Medium access control

CSMA/CD (for half duplex links)

Preserve Ethernet frame format

Modified encoding

Ethernet Flow Control

Goal: Avoid packet losses due to buffer overflow

Approach: Reduce traffic transmitted to the switch

Apply flow control at layer 2

Half-duplex links (shared LAN)

Implicit flow control

Backpressure: prevent potential transmitter from actually sending traffic

Full-duplex links

Explicit flow control

Implicit Flow Control

Half-duplex links

Two backpressure methods

Enforce collision

Pretend medium is busy

Explicit Flow Control

Full duplex link

Pause function

Receiver transmits PAUSE frame in case of an overload situation

Sender stops transmitting data frames when receiving a PAUSE frame

Implicit continuation after pause time given in PAUSE frame (multiple of time for sending 512 bit)

Explicit continuation when receiving PAUSE frame with time=0

PAUSE function is part of the sublayer MAC control (extension of MAC layer)

MAC Control sublayer

Handling of frames

MAC control frames terminate on the MAC control sublayer or are generated by it

All other frames are passed from/to higher layers

MAC Control Frame

Type = 0x8808

MCC: MAC Control Opcode

Code for selected control function

0x0001: PAUSE frame

MAC Control Parameters: Unused part filled with zeros at the end
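To make the frame layout concrete, here is a small sketch that assembles a PAUSE frame (without preamble and FCS). The destination address 01-80-C2-00-00-01 and the padding of the payload to 46 bytes are not stated in these notes; they are filled in from the IEEE 802.3 definition of the PAUSE function. Function names are illustrative.

```python
import struct

MAC_CONTROL_ETHERTYPE = 0x8808
PAUSE_OPCODE = 0x0001
PAUSE_DST = bytes.fromhex("0180c2000001")  # reserved multicast address for PAUSE

def build_pause_frame(src_mac: bytes, pause_quanta: int) -> bytes:
    """Build a PAUSE frame; pause_quanta counts units of 512 bit times."""
    if len(src_mac) != 6:
        raise ValueError("MAC address must be 6 bytes")
    if not 0 <= pause_quanta <= 0xFFFF:
        raise ValueError("pause time is a 16-bit value")
    header = PAUSE_DST + src_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE)
    payload = struct.pack("!HH", PAUSE_OPCODE, pause_quanta)
    payload += bytes(46 - len(payload))  # unused parameter part filled with zeros
    return header + payload

def pause_seconds(pause_quanta: int, bit_rate: float) -> float:
    """Pause duration: multiple of the time needed to send 512 bit."""
    return pause_quanta * 512 / bit_rate
```

`build_pause_frame(src, 0)` corresponds to the explicit continuation (time = 0).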

Gigabit Ethernet

Important characteristics

Data rate: 1 Gbit/s

Network topology: Star

Medium access control

CSMA/CD (for half-duplex links)

Preserve Ethernet frame format

New concepts

Modify medium access control: Carrier extension

Optional possibilities for improved throughput: Frame bursting

Carrier Extension

Goal: Ensure collision detection

Approach: Increase transmission delay without modifying Ethernet frame structure

Basic enhancements

Length of time slot ≠ minimum length of frame

Minimum frame length: 512 bit

New length of time slot: 512 byte

Frame with carrier extension

Frame Bursting

Goal: Efficient transmission of short frames

Approach: Systems are permitted to send burst of frames directly following each other

First frame with extension, if required

Following frames directly back-to-back (no extension)

Schema
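The cost of carrier extension and the gain from frame bursting can be estimated with a short sketch. It counts only frame and extension bytes and ignores preamble, inter-frame gaps and the burst limit, which is a simplifying assumption; names are illustrative.

```python
SLOT_BYTES = 512       # new length of the time slot
MIN_FRAME_BYTES = 64   # minimum frame length (512 bit)

def carrier_extension_bytes(frame_bytes: int) -> int:
    """Extension needed so that the transmission lasts at least one time slot."""
    if frame_bytes < MIN_FRAME_BYTES:
        raise ValueError("frame shorter than minimum frame length")
    return max(0, SLOT_BYTES - frame_bytes)

def burst_channel_bytes(frame_sizes: list) -> int:
    """Bytes on the medium for a burst: only the first frame is extended."""
    first, *rest = frame_sizes
    return first + carrier_extension_bytes(first) + sum(rest)

def separate_channel_bytes(frame_sizes: list) -> int:
    """Bytes on the medium if every frame is sent on its own with extension."""
    return sum(size + carrier_extension_bytes(size) for size in frame_sizes)
```

For three 64-byte frames, sending them separately occupies 1536 bytes, while a burst occupies 640 bytes.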

Summary

Today's Ethernet is very different from the original version developed by Metcalfe and Boggs

One constant has remained: The Ethernet frame format

Spanning Tree

Bridges

🎯 Goal: Connect local area networks (LANs) on layer 2

Properties

Filter function: Detaches intra-network traffic in one LAN from inter-network-traffic to other LANs

Schema

Types

Source-Routing bridges

End systems add forwarding information in send packets

Bridges forward the packets based on this information

Sending packets is NOT transparent for the end system – it has to know the path

Technically easy but not often used in practice 🤪

Transparent bridges

Local forwarding decisions in each bridge

Forwarding information normally stored in a table (forwarding table)

Static entries as well as dynamically learned entries

End system is NOT involved in forwarding decisions

$\rightarrow$ Existence of bridges is transparent to end systems

Often used in practice (e.g., switches)

Comparison

Transparent Bridges resp. Switches

Important features

For each network interface exists an own layer 1 and MAC instance

Data path: Through MAC relay (implements forwarding on layer 2)

Control path

E.g., bridge protocol, bridge management

Logical Link Control (LLC) instances are involved

Basic Tasks

Establishing a loop-free topology

so that packets do not loop endlessly in the network

$\rightarrow$ Spanning-tree algorithm

Forwarding of packets

Learning the “location” of end systems

Creation of the forwarding table

Filtering resp. forwarding of packets

Based on the information of the forwarding table
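Learning, filtering and forwarding can be sketched in a few lines. This is a simplified model, assuming dynamically learned entries only (no static entries, no aging); class and method names are illustrative.

```python
class LearningBridge:
    """Transparent bridge with a dynamically learned forwarding table."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.table = {}  # MAC address -> port on which it was last seen

    def handle_frame(self, src, dst, in_port):
        """Return the list of ports the frame is forwarded to."""
        self.table[src] = in_port          # learn the location of the sender
        out_port = self.table.get(dst)
        if out_port is None:               # unknown destination: flood
            return [p for p in self.ports if p != in_port]
        if out_port == in_port:            # same LAN: filter
            return []
        return [out_port]                  # known destination: forward
```

A frame to an unknown destination is flooded; once the destination has sent a frame itself, traffic to it is forwarded on a single port or filtered.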

Spanning-Tree Algorithm

Task

Organize bridges in a tree topology (NO loops!)

Nodes: bridges and local networks

Edges: connections between interfaces and local networks

Not all bridges have to be part of the tree topology

Resources might not be used optimally

Forwarding of packets (Only possible along the tree)

Bridge protocol implements the Spanning-Tree algorithm

Requirements for using the bridge protocol

Group address to address all bridges in the network

Unique bridge identifier per bridge in the network

Unique interface identifier per interface in each bridge

Path costs for all interfaces of a bridge have to be known

BPDUs

Bridges send special packets: Bridge Protocol Data Units (BPDUs)

BPDU contains (among others)

Identifier of the sending bridge

Identifier of the bridge that is assumed as root bridge

Path cost from sending bridge to root bridge

Basic Steps

Determine root bridge

Initially

Bridges have no topology information

All bridges: assumption: “I am the root bridge”

Periodically send BPDU with itself as root bridge

Bridges only relay BPDUs, no “normal” packets

Receiving BPDU with smaller bridge identifier

Bridge no longer assumes that it is the root bridge

No longer issues own BPDUs

When receiving a BPDU, the bridge may update its configuration

BPDU contains root bridge with smaller identifier

BPDU with same root bridge identifier but cheaper path to root bridge

Bridge notices that it is not the designated bridge $\rightarrow$ No longer forwards BPDUs

Determine root interfaces for each bridge

Calculate the path costs to the root bridge (Sum over costs of all interfaces on path to the root bridge)

Select interface with the lowest costs

Determine designated bridge for each LAN (loop free!)

LAN can have multiple bridges

Select bridge with lowest costs on root interface

Responsible for forwarding of packets

Other bridges in the LAN will be deactivated
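The result of these steps can be computed centrally for small examples. The sketch below is not the distributed BPDU exchange itself: it assumes that every LAN is a point-to-point link between two bridges, that bridge identifiers are integers, and that all costs are positive. Ties are broken in favor of the upstream bridge with the smaller identifier.

```python
import heapq

def spanning_tree(links):
    """links: {(bridge_a, bridge_b): cost}. Returns (root, cost_to_root, parent)."""
    adj = {}
    for (a, b), cost in links.items():
        adj.setdefault(a, []).append((b, cost))
        adj.setdefault(b, []).append((a, cost))
    root = min(adj)                      # smallest bridge identifier wins
    best = {root: (0, root)}             # bridge -> (path cost, upstream bridge)
    parent = {root: None}
    heap = [(0, root, root)]
    done = set()
    while heap:
        cost, _, bridge = heapq.heappop(heap)
        if bridge in done:
            continue
        done.add(bridge)
        for neighbor, link_cost in adj[bridge]:
            candidate = (cost + link_cost, bridge)
            if neighbor not in done and (neighbor not in best or candidate < best[neighbor]):
                best[neighbor] = candidate
                parent[neighbor] = bridge
                heapq.heappush(heap, (cost + link_cost, bridge, neighbor))
    return root, {b: c for b, (c, _) in best.items()}, parent
```

Links that do not appear as a `(bridge, parent)` pair are the ones that end up deactivated.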

Stable Phase

Root bridge periodically issues BPDUs

Only “active” bridges forward BPDUs

No more BPDUs are received

Bridge again assumes that it is the root bridge

Algorithm re-starts

After stabilization packets are forwarded over the respective ports

Based on the entries in the forwarding table

Example 1

Calculate path costs to root bridge

Determine designated bridges

The resulting spanning tree:

Example 2

HW15

Solution:

a)

b) Note: Root interface is for non-root bridge

c) When calculating designated interface, start from LAN and consider the shortest path

d)

Rapid Spanning Tree Protocol (RSTP)

Overview of some relevant changes

New port states

Alternate Port: best alternative path to root bridge

Backup Port: alternative path to a network that already has a connection

Bridge has two ports which connect to the same network

Sending BPDUs

are additionally used as “keep-alive” messages

Every bridge sends periodic BPDUs (Hello-Timer = 2s)

To the next hierarchy level in the tree

Failure of a neighbor: no BPDU received for 3 consecutive Hello intervals

Example

Data Center

Fri, 19 Mar 2021 00:00:00 +0000

Summary of fat tree

Introduction

Data Center

Typically has

Large number of compute servers with virtual machine support

Extensive storage facilities

Typically uses

Off-the-shelf commodity hardware devices

Huge amount of servers

Switches with small buffers

Commodity protocols: TCP/IP, Ethernet

Should be

Extensible without massive reorganization

Reliable

Requires adequate redundancy

Highly performant

Data Center Network

Interconnects data center servers and storage components with each other

Connects data center to the Internet

Two types of traffic

Between external clients and internal servers

Between internal servers

Border routers: Connect internal network of the data center to the public Internet

Commodity protocols

TCP/IP

Ethernet

Simplified Sketch

Top-of-Rack (ToR) Ethernet switches

connect servers within a rack

Switches typically have small buffers

Can be placed directly at the „top“ of the rack

Typical data center rack has 42-48 rack units per rack

Routing/Forwarding within Data Center

Requirements

Efficient way to communicate between any two servers

Utilize network efficiently

Avoid forwarding loops

Detect failures quickly

Provide flexible and efficient migration of virtual machines between servers

Fat-Tree Topologies

🎯 Goal: Connect large number of servers by using switches that only have a limited number of ports

Characteristics

For any switch, number of links going down to its children is equal to the number of links going up to its parents

The links get „fatter“ towards the top of the tree

Structure

East-west traffic

Between internal servers and server racks

Result of internal applications, e.g.,

MapReduce,

Storage data movement between servers

North-south traffic

Result of external request from the public Internet

Between external clients and internal servers

🔴 Problems: Switches need different numbers of ports

Switches with high number of ports are expensive 💸

K-Pod Fat-Tree

Each switch has $k$ ports

Edge and aggregation switches arranged in $k$ pods

$\frac{k}{2}$ edge switches and $\frac{k}{2}$ aggregation switches per pod

$\Rightarrow$ Overall: $\frac{k^2}{2}$ edge and $\frac{k^2}{2}$ aggregation switches

$\Rightarrow$ $k^2$ switches in all pods

$(\frac{k}{2})^2$ core switches, each connects to $k$ pods

$\Rightarrow$ Overall $k^2 + (\frac{k}{2})^2 = \frac{5}{4}k^2$ switches

Each edge switch connected to $\frac{k}{2}$ servers

$\Rightarrow$ Overall $\frac{k^2}{2} \cdot \frac{k}{2} = \frac{k^3}{4}$ servers can be connected

Each aggregation switch connected to $\frac{k}{2}$ edge and $\frac{k}{2}$ core switches

$\Rightarrow$ Overall $2 \cdot (k \cdot \frac{k}{2}) \cdot \frac{k}{2} = \frac{k^3}{2}$ links (links to servers not included)

Summary: $k$-pod fat-tree

Component number

pod $k$

edge switch $\frac{k^2}{2}$

aggregation switch $\frac{k^2}{2}$

core switch $(\frac{k}{2})^2$

server $\frac{k^3}{4}$

links between switches $\frac{k^3}{2}$

Every link is in fact a physical cable $\rightarrow$ high cabling complexity 🤪
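The component counts in the summary can be reproduced with a small helper. This is a sketch with illustrative names; it only evaluates the formulas listed above.

```python
def fat_tree_sizes(k: int) -> dict:
    """Component counts of a k-pod fat-tree built from k-port switches."""
    if k < 2 or k % 2:
        raise ValueError("k must be an even number >= 2")
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,
        "aggregation_switches": k * half,
        "core_switches": half * half,
        "servers": k * half * half,
        "switch_links": 2 * k * half * half,  # links to servers not included
    }

print(fat_tree_sizes(4))
print(fat_tree_sizes(48)["servers"])
```

With 48-port switches the formula $\frac{k^3}{4}$ gives 27648 servers.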

Example: $k(=4)$-Pod Fat-Tree

👍 Advantages

All switches are identical

Cheap commodity switches can be used

Multiple equal cost paths between any hosts

🔴 Disadvantages: High cabling complexity

Routing Paths

Within a pod: $\frac{k}{2}$ paths from source to destination

Example

Between servers in different pods: $\frac{k^2}{4}$ ($= \frac{k}{2} \cdot \frac{k}{2}$) paths from source to destination

Example

Address Assignment

Suppose assigning the private IPv4 address block 10.0.0.0/8

Pods are enumerated from left to right: $[0, 𝑘 − 1]$

Switches in a pod: IP address 10.pod.switch.1

Edge switches are enumerated from left to right: $[0, \frac{k}{2} - 1]$

Enumeration continues with aggregation switches from left to right: $[ \frac{k}{2}, k - 1]$

Servers: IP address 10.pod.switch.ID

Based on the IP address of the connected edge switch

IDs are assigned to servers from left to right starting with 2

Core switches: IP address 10.k.x.y

x : starts at 1 and increments every $\frac{k}{2}$ core switches

y : enumerates each switch in a block of $\frac{k}{2}$ core switches from left to right, starting with 1

Example: IP address assignment for pod 0
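The numbering rules can be turned into a short generator. This is a sketch of the scheme as described above, with illustrative function names.

```python
def pod_addresses(k: int, pod: int) -> dict:
    """Addresses of the switches and servers in one pod of a k-pod fat-tree."""
    half = k // 2
    edge = [f"10.{pod}.{s}.1" for s in range(half)]
    aggregation = [f"10.{pod}.{s}.1" for s in range(half, k)]
    servers = {
        f"10.{pod}.{s}.1": [f"10.{pod}.{s}.{i}" for i in range(2, half + 2)]
        for s in range(half)  # server IDs start with 2 behind each edge switch
    }
    return {"edge": edge, "aggregation": aggregation, "servers": servers}

def core_addresses(k: int) -> list:
    """Core switches 10.k.x.y with x and y counting from 1 to k/2."""
    half = k // 2
    return [f"10.{k}.{x}.{y}" for x in range(1, half + 1) for y in range(1, half + 1)]

print(pod_addresses(4, 0))
print(core_addresses(4))
```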

Two-level Routing Tables

Example: HW17

Solution for (a):

Solution for (b):

Use the following short-hand notation for the TCAM-based routing tables

x –> a:

💡 Idea: if x.x.x.2, then choose left; if x.x.x.3 then choose right

Switch 10.1.0.1 is connected with

Server x (10.1.0.2)

Server a (10.1.0.3)

Aggregation switch 10.1.2.1

Aggregation switch 10.1.3.1

In TCAM table

For 10.1.0.2 and 10.1.0.3, there’s only ONE way to go

For x.x.x.2 (which is the first/left server connected to the edge switch), next hop will be the first/left connected aggregation switch (in this case, 10.1.2.1)

For x.x.x.3 (which is the second/right server connected to the edge switch), next hop will be the second/right connected aggregation switch (in this case, 10.1.3.1)
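The lookup idea for the edge switch can be sketched as follows. This is a simplified stand-in for the TCAM table, assuming a plain function instead of prefix and suffix entries; the names are illustrative.

```python
def edge_next_hop(dst: str, local_servers: set, uplinks: list) -> str:
    """Next hop at an edge switch: deliver locally or pick an uplink by host ID."""
    if dst in local_servers:
        return dst                           # only one way to go
    host_id = int(dst.split(".")[3])         # x.x.x.2 -> first uplink, x.x.x.3 -> second
    return uplinks[(host_id - 2) % len(uplinks)]

local = {"10.1.0.2", "10.1.0.3"}
uplinks = ["10.1.2.1", "10.1.3.1"]
print(edge_next_hop("10.2.0.2", local, uplinks))
print(edge_next_hop("10.3.1.3", local, uplinks))
```

Choosing the uplink by the last address byte of the destination spreads traffic to different destination hosts over both aggregation switches.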

x –> b:

x –> c:

Ethernet within Data Centers

🎯 Goal

Unification of network technologies in the context of data centers

Storage Area Networks (SANs)

HPC networking (High Performance Computing)

…

Ethernet as a “fabric” for data centers

Has to cope with a mix of different types of traffic $\rightarrow$ Prioritization required

Data Center Bridging

Unified, Ethernet-based solution for a wide variety of data center applications

Extensions to Ethernet

Priority-based flow control (PFC)

Link level flow control independent for each priority

Enhanced transmission selection (ETS)

Assignment of bandwidth to traffic classes

Quantized congestion notification

Support for end-to-end congestion control

Data Center Bridge Exchange

Priority-based Flow Control (PFC)

🎯Objective: avoid data loss due to congestion

Simple flow control already provided by Ethernet: PAUSE frame

All traffic on the corresponding port is paused

Priority flow control pause frame

Eight priority levels on one link

Use of VLAN identifier

$\rightarrow$ Eight virtual links on a physical link

Pause time can be individually selected for each priority level

$\rightarrow$ Differentiated quality of service possible 👏

Prioritization with Ethernet: Virtual LANs

Introduction of a new field for VLAN tags: Q header

Differentiation of traffic according to priority chosen by PCP

Enhanced Transmission Selection (ETS)

Reservation of bandwidth

Introduction of priority groups (PGs)

Can contain multiple priority levels of a traffic type

Different virtual queues in the network interface

Traffic within one priority group can be handled differently

Guarantee a minimum data rate per priority group

Unused capacity usable by other priority groups

Example
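The combination of guaranteed minimum rates and reuse of unused capacity can be illustrated with a small allocation sketch. How spare capacity is split is an assumption here (proportional to the guaranteed shares); the notes only state that other priority groups may use it.

```python
def ets_allocate(capacity: float, guarantees: dict, demands: dict) -> dict:
    """guarantees: priority group -> guaranteed share (fractions summing to <= 1)."""
    alloc = {g: min(demands[g], capacity * share) for g, share in guarantees.items()}
    spare = capacity - sum(alloc.values())
    hungry = {g for g in alloc if demands[g] > alloc[g]}
    while spare > 1e-9 and hungry:
        weight = sum(guarantees[g] for g in hungry)
        given = 0.0
        for g in list(hungry):
            extra = min(spare * guarantees[g] / weight, demands[g] - alloc[g])
            alloc[g] += extra
            given += extra
            if demands[g] - alloc[g] <= 1e-9:
                hungry.discard(g)           # demand satisfied
        spare -= given
    return alloc

print(ets_allocate(10, {"san": 0.5, "lan": 0.3, "hpc": 0.2},
                   {"san": 2, "lan": 6, "hpc": 4}))
```

In this example the storage group uses only 2 of its guaranteed 5 units, so the other two groups receive more than their guaranteed minimum.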

Quantized Congestion Notification (QCN)

Can be used by switch to notify source node that causes congestion

Note: A PAUSE frame is only sent to the neighbor node

Three main functions of QCN protocol

Congestion detection

Estimation of the strength of congestion

Evaluation of buffer occupancy

Predefined threshold reached $\rightarrow$ notification

Congestion notification

Feedback to congestion source via congestion notification message

Contains quantized feedback

Congestion response

Source can limit data rate using a rate limiter

Algorithm with additive increase, multiplicative decrease (AIMD) used

Increase data rate (additive)

Autonomously in absence of feedback

Decrease data rate (multiplicative)

Upon receipt of a congestion notification message

Is lowered by a maximum of 50%
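The congestion response can be sketched as a simple rate limiter. This is a strongly simplified model: the 6-bit feedback range (0 to 63) and the linear mapping of feedback to the decrease are assumptions for illustration, and the increase step is a free parameter.

```python
MAX_FEEDBACK = 63  # assumed range of the quantized feedback value

def on_congestion_notification(rate: float, feedback: int) -> float:
    """Multiplicative decrease, scaled by the feedback, at most 50%."""
    if not 0 <= feedback <= MAX_FEEDBACK:
        raise ValueError("feedback out of range")
    return rate * (1 - 0.5 * feedback / MAX_FEEDBACK)

def on_interval_without_feedback(rate: float, increment: float, line_rate: float) -> float:
    """Additive increase, performed autonomously in absence of feedback."""
    return min(line_rate, rate + increment)
```

A notification with the strongest feedback halves the rate, while weaker feedback causes a smaller reduction.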

Data Center Bridge Exchange (DCBX) Protocol

Detection of capabilities and configuration of neighbors

For example, priority-based flow control

Periodic broadcasts to the neighbors

Beyond the Spanning Tree

🎯 Goals

More flexibility in terms of network topology and usage

Better utilization of the total available capacity

Scalability for networks with many bridges

Various concepts developed

Shortest Path Bridging (SPB)

Transparent Interconnection of Lots of Links (TRILL)

Common characteristics of SPB and TRILL

Provide multipath routing at layer 2

Use of link state routing: modified Intermediate-System-to-Intermediate-System (IS-IS) protocol

Use of en-/decapsulation of frames at domain border

Shortest Path Bridging

Method

Every bridge in the LAN calculates shortest paths

Shortest path trees (unique identifier in the LAN)

Paths have to be symmetric

Learning of MAC addresses

Support for equal cost multipath

Same paths for unicast and multicast

Transparent Interconnection of Lots of Links

Routing bridges (RBridges) implement TRILL

Each RBridge in the LAN calculates shortest routes to all other RBridges $\rightarrow$ Tree

Encapsulation example: data sent from S to D

RBridge RB1 encapsulates frame from S

Specifies RBridge RB3 as the target because D is behind RB3

RBridge RB3 decapsulates frame

RBridges

Encapsulation: insert TRILL header

Resulting overall header

Outer Ethernet

MAC addresses for point-to-point forwarding

Change on every hop

Current source and destination Bridge MAC addresses

TRILL header includes among others

Nickname of ingress RBridge

Nickname of egress RBridge

Hop count

Nicknames of overall source (ingress) and destination (egress) bridges

Inner Ethernet: Source and destination MAC addresses of communicating end systems

MAC addresses of source and destination end systems

Example

TCP within Data Centers

Relevant Properties

Low round trip times (RTT)

Servers typically in close geographical proximity

Values in the range of microseconds instead of milliseconds

Incast communication

Many-to-one: multiple sources transmit data to one sink (synchronized)

Application examples: MapReduce, web search, advertising, recommendation systems …

Multiple paths

Mix of long-lived and short-lived flows

Little statistical multiplexing

Virtualization

Ethernet as a “fabric” for data centers

Commodity switches

Incast Problem in Data Centers

Incast: many-to-one communication pattern

Request is distributed to multiple servers

Servers respond almost synchronously

Often, applications cannot continue until all responses are received, or perform worse if responses are missing

Total number of responses can cause overflows in small switch buffers

Packet Loss in Ethernet Switch

Situation

Ports often share buffers

Individual response may be small (a few kilobytes)

Packet losses in switch possible because

Larger number of responses can overload a port

High background traffic on same port as incast or

High background traffic on a different port than incast

Packet loss causes TCP retransmission timeout

$\rightarrow$ no further data is received, so no duplicate acks can be generated

Barrier synchronization

slowest TCP connection determines efficiency

Affected TCP instance must wait for retransmission timeout

$\rightarrow$ Long periods where TCP connection can not transfer data

$\rightarrow$ Application blocked, i.e., response time increases

Improvements

Smaller minimum retransmission timeout

Desynchronization

Data Center TCP (DCTCP)

🎯 Goal: Achieve high burst tolerance, low latencies and high throughput with shallow-buffered commodity switches

Property: DCTCP works with low utilization of queues without reducing throughput

How does DCTCP achieve its goal?

Responds to strength of congestion and not to its presence

DCTCP

Modifies explicit congestion notification (ECN)

Estimates fraction of bytes that encountered congestion

Scales TCP congestion window based on estimate

ECN in the Switch

Modified explicit congestion notification (ECN)

Very simple active queue management using a threshold parameter $K$

If $\text{\# elements in queue} > K$: Set CE codepoint

Marking based on instantaneous rather than average queue length

Suggestion: $K > (RTT \cdot C)/7$

$C$: data rate in packets/s
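A quick calculation shows the order of magnitude of $K$. The sketch below evaluates the suggested bound and the marking rule; the example values (RTT of 100 µs, 10 Gbit/s, 1500-byte packets) are assumptions chosen for illustration.

```python
def min_marking_threshold(rtt_s: float, rate_pkts_per_s: float) -> float:
    """Lower bound for K in packets: RTT * C / 7."""
    return rtt_s * rate_pkts_per_s / 7

def mark_ce(queue_len_pkts: int, k: float) -> bool:
    """Mark based on the instantaneous queue length."""
    return queue_len_pkts > k

rate = 10e9 / (1500 * 8)                   # packets per second
k = min_marking_threshold(100e-6, rate)
print(round(k, 1))
```

With these values $K$ has to be larger than about 12 packets.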

ECN Echo at the Receiver

New boolean TCP state variable: DCTCP Congestion Encountered (DCTCP.CE)

Receiving segments

If CE codepoint is set and DCTCP.CE is false

Set DCTCP.CE to true

Send an immediate ACK

If CE codepoint is not set and DCTCP.CE is true

Set DCTCP.CE to false

Send an immediate ACK

Otherwise: Ignore CE codepoint
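The receiver rules form a two-state machine. A minimal sketch (class and method names are illustrative):

```python
class DctcpReceiver:
    """Tracks DCTCP.CE and decides when an immediate ACK is required."""

    def __init__(self):
        self.ce = False  # DCTCP.CE

    def on_segment(self, ce_marked: bool) -> bool:
        """Return True if an immediate ACK has to be sent for this segment."""
        if ce_marked != self.ce:   # CE state changes in either direction
            self.ce = ce_marked
            return True
        return False               # otherwise: no special handling
```

Immediate ACKs are only triggered when the marking changes, so the sender can reconstruct the runs of marked and unmarked segments.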

Controller at the Sender

Estimates fraction of bytes sent that encountered congestion (DCTCP.Alpha)

Initialized to 1

Update:
$$ \text{DCTCP.Alpha} = (1-g) \cdot \text{DCTCP.Alpha} + g \cdot M $$

$g$: estimation gain ($0 < 𝑔 < 1$)

$M$: fraction of bytes sent that encountered congestion during previous observation window (approximately $RTT$)
$$ \mathrm{M}=\frac{ \text{ \# marked bytes }}{ \text { \# Bytes acked (total) }} $$

Update congestion window in case of congestion
$$ CWnd = (1 - \text{DCTCP.Alpha} / 2) \cdot CWnd $$

if $\text{DCTCP.Alpha}$ is close to 0, $CWnd$ is only slightly reduced

if $\text{DCTCP.Alpha} = 1$, $CWnd$ is cut by factor 2

Handling of congestion window growth as in conventional TCP

Apply as usual

Slow start, additive increase, recovery from lost packets
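The sender-side controller can be sketched as follows. The default gain $g = 1/16$ is an assumption taken from common DCTCP descriptions, and window growth is left out because it is unchanged from conventional TCP.

```python
class DctcpSender:
    """Maintains DCTCP.Alpha and scales the congestion window."""

    def __init__(self, cwnd: float, g: float = 1 / 16):
        self.cwnd = cwnd
        self.g = g
        self.alpha = 1.0  # initialized to 1

    def end_of_window(self, bytes_acked: int, bytes_marked: int) -> None:
        """Called once per observation window (approximately one RTT)."""
        m = bytes_marked / bytes_acked if bytes_acked else 0.0
        self.alpha = (1 - self.g) * self.alpha + self.g * m
        if bytes_marked > 0:                 # congestion was signaled
            self.cwnd *= 1 - self.alpha / 2
```

If only a few bytes are marked, `alpha` decays towards 0 and the window is reduced only slightly; if all bytes are marked, `alpha` approaches 1 and the window is halved.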

👍 Benefits of DCTCP

Incast

If number of small flows is too large, no congestion control will help

If queue is built up over multiple RTTs, early reaction of DCTCP will help

Queue buildup: DCTCP reacts if queue is longer than $𝐾$ (instantaneously)

Reduces queueing delays

Minimizes impact of long-lived flows on completion time of small flows

More buffer space to absorb transient micro-bursts

Buffer pressure

Queue of a loaded port is kept small

Mutual influence among ports is reduced in shared memory switches

TCP Evolution

Fri, 19 Mar 2021 00:00:00 +0000

TCP Extensions

TCP Options: Basics

TCP Header

TCP Options

🎯 Goal: Flexibility for new developments

TCP header field

Each option is coded in TLV format (Type-Length-Value)

Has variable but limited length

number of options is limited (max. 40 bytes)

TCP header length at most 60 bytes in total (incl. options)

TLV format

Multiple of 32-bit words (otherwise padding is needed)

Type

Selective acknowledgements

Time stamps

Window scaling

Maximum segment size

Multipath TCP

TCP fast open

…

Length: Length of option

Value: Option data
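Parsing the options field shows how the TLV format works in practice. The sketch below also handles the two single-byte options (End of Option List, kind 0, and No-Operation, kind 1), which carry no length and value and are used for padding; these are not mentioned in the notes above.

```python
def parse_tcp_options(data: bytes) -> list:
    """Return a list of (kind, value) tuples from a TCP options field."""
    options = []
    i = 0
    while i < len(data):
        kind = data[i]
        if kind == 0:                 # End of Option List
            break
        if kind == 1:                 # No-Operation (padding)
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError("truncated option")
        length = data[i + 1]          # length covers type, length and value
        if length < 2 or i + length > len(data):
            raise ValueError("bad option length")
        options.append((kind, data[i + 2 : i + length]))
        i += length
    return options

# MSS = 1460 (kind 2), one NOP as padding, window scale shift 7 (kind 3)
print(parse_tcp_options(bytes.fromhex("020405b401030307")))
```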

Option Selective Acknowledgements

TCP uses cumulative acknowledgements

👍 Pro: Very robust against loss of ACK segments

👎 Cons: Inefficient loss recovery

Sender can only learn about a single lost segment per RTT

Consequently

Fast retransmit/fast recovery can only recover one lost segment per RTT

Multiple losses often lead to retransmission timeouts and head-of-line blocking

Improvement: selective acknowledgements (SACK)

Also acknowledge “out-of-order” data

Implemented as TCP option

💡 Idea: Separately acknowledge continuous blocks of out-of-order data

Usage of SACK option negotiated during connection establishment

SACK option format

Typically, only 2-4 blocks can be “SACKed” in one segment

Case

Handling:

Use first entry of SACK option to report new information

Use subsequent entries of SACK option for redundancy, in case prior ACKs were lost

Should repeat most recently sent first blocks

Different alternatives

Example
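The relation between the cumulative ACK and the SACK blocks can be sketched as follows. This is a simplification: blocks are reported in sequence order, not with the most recent block first, and byte ranges start at sequence number 0. Names are illustrative.

```python
def ack_and_sack(received: list, max_blocks: int = 3):
    """received: (start, end) byte ranges, end exclusive. Returns (cum_ack, blocks)."""
    merged = []
    for start, end in sorted(received):
        if merged and start <= merged[-1][1]:      # contiguous or overlapping
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    cum_ack = 0
    if merged and merged[0][0] == 0:               # in-order data from the start
        cum_ack = merged.pop(0)[1]
    blocks = [tuple(b) for b in merged[:max_blocks]]
    return cum_ack, blocks

print(ack_and_sack([(0, 1000), (2000, 3000), (3000, 4000), (5000, 6000)]))
```

Here the cumulative ACK stays at 1000, while the SACK blocks tell the sender that only the ranges 1000 to 2000 and 4000 to 5000 are missing.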

Option Window Scaling

Header field receive window remains unchanged (16 bit)

Scaling factor can be changed

E.g., measure window size in 32 bit words instead of bytes

Option is negotiated during connection establishment

Within SYN and SYN/ACK segments

Scaling factor remains unchanged during lifetime of a TCP connection
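The effect of the scaling factor is a left shift of the 16-bit header field. In this sketch the upper limit of 14 for the shift count comes from the TCP window scale specification (RFC 7323), not from the notes above.

```python
MAX_SHIFT = 14  # largest shift count allowed by the window scale option

def effective_window(window_field: int, shift: int) -> int:
    """Receive window in bytes for a 16-bit header field and a negotiated shift."""
    if not 0 <= window_field <= 0xFFFF:
        raise ValueError("window field is 16 bit")
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("invalid shift count")
    return window_field << shift

print(effective_window(0xFFFF, 0))   # without scaling: 65535 bytes
print(effective_window(0xFFFF, 2))   # window counted in 32-bit words
```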

Extension SYN Cookies

Multipath TCP (MPTCP)

Motivation

🎯 Goal: Extension of TCP for parallel usage of multiple paths within a single TCP connection

Improves reliability

Increases performance

Important requirements

Application compatibility

Network compatibility

Challenges

Middleboxes

Connection vs. Subflow

MPTCP connection

Communication relation between sender and receiver

Consists of one or multiple MPTCP subflows

MPTCP subflow

Flow of TCP segments operating over an individual path

Started and terminated like a „regular“ TCP connection

Started with 3-way handshake

Closed with FIN or RST

Can be dynamically added and removed to/from an MPTCP connection

Embedding into Protocol Stack

Connection Establishment

3-way handshake of TCP

TCP option MP_CAPABLE

X, Y: token for client and server

Identification for subsequent addition/removal of subflows

Adding a Subflow

TCP option MP_JOIN

3-way handshake of TCP

Use tokens exchanged during MPTCP connection establishment

Sequence Numbers

Each MPTCP segment carries two sequence numbers

Data sequence number for overall MPTCP connection

Subflow sequence number for individual flow

Each subflow has coherent sequence numbers without „holes“

Congestion Control

🎯 Goals of MPTCP

Improve throughput

Multipath flow should perform at least as well as a single path congestion control would on the best available path

Do not harm

Multipath flow should not take up more capacity from any of the resources shared than if it were a single flow

Balance congestion

A multipath flow should have as much traffic as possible off its most congested paths

Congestion Control algorithm only applies to increase phase of congestion avoidance

Unchanged: slow start, fast retransmit, fast recovery and multiplicative decrease

Different congestion windows

$CWnd\_i$ per subflow $i$

$CWnd\_{total}$ per MPTCP connection (multipath flow)

Assumption: Congestion window maintained in bytes

Basic approach: Couple congestion control of different subflows

Linked increase (congestion avoidance)

For each ACK received on subflow $i$, increase $CWnd\_i$ by
$$ \min \left( \underbrace{\frac{\alpha \cdot \text{bytes}\_{\text{acked}} \cdot MSS\_{i}}{CWnd\_{\text{total}}}}\_{\text{Increase for multipath subflow}}, \underbrace{\frac{\text{bytes}\_{\text{acked}} \cdot MSS\_{i}}{CWnd\_{i}}}\_{\text{Increase “regular” TCP would get in same scenario}}\right) $$
(any multipath subflow cannot be more aggressive than a TCP flow in the same circumstances (do not harm))

$\alpha$: Describes aggressiveness of multipath flow $$ \alpha=C W n d\_{\text {total }} \cdot \frac{\max \_{i}\left(\frac{C W n d\_{i}}{R T T\_{i}^{2}}\right)}{\left(\sum \frac{C W n d\_{i}}{R T T\_{i}}\right)^{2}} $$
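The two formulas can be evaluated directly. A minimal sketch with illustrative names; congestion windows are given in bytes and RTTs in seconds.

```python
def lia_alpha(cwnd: list, rtt: list) -> float:
    """Aggressiveness factor alpha of the multipath flow."""
    total = sum(cwnd)
    best = max(c / r ** 2 for c, r in zip(cwnd, rtt))
    denominator = sum(c / r for c, r in zip(cwnd, rtt)) ** 2
    return total * best / denominator

def linked_increase(i: int, cwnd: list, rtt: list, bytes_acked: int, mss: int) -> float:
    """Increase of CWnd_i (in bytes) for one ACK received on subflow i."""
    alpha = lia_alpha(cwnd, rtt)
    coupled = alpha * bytes_acked * mss / sum(cwnd)
    uncoupled = bytes_acked * mss / cwnd[i]   # what a regular TCP flow would get
    return min(coupled, uncoupled)

print(lia_alpha([10000, 10000], [0.5, 0.5]))
print(linked_increase(0, [10000, 10000], [0.5, 0.5], bytes_acked=1000, mss=1000))
```

For two identical subflows $\alpha = 0.5$, and each subflow grows by a quarter of what a single TCP flow would add, so the multipath flow as a whole is less aggressive than two independent TCP flows.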

TCP in Networks with High BDP

Scalability Issues

It can take very long until the available data rate is fully utilized

Cause

Very conservative behavior of congestion avoidance

Congestion window grows by one MSS per RTT

Slow window growth in congestion avoidance causes low average data rate

➡️ NOT efficient in networks with high bandwidth-delay products

Require faster increase of the congestion window in congestion avoidance

Faster Increase of Congestion Window

🎯 Goals

High resource utilization in networks with high bandwidth delay product

Quick reactions to changes of the situation within the network

Fairness with respect to other TCP variants

Different types of fairness

intra protocol fairness

All senders use same TCP variant

Goal: All flows should achieve same data rate

With new TCP variants: inter protocol fairness

Furthermore: RTT fairness

Fairness among TCP flows with different RTTs

CUBIC TCP

🎯 Goals

Provide simple algorithm for networks with high bandwidth-delay product

TCP-friendly

Behaves like standard TCP (i.e., TCP Reno) in networks with short RTTs and small bandwidth

Congestion avoidance

Applies cubic function instead of linear window increase

Performance should not be worse than TCP Reno

In comparison to TCP Reno

Better RTT fairness (Window growth independent of RTT)

Better scalability to high data rates

Currently default congestion control in all major operating systems

Congestion Window Increase

Independent from RTT

Use of actual time $t$ that has passed since last congestion incident. I.e. Window growth depends on time between consecutive congestion events

Apply cubic function
$$ W(t)=C(t-K)^{3}+W\_{\max } \quad \text { with } K=\sqrt[3]{\frac{W\_{\max }(1-\beta)}{C}} $$

$C$: predefined constant that determines aggressiveness of increase

$W\_{max}$: congestion window size at latest congestion incident

$K$: time period that it takes to increase current window to $W\_{max}$ (in case of no further congestions)

$\beta$: multiplicative decrease of congestion window

$\beta = 0.5$ for TCP-Reno

$\beta = 0.7$ for CUBIC TCP
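The shape of the cubic function can be explored with a few lines. The value $C = 0.4$ is an assumption (a commonly used default); windows are measured in segments and time in seconds.

```python
def cubic_k(w_max: float, beta: float = 0.7, c: float = 0.4) -> float:
    """Time needed to grow back to w_max after a congestion incident."""
    return (w_max * (1 - beta) / c) ** (1 / 3)

def cubic_window(t: float, w_max: float, beta: float = 0.7, c: float = 0.4) -> float:
    """Window size t seconds after the last congestion incident."""
    return c * (t - cubic_k(w_max, beta, c)) ** 3 + w_max

for t in (0, 2, 4, 6, 8):
    print(t, round(cubic_window(t, w_max=100), 1))
```

Two checks follow directly from the formula: $W(0) = \beta \cdot W\_{max}$ (the window right after the decrease) and $W(K) = W\_{max}$.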

Congestion Window over Time

Example

Three CUBIC Modes

TCP-friendly region

Ensures that CUBIC achieves at least same data rate as standard TCP in networks with small RTT

Observation: in networks with small RTTs, CUBIC's congestion window grows slower than with TCP Reno

Approach: “emulation” of TCP Reno (which uses AIMD)

$AIMD(\alpha, \beta)$

$\alpha$: additive increase factor
$$ W = W + \alpha $$

$\beta$: multiplicative decrease factor
$$ W = \beta \cdot W $$

TCP Reno uses $AIMD(1, \frac{1}{2})$

TCP-fair increment
$$ \alpha=3 \cdot \frac{1-\beta}{1+\beta} $$

Achieves same $W\_{avg}$ as $AIMD(1, \frac{1}{2})$

Average data rate of AIMD
$$ W\_{avg} = \frac{1}{R T T} \sqrt{\frac{\alpha \cdot(1+\beta)}{2 \cdot(1-\beta) \cdot p}} $$

$p$: loss rate

Window size of emulated TCP at time $t$
$$ W\_{T C P}=W\_{\max } \cdot \beta+\frac{3 \cdot(1-\beta)}{1+\beta} \cdot \frac{t}{R T T} $$

Recall window size of TCP cubic
$$ W(t)=C(t-K)^{3}+W\_{\max } $$

$\Rightarrow$ Rule

If $W\_{Cubic} < W\_{TCP}$, then $CWnd$ is set to $W\_{TCP}$ each time an ACK is received

otherwise, $CWnd$ is set to $W\_{Cubic}$ each time an ACK is received
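The rule amounts to taking the larger of the two windows. A sketch under the same assumptions as the formulas above ($C = 0.4$ is an assumed default, windows in segments, time in seconds):

```python
def tcp_fair_alpha(beta: float) -> float:
    """Additive increase that gives the same average window as AIMD(1, 1/2)."""
    return 3 * (1 - beta) / (1 + beta)

def w_tcp(t: float, rtt: float, w_max: float, beta: float = 0.7) -> float:
    """Window of the emulated TCP at time t after the last congestion incident."""
    return w_max * beta + tcp_fair_alpha(beta) * t / rtt

def w_cubic(t: float, w_max: float, beta: float = 0.7, c: float = 0.4) -> float:
    k = (w_max * (1 - beta) / c) ** (1 / 3)
    return c * (t - k) ** 3 + w_max

def target_window(t: float, rtt: float, w_max: float,
                  beta: float = 0.7, c: float = 0.4) -> float:
    """TCP-friendly region if the emulated TCP window is larger."""
    return max(w_cubic(t, w_max, beta, c), w_tcp(t, rtt, w_max, beta))

print(target_window(1.0, rtt=0.01, w_max=100))  # small RTT: emulated TCP wins
print(target_window(1.0, rtt=0.5, w_max=100))   # large RTT: cubic function wins
```

For $\beta = \frac{1}{2}$ the TCP-fair increment is exactly 1, which matches TCP Reno.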

Concave region: $CWnd < W\_{max}$ and not in TCP-friendly region

For each received ACK $$ CWnd = CWnd+\frac{W\_{cubic}(t+R T T)-CWnd}{C W n d} $$

Convex region: $CWnd > W\_{max}$ and not in TCP-friendly region

$CWnd$ is increased very carefully

searching for new $W\_{max}$

TCP and Response Time

Basic Issue

Response time

Time between initiation of a TCP connection and receipt of the requested data

Important components

Handshake of TCP connection establishment

Slow start

Transmission of the object

Macroscopic Model

Response time without applying congestion control

After 1st RTT: Client sends object request

After 2nd RTT

Client begins to receive object data

Receiver needs
$$ t = \frac{\text{object size } O}{\text{data rate } D} $$

$\Rightarrow$ lower bound:
$$ \text{Response time} \geq 2 RTT + \frac{O}{D} $$
( With small objects, response time dominated by $RTT$s)
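A short calculation illustrates how strongly the RTT dominates for small objects. The example values are assumptions chosen for illustration.

```python
def response_time_lower_bound(rtt_s: float, object_bits: float, rate_bps: float) -> float:
    """Lower bound 2 * RTT + O / D, without congestion control."""
    return 2 * rtt_s + object_bits / rate_bps

# 100 kbit object, 10 Mbit/s, RTT of 50 ms
total = response_time_lower_bound(0.05, 100_000, 10_000_000)
print(round(total, 3))
```

In this example the transmission itself takes 10 ms, while the two RTTs contribute 100 ms.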

Used Variables

$RTT$: round trip time [Seconds]

$MSS$: maximum segment size [bit]

$W$: Size of congestion window [MSS], given as multiples of MSS

$O$: Size of object that has to be transferred [bit]

$D$: Data rate [bit/s]

Observation

$RTT$s have significant influence on response time

On connection establishment: 2 $RTT$s until reception of object begins

During object transmission

Small windows create pauses: waiting for ACKs

The majority of TCP connections in the Web have short lifetimes

$\rightarrow$ Slow start has significant impact on response time

🎯 Goals

Avoid „empty“ RTTs without data transport

Reduce RTTs needed for slow start

Bigger Initial Congestion Window

💡 Idea: Increase initial congestion window (IW)

at least 10 segments, thus, about 15 Kbytes
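The benefit of a larger initial window can be estimated by counting slow-start rounds. The sketch assumes an idealized slow start: the window doubles every RTT, no losses occur, and delayed ACKs are ignored.

```python
def slow_start_rounds(segments: int, initial_window: int) -> int:
    """Number of RTTs needed to send the given number of segments."""
    sent, cwnd, rounds = 0, initial_window, 0
    while sent < segments:
        sent += cwnd      # one window per RTT
        cwnd *= 2         # window doubles in slow start
        rounds += 1
    return rounds

print(slow_start_rounds(30, 3), slow_start_rounds(30, 10))
```

For an object of 30 segments, an initial window of 3 segments needs 4 rounds, while 10 segments need only 2.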

TCP Fast Open

🎯 Goal: Reduce delays that precede the transmission of an object

TCP Cookie

Goal

Avoid DoS attacks

Disallow sending data within first SYN segment of first connection establishment to a server

Establish cookie for subsequent connections

Use cookie $\rightarrow$ avoid state keeping at server

Basic steps

Client requests TFO cookie from server

Client uses TFO cookies in subsequent TCP connections
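How a cookie avoids state keeping at the server can be sketched as follows. The construction is an assumption for illustration: the cookie is a truncated HMAC over the client IP address, computed with a secret known only to the server, so the server can validate a cookie without storing it.

```python
import hashlib
import hmac

def make_cookie(secret: bytes, client_ip: str) -> bytes:
    """Cookie handed to the client on its first connection (hypothetical scheme)."""
    return hmac.new(secret, client_ip.encode(), hashlib.sha256).digest()[:8]

def cookie_is_valid(secret: bytes, client_ip: str, cookie: bytes) -> bool:
    """Recompute the cookie from the SYN's source address and compare."""
    return hmac.compare_digest(make_cookie(secret, client_ip), cookie)
```

Only if the cookie in a later SYN is valid does the server accept the data carried in that SYN.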

HTTP/2

QUIC

Access Networks

Wed, 24 Mar 2021 00:00:00 +0000

Introduction

Circuit Switching

„Circuit“

Logical circuit with reserved resources for data transmission

no physical cable!

No meta data (header, appendix) required during data exchange

No buffer overflows in intermediate systems!

But: possibly bad resource utilization

Use case: telephone network

ISDN

ISDN summary

ISDN = Integrated Services Digital Network

🎯 Goals

Digital up to the subscriber

Integration of different services (e.g., voice, data, images)

Offering additional services

Redialing

Direct call

Automatic call-back if receiver access is busy

Re-direction of calls

…

Architecture

Clear Separation of Access and Network

Example Topology

Simplified Architecture at Subscriber Interface

Network Termination (NT)

Termination of technical transmission

Of network ($U\_{k0}$ interface)

Of subscriber installation ($S\_0$ interface)

Power supply for subscriber installation

Detect frame errors

Local telephone switch

Media access to signaling channel (D channel, layer 2)

Signaling at layer 3

…

Adaptor: Provide ISDN functionality for non-ISDN capable device

ISDN Subscriber Interface

Basic access

2 ∗ 64kbit/s+16kbit/s ($2 ∗ 𝐵 + 𝐷\_{16}$)

Two types of logical channels

B channel: data transfer

D channel: signaling traffic

B channel

User data transmission

Data rate: 64 kbit/s

Two B channels available

Operate independent of each other

Can transmit in different directions

Can transmit different data types (voice, images, …)

Do not have to (but can) be active at the same time

Medium access

Fixed

Time slots are associated with either B channel

D Channels

Signaling (establish B channel between end systems)

Data rate: 16 kbit/s

Bidirectional communication: end system <–> network termination

Medium access

E(cho) channel

Data rate: 16 kbit/s

Unidirectional communication: network termination –> end system

Required for medium access

Carrier sensing (CS)

Collision detection (CD)

Channels and Layering

Subscriber installation

B channels

Layer 1 standardized

Layers 2-7 usage dependent

D channel: Layers 1-3 standardized

Subscriber Interface

Subscriber Interface $S\_0$

Four-wire transmission

One twin conductor per direction

Simplex operation, both directions separated

Multiplexing at $S\_0$ interface

Space division multiplex: Separation of directions

Time division multiplex: Frame structure ($S\_0$ frames)

Bus Topology at $S\_0$ Interface

Each end system has two connections to the bus

In direction to network termination: write access

In direction to end system: read access

$S\_0$ Frames

Time division multiplex in both directions

End system –> network termination

End system <– network termination

NT mirrors D channel into echo channel of incoming $S\_0$ frames

Channel Encoding

Inverse AMI code (0 “overwrites” 1)

0: alternating positive and negative level, held over the whole clock interval

1: represented by 0 level
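The encoding rule above can be sketched in a few lines of Python. This is only an illustration: the function names and the choice of a positive level for the first binary 0 are assumptions, not part of the ISDN notes.

```python
def inverse_ami_encode(bits):
    """Map bits to line levels (+1, 0, -1) using inverse AMI.

    Binary 1 -> level 0; binary 0 -> alternating +1 / -1.
    The starting polarity (+1) is an assumption made for this sketch.
    """
    levels = []
    polarity = 1  # polarity to use for the next binary 0
    for bit in bits:
        if bit == 1:
            levels.append(0)
        else:
            levels.append(polarity)
            polarity = -polarity  # alternate for the next binary 0
    return levels


def inverse_ami_decode(levels):
    """Any non-zero level is a binary 0, level 0 is a binary 1."""
    return [1 if level == 0 else 0 for level in levels]


bits = [0, 1, 0, 0, 1, 1, 0]
levels = inverse_ami_encode(bits)
print(levels)  # [1, 0, -1, 1, 0, 0, -1]
```

Because a binary 1 is the zero level, an idle line reads as a run of ones, and any station that drives a binary 0 dominates the bus.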

D Channel: Medium Access

Systems access the D channel independently of each other

E.g., to establish a connection

CSMA/CD based approach

Check medium (echo channel as mirror of D channel)

Free when no activity is visible for a duration of 8 bits

Protocol on layer 2 in D channel is variant of HDLC

Format of an HDLC frame

Delimited by flag (01111110)

Bit stuffing to preserve data transparency for higher layers

After 5 consecutive binary “1”s, the sender inserts a binary “0”

This happens in between the flags

After 5 consecutive binary “1”s, the receiver removes the following binary “0”

Bit stuffing is done when sending the bit stream

Calculate checksum before bit stuffing

“Inverse” bit stuffing (removal of the stuffed bits) when receiving the bit stream

Verify checksum after “inverse” bit stuffing
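A minimal Python sketch of the bit-stuffing rule described above. The function names are hypothetical, the checksum is left out, and the receiver side assumes it is handed only the bits between the two flags.

```python
FLAG = [0, 1, 1, 1, 1, 1, 1, 0]  # frame delimiter 01111110


def stuff(bits):
    """Sender side: insert a 0 after every run of five consecutive 1s."""
    out, run = [], 0
    for bit in bits:
        out.append(bit)
        run = run + 1 if bit == 1 else 0
        if run == 5:
            out.append(0)  # stuffed bit
            run = 0
    return out


def unstuff(bits):
    """Receiver side: drop the bit that follows five consecutive 1s."""
    out, run, skip = [], 0, False
    for bit in bits:
        if skip:           # this is the stuffed 0, discard it
            skip, run = False, 0
            continue
        out.append(bit)
        run = run + 1 if bit == 1 else 0
        if run == 5:
            skip = True
    return out


def build_frame(payload):
    """Flags are added after stuffing, so they are never stuffed themselves."""
    return FLAG + stuff(payload) + FLAG


payload = [0, 1, 1, 1, 1, 1, 1, 0]  # payload that looks like a flag
print(stuff(payload))               # [0, 1, 1, 1, 1, 1, 0, 1, 0]
```

Since the stuffed payload never contains six consecutive ones, the flag pattern can only appear at the frame boundaries.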

8 bit times without activity on the D channel correspond to 8 ones (inverse AMI code)

Send: 1-persistent

Collision detection through sending system

Systems listen on E channel while sending

Is the signal received on the E channel different from the one sent on the D channel?

0 overwrites 1

Detecting system aborts sending and continues to check medium

No further bit is sent on the D channel

No exponential backoff

The other system does not notice anything and continues sending successfully
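The collision resolution on the D channel can be illustrated with a small simulation. It is a sketch under simplifying assumptions that are not in the notes: all stations start at the same bit time, stay bit-synchronous, and send frames of equal length; the function name `arbitrate` is hypothetical.

```python
def arbitrate(frames):
    """Simulate D-channel access with 0 overwriting 1.

    frames: list of bit lists, one per station.
    Returns (set of stations still sending at the end, echo channel bits).
    """
    active = set(range(len(frames)))
    echo = []
    for pos in range(min(len(f) for f in frames)):
        sent = {s: frames[s][pos] for s in active}
        bus = min(sent.values())  # any 0 on the bus dominates
        echo.append(bus)          # NT mirrors the D channel into the E channel
        # a station that sent 1 but reads back 0 aborts immediately
        active = {s for s in active if sent[s] == bus}
    return active, echo


frames = [
    [0, 1, 1, 0],  # station 0
    [0, 1, 0, 1],  # station 1
]
winners, echo = arbitrate(frames)
print(winners, echo)  # {1} [0, 1, 0, 1]
```

Station 0 backs off at the third bit, while the echo channel carries station 1's frame unchanged, so the winning frame is not damaged by the collision.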

DSL

DSL = Digital Subscriber Line

🎯 Goal

High-performance solution for the subscriber connection

Support data services with higher data rates

“Invariant”: Twin conductor at the U interface = connection to the customer premises

Categories

ADSL (Asymmetric DSL)

Follows the typical communication model of the WWW

A lot of data is received from the server

Much less data is sent to the server

Downstream and upstream data rates are asymmetric

Downstream (From server to subscriber): 768 kbit/s – 8 Mbit/s

Upstream (From subscriber to server): 128 kbit/s – 576 kbit/s

Subscriber connection

Splitter

Separates the signal into a telephone signal and a data signal

Required at the subscriber as well as in the telephone switch

Works passively: the telephone signal stays available even when the splitter fails

Copper twin conductor

Between splitters at subscriber and telephone switch

DSLAM

DSL Access Multiplexer

Counterpart to DSL modem at subscriber

SDSL (Symmetric DSL)

Mainly used by business customers

Most often much more expensive than ADSL

Only data, i.e., no parallel phone calls possible

Data Transmission at DSL Access

Frequency Multiplexing

Different frequencies for

Telephony

DSL upstream

DSL downstream

Sources of Signal Disturbance

Attenuation: primarily influenced by three parameters

Distance, interference, cable diameter

Attenuation decreases with increasing cable diameter

–> A larger diameter permits higher data rates over the same distance

Crosstalk

Interference between sender and receiver

Interference between senders –> Only some twin conductors of a cable bundle can be used for ADSL

ADSL2, VDSL2

DSL Access Network

Basic configuration

BRAS: Broadband Remote Access Server

Part of the ISP's core network

Tasks

Routes traffic to/from broadband access devices (e.g., DSLAM)

Aggregates traffic of multiple DSLAMs

Can support policy management, quality-of-service

Provides layer-2 connectivity

Provides layer-3 connectivity

Interfaces to AAA (Authentication, Authorization, Accounting)

Assigns IP addresses to clients

Setting up an ADSL Connection

The provider is also the network operator: use PPP (Point-to-Point Protocol)

Establishment phase –> LCP (Link Control Protocol)

Set up PPP connection

Negotiate connection parameters

Data rate, used carriers

Negotiate authentication method

Negotiate the Data Rate

Fixed rate

Data rate is set to fixed value

Contains “safety margin”

Adaptive rate

Negotiate the maximum reachable data rate

Authentication phase

Authentication based on negotiated method

Network phase

Assignment of IP address

Announcing address of the DNS server

Provider uses DSL resale link

Sequence

Abort previous sequence in the authentication phase

Only at this point is it known that the subscriber is a customer of a different provider

Thereafter

Forwarding all data to other provider

Restart complete sequence

Further Access Technologies

Cable TV Network

Initially only designed for TV and broadcast transmission

Today also usable for telephony and Internet

Topology

Initially pure tree topology with coaxial cables

Today a combination of optical fiber and coaxial cables

Configuration at household

Architecture

CMTS: Cable Modem Termination System

From hub to households

Data transfer

Downstream

Broadcast: all subscribers receive same signal

Cable modem filters out “own” packets

Upstream

Access to channels is controlled by time-division multiplexing (time slots)

Time slots are assigned by CMTS in the head-end

Shared medium: Reachable data rate depends on number of concurrent users

Powerline

Discrete-Valued Systems

Fri, 27 May 2022 00:00:00 +0000

Discrete-Valued and Discrete-Time Systems

Fri, 03 Jun 2022 00:00:00 +0000

Preliminaries

Signals in continuous and discrete time

Continuous time

Time is a continuous variable

The signal $s(t)$ takes a particular value $s^*(t^*)$ only for an arbitrarily short time span

Between any two points in time $t_1$ and $t_2$ there are infinitely many points in time $t_1 \leq t \leq t_2$

Values can be continuous or discrete

Continuous in time and value $\rightarrow$ analog signal

Discrete time

Discrete points in time $t_k, k \in \mathbb{Z}$
$$ s_k := s(t_k) $$

The temporal arrangement of the $t_k$ is arbitrary, but in many cases equidistant
$$ t_k = k \cdot \Delta \quad k \in \mathbb{Z} $$

Values can be continuous or discrete

Discrete in time and value $\rightarrow$ digital signal

Signals can be inherently discrete-time, or they can result from sampling continuous-time signals.

Categorical and Cardinal Variables

Categorical variable

Nominal

The nominal scale is made up of pure labels.

The only meaningful question to ask is whether two variables have the same value: the nominal scale only allows comparing two values w.r.t. equivalence.

There is no meaningful transformation besides relabeling.

No empirical operation is permissible, i.e., there is no mathematical operation of nominal features that is also meaningful in the material world.

A typical example is the sex of a human.

The two possible values can be written either as “f” vs. “m” or as “female” vs. “male”. The labels are different, but the meaning is the same.

Although nominal values are sometimes represented by digits, one must not interpret them as numbers.

For example, the postal codes used in Germany are digits, but there is no meaning in, e.g., adding two postal codes.

Similarly, nominal features do not have an ordering, i.e., the postal code 12345 is not “smaller” than the postal code 56789. Of course, most of the time there are options for how to introduce some kind of lexicographic sorting scheme, but this is purely artificial and has no meaning for the underlying objects. With respect to statistics, the permissible average is not the mean (since summation is not allowed) or the median (since there is no ordering), but the mode, i.e., the most common value in the dataset.

Ordinal

The ordinal scale allows comparing values w.r.t. equivalence and rank.

Any transformation of the domain must preserve the order, which means that the transformation must be strictly increasing.

But there is still no way to add an offset to one value in order to obtain a new value or to take the difference between two values.

Example: school grades.

In the German grading system, the grade 1 (“excellent”) is better than 2 (“good”), which is better than 3 (“satisfactory”) and so on.

But quite surely the difference in a student’s skills is not the same between the grades 1 and 2 as between 2 and 3, although the “difference” in the grades is unity in both cases.

In addition, teachers often report the arithmetic mean of the grades in an exam, even though the arithmetic mean does not exist on the ordinal scale. In consequence, it is syntactically possible to compute the mean, even though the result, e.g., 2.47 has no place on the grading scale, other than it being “closer” to a 2 than a 3. The Anglo-Saxon grading system, which uses the letters “A” to “F”, is somewhat immune to this confusion.

The correct average involving an ordinal scale is obtained by the median.

Cardinal variable

Interval

The interval scale allows adding an offset to one value to obtain a new one, or to calculate the difference between two values—hence the name.

However, the interval scale lacks a naturally defined zero. Values from the interval scale are typically represented using real numbers, which contain the symbol “0,” but this symbol has no special meaning and its position on the scale is arbitrary. For this reason, the scalar multiplication of two values from the interval scale is meaningless. Permissible transformations preserve the order, but may shift the position of the zero.

Ratio

The ratio scale has a well defined, non-arbitrary zero, and therefore allows calculating ratios of two values.

This implies that there is a scalar multiplication and that any transformation must preserve the zero.

Many features from the field of physics belong to this category and any transformation is merely a change of units.

Absolute

The absolute scale shares these properties, but is equipped with a natural unit and features of this scale can NOT be negative. In other words, features of the absolute scale represent counts of some quantities. Therefore, the only allowed transformation is the identity.

Discrete-Valued Systems

Static systems

Input/output: random variables $u_k$ (input) and $y_k$ (output), $k \in \mathbb{N}_0$

$u_k$ and $y_k$ are discrete-valued, where w.l.o.g.
$$ \begin{array}{l} u_{k} \in\{1,2, \ldots, P\} \\ y_{k} \in\{1,2, \ldots, M\} \end{array} $$
Stochastic dependence of $y_k$ on $u_k$:
$$ P\left(y_{k}=i \mid u_{k}=j\right) \qquad j \in\{1, \ldots, P\}, i \in\{1, \ldots, M\} $$
Arrangement of the probabilities in the matrix $\mathbf{A}_k$:
$$ \mathbf{A}_{k}=\left(\begin{array}{ccc} P\left(y_{k}=1 \mid u_{k}=1\right) & \cdots & P\left(y_{k}=M \mid u_{k}=1\right) \\ \vdots & & \vdots \\ P\left(y_{k}=1 \mid u_{k}=P\right) & \cdots & P\left(y_{k}=M \mid u_{k}=P\right) \end{array}\right) $$

Elements $\geq 0$

Row sums $= 1$

Occurrence probabilities as vectors:
$$ \eta_{k}^{u}=\left(\begin{array}{c} P\left(u_{k}=1\right) \\ P\left(u_{k}=2\right) \\ \vdots \\ P\left(u_{k}=P\right) \end{array}\right) \qquad \eta_{k}^{y}=\left(\begin{array}{c} P\left(y_{k}=1\right) \\ P\left(y_{k}=2\right) \\ \vdots \\ P\left(y_{k}=M\right) \end{array}\right) $$

Computation of $\eta_k^y$ from $\eta_k^u$ (in vector-matrix form):
$$ \eta_{k}^{y}=\mathbf{A}_{k}^{\top} \eta_{k}^{u} $$

Details

$$ \begin{aligned} P\left(y_{k}=i\right) &=\sum_{j=1}^{P} P\left(y_{k}=i, u_{k}=j\right) \\\\ &=\sum_{j=1}^{P} P\left(y_{k}=i \mid u_{k}=j\right) \cdot P\left(u_{k}=j\right) \end{aligned} $$

Special case: $u_k = j^*$ is known, i.e.
$$ \begin{array}{l} P\left(u_{k}=j^{*}\right)=1 \\ P\left(u_{k}=j\right)=0 \quad j=1, \ldots, P, \; j \neq j^{*} \end{array} $$ $$ \begin{aligned} \Rightarrow \quad P\left(y_{k}=i\right) &=\sum_{j=1}^{P} P\left(y_{k}=i \mid u_{k}=j\right) P\left(u_{k}=j\right) \\ &=P\left(y_{k}=i \mid u_{k}=j^{*}\right) \end{aligned} $$
In vector-matrix form:
$$ \eta_{k}^{y}={\underbrace{\mathbf{A}_{k}\left(j^{*}, :\right)}_{\text{the } j^*\text{-th row of } \mathbf{A}_k}}^\top=\left(P\left(y_{k}=1 \mid u_{k}=j^{*}\right) \cdots P\left(y_{k}=M \mid u_{k}=j^{*}\right)\right)^{\top} $$
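As a numerical illustration of $\eta_k^y = \mathbf{A}_k^\top \eta_k^u$, here is a minimal pure-Python sketch. The function name `propagate` and all numbers are made up for the example; the matrix is stored row-wise, so `A[j][i]` stands for $P(y_k = i+1 \mid u_k = j+1)$ with 0-based indices.

```python
def propagate(A, eta_u):
    """Output distribution eta_y = A^T eta_u for a static discrete-valued system.

    A[j][i] = P(y = i | u = j) (each row sums to 1), eta_u[j] = P(u = j).
    """
    P, M = len(A), len(A[0])
    return [sum(A[j][i] * eta_u[j] for j in range(P)) for i in range(M)]


# Hypothetical system with P = 2 input values and M = 3 output values
A = [
    [0.9, 0.1, 0.0],
    [0.2, 0.5, 0.3],
]
eta_u = [0.25, 0.75]
eta_y = propagate(A, eta_u)
print(eta_y)       # approximately [0.375, 0.4, 0.225]

# Special case: the input is known to be the second value,
# so the result is simply the second row of A
eta_y_known = propagate(A, [0.0, 1.0])
print(eta_y_known)  # [0.2, 0.5, 0.3]
```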
Dynamic systems

The current output $y_k$ depends on

the current input $u_k$

the current state $x_k$

Decomposition of the dynamic system into two parts

System mapping (dynamic part): describes the temporal evolution of the state $x_k$

Measurement mapping (static part): describes the mapping from the state $x_k$ (and possibly the current input $u_k$) to the output $y_k$

System mapping

Random variable $x_k, k \in \mathbb{N}_0$ with $x_k \in \{1, 2, \dots, N\}$

The evolution of the state $x_k$ is described by
$$ P(x_{k+1}=i | x_k, \dots, x_1, x_0, u_k) $$
($u_k$ is often not written explicitly)

Definition

$x_k$ is a (first-order) Markov chain if
$$ P\left(x_{k+1}=i \mid x_{k}, \ldots, x_{1}, x_{0}, u_{k}\right)=P\left(x_{k+1}=i \mid x_{k}, u_{k}\right) $$

The future evolution $x_{k+1}$ is conditionally independent of the past states $x_{k-1}, \dots, x_1, x_0$, given that the current state $x_k$ is known

Simplified transition probability
$$ P(x_{k+1} = j| x_k = i) $$

Definition

A Markov chain is called time-homogeneous (or, in general, time-invariant) if the transition probabilities do not depend on the time index, i.e.
$$ P\left(x_{k+1}=j \mid x_{k}=i\right)=\mathbf{A}(i, j) $$
Transition matrix (time-homogeneous):
$$ \mathbf{A}=\left(\begin{array}{cccc} A(1,1) & A(1,2) & \ldots & A(1, N) \\\\ A(2,1) & A(2,2) & \cdots & A(2, N) \\\\ \vdots & \vdots & & \vdots \\\\ A(N, 1) & A(N, 2) & \cdots & A(N, N) \end{array}\right) $$

Definition

A square matrix $\mathbf{A}$ is called a Markov matrix if

all elements are non-negative
$$ A(i, j) \geq 0 \quad \text{ for } i, j \in \\{1, \dots, N\\} $$

every row sums to 1
$$ \sum_{j=1}^{N} A(i, j)=1 \quad \text{for } i \in \\{1, \dots, N\\} $$
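The two conditions of the definition are easy to check programmatically. The following is a small sketch; the function name `is_markov_matrix` and the tolerance parameter are assumptions introduced for this example.

```python
def is_markov_matrix(A, tol=1e-12):
    """Check that A is square, non-negative, and that every row sums to 1."""
    n = len(A)
    if any(len(row) != n for row in A):
        return False  # not square
    non_negative = all(a >= 0 for row in A for a in row)
    rows_sum_to_one = all(abs(sum(row) - 1.0) <= tol for row in A)
    return non_negative and rows_sum_to_one


print(is_markov_matrix([[0.9, 0.1], [0.5, 0.5]]))  # True
print(is_markov_matrix([[0.9, 0.2], [0.5, 0.5]]))  # False, first row sums to 1.1
```

Note that only the rows have to sum to 1, because row $i$ holds the distribution of $x_{k+1}$ given $x_k = i$; the columns need not sum to 1.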

Graphical representation of a Markov chain:

e.g., $N=2, x_k \in \\{1, 2\\}$

Measurement mapping

The state is typically NOT directly available (latent variable)

Measurement mapping from the state $x_k$ and the current input $u_k$ to the current output $y_k$
$$ P\left(y_{k}=j \mid x_{k}=i, u_{k}=m\right) $$

$u_k$ is often not written explicitly

Time-homogeneous (in general: time-invariant)
$$ P\left(y_{k}=j|x_{k}=i\right)=B(i, j) $$

Measurement/observation matrix
$$ \mathbf{B}=\left[\begin{array}{ccc} B(1,1) & \cdots & B(1, M) \\ \vdots & & \vdots \\ B(N, 1) & \cdots & B(N, M) \end{array}\right] $$

Overall Dynamic System

Hidden Markov Model

State

Value $x_k, k=1,2,\dots$

Distribution $\eta_k^x, k=1,2,\dots$

Initial state

Value $x_0$

Distribution $\eta_0^x$

Inputs

Values $u_k, k=0,1,\dots$

Distribution $\eta_k^u,k=0,1,\dots$

Outputs

Values $y_k, k=0,1,\dots$

Distribution $\eta_k^y,k=0,1,\dots$

System mapping $\mathbf{A}_k$

Measurement mapping $\mathbf{B}_k$

Graphical representation

Unrolled temporal dependence of the random variables

Markov chain (unrolled representation)

Recursive representation of the temporal dependence of the random variables

Markov chain (recursive representation)

Emphasis on transitions and probabilities

Markov chain (emphasizing transitions and probabilities)

State Estimation

Wed, 08 Jun 2022 00:00:00 +0000

Preliminaries

Bayes' rule and extended conditioning
$$ \begin{array}{l} &P(a \mid b) \cdot P(b)=P(a, b)=P(b \mid a) \cdot P(a) \\\\ \Rightarrow &P(b \mid a)=\frac{P(a | b) \cdot P(b)}{P(a)} \end{array} $$
Extended conditioning:
$$ \begin{array}{l} P(b \mid a, c) \cdot \underbrace{P(a, c)}_{P(a \mid c) \cdot P(c)}=P(a, b, c)=P(a \mid b, c) \cdot \underbrace{P(b, c)}_{P(b \mid c) \cdot P(c)} \\\\ \Rightarrow P(b \mid a, c) \cdot P(a \mid c)=P(a \mid b, c) \cdot P(b \mid c) \quad (\triangle) \\\\ \Rightarrow P(b \mid a, c)=\frac{P(a \mid b, c) \cdot P(b \mid c)}{P(a \mid c)} \end{array} $$
Notation for the dependence on the input

Dependence of the system matrices $\mathbf{A}_k$ (transition matrix) and $\mathbf{B}_k$ (measurement/observation matrix) on the input $u_k$ (4-dimensional arrays):

Notation:
$$ \begin{array}{l} &A\left(k, u_{k}, X_{k+1} = x_{k+1}, X_{k}=x_{k}\right) \\ = &A\left(k, u_{k}, x_{k+1}, x_{k}\right) \\ = &A_{k}\left(u_{k}, x_{k+1}, x_{k}\right) \\ = &A_{k}^{u_{k}}\left(x_{k+1}, x_{k}\right) \\ \end{array} $$
“At time $k$, the current state is $X_k=x_k$. What is the probability of the next state $X_{k+1}=x_{k+1}$ if the input is $u_k$?”

Time-invariant case:
$$ A(u_k, x_{k+1}, x_k) = A_{u_k}(x_{k+1}, x_k) $$
State Estimation

Goal

Reconstruction of the internal state from measurements and inputs (assumption: $\mathbf{A}_k, \mathbf{B}_k$ are known)

Internal state estimator

Problem formulation

Given

Inputs $u_k, k = 0, \dots, k_u$

Measurements $y_k, k = 1, \dots, k_y$

Wanted: reconstruction of the state

$\hat{x}_k, k = 1, \dots, k_x$ (all internal states)

$\hat{x}_{k_x}$ (the last state)

Example illustration

Paradigm: use all available data

Two important cases/phases

Prediction ($k_u + 1 = k_x > k_y$)

Predict the current state based on the last state

Filtering ($k_u + 1 = k_x = k_y$)

Update/refine the prediction using the observed measurements

Prediction

General case

Given

An estimate of the state at a time $m$ that incorporates the entire input and measurement history up to that time

Inputs $u_k$ for $k > m$

System matrices $A_k$ for $k>m$

Interpretation

From time $m+1$ on, measurements are missing. How does the system evolve based on the system model alone?

Wanted

Prediction for a later time $k>m$ for given inputs up to $k-1$
$$ P(x_k \mid y_{1:m}, u_{0: k- 1}) $$
for $x_k \in \{1, \dots, N\}$

Example:

$m = 2, k =3$

Prediction:
$$ P(x_3 \mid y_{1:2}, u_{0:2}) $$

It is important to use Bayes' rule with extended conditioning
$$ P(a, b \mid c) = P(a \mid b, c) \cdot P(b \mid c) \qquad (\ast) $$

At time $k > m$:
$$ \begin{aligned} &P\left(x_{k} \mid y_{1: m}, u_{0: k-1}\right)\\ =&P\left(x_{k} \mid y_{1}, \ldots, y_{m}, u_{0}, u_{1}, \ldots, u_{k-1}\right) \quad \mid \text{marginalization}\\ =& \displaystyle \sum_{x_{k-1}=1}^{N} P\left(x_{k}, x_{k-1} \mid y_{1: m}, u_{0: k-1}\right)\\ \overset{(\ast)}{=}&\displaystyle \sum_{x_{k-1}=1}^{N} P\left(x_{k} \mid x_{k-1}, y_{1: m}, u_{0: k-1}\right) P\left(x_{k-1} \mid y_{1: m}, u_{0: k-1}\right) \quad \mid \text{Markov}\\ =&\displaystyle \sum_{x_{k-1}=1}^{N} \underbrace{P\left(x_{k} \mid x_{k-1}, u_{k-1}\right)}_{\text {transition probability}} \cdot \underbrace{P\left(x_{k-1} \mid y_{1: m}, u_{0 : k-2}\right)}_{\text {estimate for } k-1} \quad \text { (forward recursion) } \end{aligned} $$
(The sum describes a vector-matrix multiplication. In the last step, $x_{k-1}$ does not depend on the later input $u_{k-1}$, which is why it drops out of the condition.)

Arrangement of the individual probabilities in vectors:
$$ \eta_{k \mid 1: m}^{x} = \left(\begin{array}{c} P\left(x_{k}=1 \mid y_{1: m}, u_{0: k-1}\right) \\ \vdots \\ P\left(x_{k}=N \mid y_{1: m}, u_{0: k-1}\right) \end{array}\right) \qquad \eta_{k-1 \mid 1: m}^{x}=\left(\begin{array}{c} P\left(x_{k-1}=1 \mid y_{1: m}, u_{0: k-2}\right) \\ \vdots \\ P\left(x_{k-1}=N \mid y_{1: m}, u_{0: k-2}\right) \end{array}\right) $$
Recursive prediction:

Start: estimate vector $\eta_{m \mid 1: m}^{x}$

Recursion: for $k > m$
$$ \eta_{k \mid 1: m}^{x}=\mathbf{A}_{k}^{\top} \eta_{k-1 \mid 1 : m}^{x} $$

Special case: one-step prediction ($k = m + 1$)
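The recursion can be written as a short loop. The sketch below uses plain Python lists; the function names and the numbers are hypothetical, and `A[i][j]` stands for $P(x_{k+1} = j+1 \mid x_k = i+1)$ with 0-based indices.

```python
def predict(A, eta):
    """One prediction step: eta_next = A^T eta."""
    n = len(A)
    return [sum(A[i][j] * eta[i] for i in range(n)) for j in range(n)]


def predict_n(A_seq, eta):
    """Apply the prediction step once per transition matrix in A_seq."""
    for A in A_seq:
        eta = predict(A, eta)
    return eta


# Hypothetical 2-state chain, state known to be 1 at time m
A = [[0.9, 0.1],
     [0.5, 0.5]]
eta_m = [1.0, 0.0]

eta_one_step = predict(A, eta_m)         # one-step prediction, k = m + 1
eta_two_step = predict_n([A, A], eta_m)  # k = m + 2
print(eta_one_step)  # [0.9, 0.1]
print(eta_two_step)  # approximately [0.86, 0.14]
```

Without new measurements, every further step only multiplies by another transition matrix, so the uncertainty about the state is never reduced by prediction alone.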

Concrete example

System model (time-invariant)

System mapping
$$ \mathbf{A}_{u_{k}}=\left(\begin{array}{ll} a_{1}\left(u_{k}\right) & 1-a_{1}\left(u_{k}\right) \\ a_{2}\left(u_{k}\right) & 1-a_{2}\left(u_{k}\right) \end{array}\right) \qquad a_{1}\left(u_{k}\right), a_{2}\left(u_{k}\right) \in[0,1] $$

Reminder: $A(i, j):=P\left(x_{k+1}=j \mid x_{k}=i\right)$

Measurement mapping
$$ \mathbf{B}=\left(\begin{array}{ll} b_{1} & 1-b_{1} \\ b_{2} & 1-b_{2} \end{array}\right) \qquad b_1, b_2 \in [0, 1] $$

Reminder: $B(i, j):=P\left(y_{k}=j \mid x_{k}=i\right)$

Given

Initial state estimate vector
$$ \eta_{0}^{x}=\left[\begin{array}{c} p_{0} \\ 1-p_{0} \end{array}\right] (=P(x_0)) \qquad p_{0} \in[0,1] $$
(That is: $P(x_0 = 1) = p_0, P(x_0 = 2) = 1- p_0$)

Values of the inputs $u_0, u_1, u_2$

No measurements

Wanted

Joint distribution for the times $k = 1, 2, 3$
$$ P\left(x_{1}, x_{2}, x_{3} \mid u_{0}, u_{1}, u_{2}\right)=: P\left(x_{1:3} \mid u_{0: 2}\right) $$

Distribution at time $k=3$
$$ P\left(x_{3} \mid u_{0}, u_{1}, u_{2}\right)=P\left(x_{3} \mid u_{0: 2}\right)=\eta_{3}^{x}\left(x_{3}\right) $$

So we are at time step 0 and want to make predictions for

future states $x_k, k = 1, 2, 3$

future measurements $y_k, k=1,2,3$

Factorization of the joint distribution for $k = 0, 1, 2, 3$:
$$ \begin{aligned} & P\left(x_{0: 3} \mid u_{0: 2}\right) \\ \overset{(\ast)}{=}& P\left(x_{3} \mid x_{0: 2}, u_{0: 2}\right) \cdot P\left(x_{0: 2} \mid u_{0: 2}\right) \\ \overset{\text{Markov}}{=}& P\left(x_{3} \mid x_{2}, u_{2}\right) \cdot P\left(x_{2} \mid x_{0: 1}, u_{0: 2}\right) \cdot P\left(x_{0: 1} \mid u_{0: 2}\right) \\ \overset{\text{Markov}}{=}& P\left(x_{3} \mid x_{2}, u_{2}\right) \cdot P\left(x_{2} \mid x_{1}, u_{1}\right) P\left(x_{1} \mid x_{0}, u_{0: 2}\right) \cdot P\left(x_{0} \mid u_{0: 2}\right) \\ =& P\left(x_{3} \mid x_{2}, u_{2}\right) \cdot P\left(x_{2} \mid x_{1}, u_{1}\right) \cdot P\left(x_{1} \mid x_{0}, u_{0}\right) \cdot P\left(x_{0}\right) \\ =& A_{u_{2}}\left(x_{2}, x_{3}\right) \cdot A_{u_{1}}\left(x_{1}, x_{2}\right) \cdot A_{u_{0}}\left(x_{0}, x_{1}\right) \cdot \eta_{0}^{x}\left(x_{0}\right) \end{aligned} $$
Joint distribution for $k = 1, 2, 3$:
$$ \begin{aligned} P\left(x_{1: 3} \mid u_{0: 2}\right) &=\sum_{x_{0}=1}^{2} P\left(x_{0: 3} \mid u_{0: 2}\right) \\ &=\underbrace{P\left(x_{3} \mid x_{2}, u_{2}\right)}_{=\mathbf{A}_{u_{2}}\left(x_{2}, x_{3}\right)} \cdot \underbrace{P\left(x_{2} \mid x_{1}, u_{1}\right)}_{=\mathbf{A}_{u_{1}}\left(x_{1}, x_{2}\right)} \cdot \underbrace{\sum_{x_0=1}^{2} P\left(x_{1} \mid x_{0}, u_{0}\right) \cdot P\left(x_{0}\right)}_{=P\left(x_{1} \mid u_{0}\right)=\eta_{1}^{x}\left(x_{1}\right)} \end{aligned} $$

$P\left(x_{1: 3} \mid u_{0: 2}\right)$ means: $P$ is indexed by the 3-dimensional index vector $(x_1, x_2, x_3)^\top$ (time indices $1, 2, 3$). Each component can take 2 values.

$$ \begin{aligned} \eta_{1}^{x}\left(x_{1}\right) &= \sum_{x_{0}=1}^{2} A_{u_{0}}\left(x_{0}, x_{1}\right) \cdot \eta_{0}^{x}\left(x_{0}\right)\\ &=A_{u_{0}}\left(x_{0}=1, x_{1}\right) \underbrace{p_{0}}_{=P(x_0 = 1)}+A_{u_{0}}\left(x_{0}=2, x_{1}\right) \underbrace{\left(1-p_{0}\right)}_{=P\left(x_{0}=2\right)} \quad (\text{marginalization})\\ &=\left\{\begin{array}{ll} a_{1} \cdot p_{0}+a_{2}\left(1-p_{0}\right) & x_{1}=1 \\ \left(1-a_{1}\right) p_{0}+\left(1-a_{2}\right)\left(1-p_{0}\right) & x_{1}=2 \end{array}\right. \end{aligned} $$
$$ \begin{aligned} P\left(x_{3} \mid u_{0: 2}\right)=&\displaystyle \sum_{x_{2}=1}^{2} \sum_{x_{1}=1}^{2} P\left(x_{1: 3} \mid u_{0: 2}\right)\\ =&\displaystyle \sum_{x_{2}=1}^{2} A_{u_{2}}\left(x_{2}, x_{3}\right) \underbrace{\displaystyle \sum_{x_{1} = 1}^{2} A_{u_{1}}\left(x_{1}, x_{2}\right) \eta_{1}^{x}\left(x_{1}\right)}_{=P\left(x_{2} \mid u_{0:1}\right)=\eta_{2}^{x}\left(x_{2}\right)}\\ =&\sum_{x_{2}=1}^{2} A_{u_{2}}\left(x_{2}, x_{3}\right) \cdot \eta_{2}^{x}\left(x_{2}\right)\\\\ =& \eta_{3}^{x}\left(x_{3}\right) \end{aligned} $$
$$ \begin{aligned} \eta_{3}^{x} &=\mathbf{A}_{u_{2}}^{\top} \cdot \underbrace{\eta_{2}^{x}}_{=\mathbf{A}_{u_{1}}^{\top} \cdot \eta_{1}^{x}} \\ &=\mathbf{A}_{u_{2}}^{\top} \cdot (\mathbf{A}_{u_{1}}^{\top} \cdot \underbrace{\eta_{1}^{x}}_{=\mathbf{A}_{u_{0}}^{\top} \cdot \eta_{0}^{x}})\\ &=\mathbf{A}_{u_{2}}^{\top} \cdot (\mathbf{A}_{u_{1}}^{\top} \cdot (\mathbf{A}_{u_{0}}^{\top} \cdot \eta_{0}^{x})) \quad \text { (recursive computation) } \end{aligned} $$
Prediction of the measurements for $k=1,2,3$:
$$ \begin{aligned} & P(y_{1}, y_{2}, y_{3}, x_{1: 3} \mid u_{0: 2}) \\\\ \overset{(\ast)}{=}& P\left(y_{1: 3} \mid x_{1: 3}, u_{0: 2}\right) \cdot P\left(x_{1: 3} \mid u_{0: 2}\right) \\\\ =& P\left(y_{1: 3} \mid x_{1: 3}\right) P\left(x_{1: 3} \mid u_{0: 2}\right) \\\\ =& P\left(y_{1} \mid x_{1: 3}\right) P\left(y_{2} \mid x_{1: 3}\right) P\left(y_{3} \mid x_{1: 3}\right) P\left(x_{1: 3} \mid u_{0: 2}\right) \\\\ =& P\left(y_{1} \mid x_{1}\right) \cdot P\left(y_{2} \mid x_{2}\right) \cdot P\left(y_{3} \mid x_{3}\right) P\left(x_{1: 3} \mid u_{0: 2}\right) \\\\ =& B\left(x_{1}, y_{1}\right) B\left(x_{2}, y_{2}\right) B\left(x_{3}, y_{3}\right) P\left(x_{1: 3} \mid u_{0: 2}\right) \end{aligned} $$
Prediction of the measurement for $k=3$:
$$ \begin{aligned} & P\left(y_{3}, x_{3} \mid u_{0: 2}\right) \\ \overset{(\ast)}{=}& P\left(y_{3} \mid x_{3}, u_{0: 2}\right) \cdot P\left(x_{3} \mid u_{0: 2}\right) \\ =& P\left(y_{3} \mid x_{3}\right) \cdot P\left(x_{3} \mid u_{0: 2}\right) \\ =& B\left(x_{3}, y_{3}\right) \cdot \eta_{3}^{x}\left(x_{3}\right) \end{aligned} $$
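To make the concrete example tangible, the following sketch plugs in numbers. All values ($a_1(u)$, $a_2(u)$, $b_1$, $b_2$, $p_0$, and the input sequence) are invented for illustration. The last step additionally marginalizes the joint $P(y_3, x_3 \mid u_{0:2})$ over $x_3$ to obtain the predicted measurement distribution, which is a small extension of the derivation above.

```python
def predict(A, eta):
    """eta_next = A^T eta, with A[i][j] = P(x_next = j | x = i)."""
    n = len(A)
    return [sum(A[i][j] * eta[i] for i in range(n)) for j in range(n)]


# Hypothetical input-dependent system matrices A_u (rows: a_i(u), 1 - a_i(u))
A_u = {
    0: [[0.8, 0.2], [0.3, 0.7]],
    1: [[0.5, 0.5], [0.1, 0.9]],
}
B = [[0.9, 0.1], [0.2, 0.8]]  # hypothetical measurement matrix
p0 = 0.6
inputs = [0, 1, 0]            # u_0, u_1, u_2

eta = [p0, 1.0 - p0]          # eta_0^x
history = [eta]
for u in inputs:
    eta = predict(A_u[u], eta)
    history.append(eta)
eta_1, eta_3 = history[1], history[3]

# Joint P(y_3, x_3 | u_0:2) = B(x_3, y_3) * eta_3(x_3)
joint = [[B[x][y] * eta_3[x] for y in range(2)] for x in range(2)]
# Marginalize over x_3 to get the predicted measurement distribution
p_y3 = [joint[0][y] + joint[1][y] for y in range(2)]

print(eta_1)  # approximately [0.6, 0.4]
print(eta_3)  # approximately [0.47, 0.53]
print(p_y3)   # approximately [0.529, 0.471]
```

The first entry of `eta_1` matches the closed form $a_1 p_0 + a_2 (1 - p_0) = 0.8 \cdot 0.6 + 0.3 \cdot 0.4 = 0.6$.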
Filtering (Wonham filter)

Given the prediction $P\left(x_{k} \mid y_{1: k-1}, u_{0: k-1}\right)$, what does $P\left(x_{k} \mid y_{1: k}, u_{0: k-1}\right)$ look like?

Reminder
$$ P(b \mid a, c) \cdot P(a \mid c)=P(a \mid b, c) \cdot P(b \mid c) \quad (\triangle) $$
$$ \begin{aligned} & P\left(x_{k} \mid y_{1: k}, u_{0: k-1}\right) \\ =&\quad P(\underbrace{x_{k}}_{b} \mid \underbrace{y_{k}}_{a}, \underbrace{y_{1: k-1}, u_{0: k-1}}_{c})\\\\ \overset{(\triangle)}{=}& \frac{P\left(y_{k} \mid x_{k}, y_{1: k-1}, u_{0: k-1}\right) \cdot P\left(x_{k} \mid y_{1: k-1}, u_{0: k-1}\right)}{P\left(y_{k} \mid y_{1: k-1}, u_{0: k-1}\right)}\\\\ = & \frac{\overbrace{P\left(y_{k} \mid x_{k}\right)}^{\text{likelihood}} \cdot \overbrace{P\left(x_{k} \mid y_{1: k-1}, u_{0: k-1}\right)}^{\text{one-step prediction}}}{\underbrace{P\left(y_{k} \mid y_{1: k-1}, u_{0: k-1}\right)}_{\text{normalization constant}}} \end{aligned} $$

Likelihood

$$ P\left(y_{k} \mid x_{k}\right)=B_{k}\left(x_{k}, y_{k}\right) \qquad(\text{element of the measurement matrix}) $$

Normalization constant

$$ \begin{aligned} & P\left(y_{k} \mid y_{1: k-1}, u_{0: k-1}\right) \\\\ \stackrel{\text { Margin. }}{=} & \sum_{x_{k}=1}^{N} P\left(y_{k}, x_{k} \mid y_{1: k-1}, u_{0: k-1}\right) \\\\ \overset{(\ast)}{=}& \sum_{x_{k}=1}^{N} P\left(y_{k} \mid x_{k}, y_{1: k-1}, u_{0: k-1}\right) \cdot P\left(x_{k} \mid y_{1: k-1}, u_{0: k-1}\right) \\\\ =& \sum_{x_{k} = 1}^{N} P\left(y_{k} \mid x_{k}\right) \cdot P\left(x_{k} \mid y_{1: k-1}, u_{0: k-1}\right) \end{aligned} $$

One-step prediction

$$ \eta_{k \mid 1: k-1}^{x}=\mathbf{A}_{k}^{\top} \eta_{k-1\mid1: k-1}^{x} $$
Filtering in vector-matrix form:

For $y_k = m$, form the diagonal matrix $\operatorname{diag}(\mathbf{B}(:, m))$ from the $m$-th column $\mathbf{B}(:, m)$ of the measurement matrix
$$ \begin{aligned} \eta_{k \mid 1: k}^{x} &\overset{y_k = m}{=}\frac{\operatorname{diag}(\mathbf{B}(:, m)) \cdot \eta_{k \mid 1: k-1}^{x}}{\mathbf{1}_{N}^{\top} \operatorname{diag}(\mathbf{B}(:, m)) \cdot \eta_{k \mid 1: k-1}^{x}} \\\\ &=\frac{\mathbf{B}(:, m) \odot \eta_{k \mid 1: k-1}^{x}}{\mathbf{B}(:, m)^\top \cdot \eta_{k \mid 1:k-1}^{x}} \end{aligned} $$

$\mathbf{1}_N$: vector of ones

$\odot$: element-wise multiplication

This is a fully recursive filter $\rightarrow$ Wonham filter
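One full cycle of the Wonham filter (prediction followed by the measurement update) can be sketched as follows. The function names and numbers are hypothetical, indices are 0-based, and the matrices are assumed to be time-invariant.

```python
def predict(A, eta):
    """One-step prediction: eta_pred = A^T eta."""
    n = len(A)
    return [sum(A[i][j] * eta[i] for i in range(n)) for j in range(n)]


def filter_step(B, eta_pred, m):
    """Measurement update for observation y_k = m (0-based column index).

    Numerator: B(:, m) element-wise times eta_pred.
    Denominator: B(:, m)^T eta_pred (normalization constant).
    """
    unnormalized = [B[i][m] * eta_pred[i] for i in range(len(eta_pred))]
    norm = sum(unnormalized)
    if norm == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [v / norm for v in unnormalized]


def wonham_step(A, B, eta, m):
    """Prediction followed by filtering."""
    return filter_step(B, predict(A, eta), m)


A = [[0.9, 0.1], [0.5, 0.5]]  # hypothetical transition matrix
B = [[0.9, 0.1], [0.2, 0.8]]  # hypothetical measurement matrix
eta = [0.5, 0.5]

posterior = wonham_step(A, B, eta, m=0)
print(posterior)  # approximately [0.913, 0.087]
```

Here the prediction gives $(0.7, 0.3)^\top$, and observing the first output value shifts the estimate further toward state 1, because that observation is much more likely in state 1 than in state 2.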

For an example, see here.

Continuous-Valued Linear Systems

Wed, 15 Jun 2022 00:00:00 +0000

Static and Dynamic Systems

Thu, 16 Jun 2022 00:00:00 +0000

Linearity

Given a system $S$

$$ \underline{x}_k \rightarrow \underline{y}_k \qquad k \in \mathbb{N}_0 $$
Two conditions for linearity

Scaling
$$ \underline{x}_k \rightarrow \underline{y}_k \Rightarrow A \cdot \underline{x}_k \rightarrow A \cdot \underline{y}_k $$

Superposition
$$ \begin{aligned} \underline{x}_k^1 \rightarrow \underline{y}_k^1, \quad \underline{x}_k^2 \rightarrow \underline{y}_k^2 \\ \Rightarrow \underline{x}_k^1 + \underline{x}_k^2 \rightarrow \underline{y}_k^1 + \underline{y}_k^2 \end{aligned} $$

Static systems

Inputs/outputs: random vectors $\underline{u}_k$ and $\underline{y}_k$ ($k \in \mathbb{N}_0$ is the time step)

$\underline{u}_k \in \mathbb{R}^P$ and $\underline{y}_k \in \mathbb{R}^M$ are continuous-valued

Mapping from $\underline{u}_k$ to $\underline{y}_k$ by a linear map
$$ \underline{y}_k = \mathbf{A}_k \cdot \underline{u}_k $$
where $\mathbf{A}_k \in \mathbb{R}^{M \times P}$

Description of the uncertainties in $\underline{u}_k$ and $\underline{y}_k$ by the first two moments

Expected value

$\underline{\hat{u}}_k := E\{\underline{u}_k\}$

$\underline{\hat{y}}_k := E\{\underline{y}_k\}$

Covariance matrix

$C_k^u := \operatorname{Cov}\{\underline{u}_k\}$

$C_k^y := \operatorname{Cov}\{\underline{y}_k\}$

Computation of the characteristics $\underline{\hat{y}}_k, C_k^y$ for given $\underline{\hat{u}}_k, C_k^u$
$$ \begin{aligned} \underline{\hat{y}}_{k} &=E\left\{\underline{y}_{k}\right\} \\ &=E\left\{A_{k} \cdot \underline{u}_{k}\right\} \\ &=A_{k} \cdot E\left\{\underline{u}_{k}\right\} \\ &=A_{k} \cdot \underline{\hat{u}}_{k} \\\\ C_{k}^{y} &=\operatorname{Cov}\left\{\underline{y}_{k}\right\} \\ &=E\left\{\left(\underline{y}_{k}-\underline{\hat{y}}_{k}\right)\left(\underline{y}_{k}-\underline{\hat{y}}_{k}\right)^{\top}\right\} \\ &=E\left\{A_{k}\left(\underline{u}_{k}-\underline{\hat{u}}_{k}\right)\left(\underline{u}_{k}-\underline{\hat{u}}_{k}\right)^{\top} A_{k}^{\top}\right\} \\ &=A_{k} E\left\{\left(\underline{u}_{k}-\underline{\hat{u}}_{k}\right)\left(\underline{u}_{k}-\underline{\hat{u}}_{k}\right)^{\top}\right\} A_{k}^{\top} \\ &=A_{k} \cdot C_{k}^{u} \cdot A_{k}^{\top} \end{aligned} $$
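The two results $\underline{\hat{y}}_k = \mathbf{A}_k \underline{\hat{u}}_k$ and $C_k^y = \mathbf{A}_k C_k^u \mathbf{A}_k^\top$ can be checked with a small pure-Python sketch. The helper names and the numbers are assumptions chosen for this example, and matrices are stored as nested lists.

```python
def matmul(X, Y):
    """Matrix product of two nested-list matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


def transpose(X):
    return [list(row) for row in zip(*X)]


def propagate_moments(A, u_hat, C_u):
    """Mean and covariance of y = A u, given mean u_hat and covariance C_u."""
    y_hat = [sum(a * u for a, u in zip(row, u_hat)) for row in A]
    C_y = matmul(matmul(A, C_u), transpose(A))
    return y_hat, C_y


# Hypothetical example with P = 2 inputs and M = 1 output: y = u_1 + 2 * u_2
A = [[1.0, 2.0]]
u_hat = [1.0, 3.0]
C_u = [[4.0, 0.0],
       [0.0, 1.0]]

y_hat, C_y = propagate_moments(A, u_hat, C_u)
print(y_hat)  # [7.0]
print(C_y)    # [[8.0]]
```

For uncorrelated inputs this reduces to the familiar rule $\operatorname{Var}(u_1 + 2u_2) = 4 + 2^2 \cdot 1 = 8$.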

Dynamic systems

The system response depends not only on the current input $\underline{u}_k$ (analogous to discrete-valued systems), but also on the current state

States are stored in internal memories

The overall system ("Gauss-Markov model") consists of

System mapping

Measurement mapping

Graphical representation of dynamic systems

System mapping

Definition

A linear state-space model is called time-invariant (linear time-invariant, LTI) if the system matrices do not depend on the time index $k$, i.e.
$$ \mathbf{A}\_{k} = \mathbf{A}, \quad \mathbf{B}\_{k} = \mathbf{B} $$

Temporal evolution (linear)
$$ \underline{x}_{k+1}=\mathbf{A}_{k} \cdot \underline{x}_{k}+\mathbf{B}_{k} \cdot \underbrace{(\underline{\tilde{u}}_{k}+\underline{w}_{k})}_{=\underline{u}_{k}} $$

State: random vector $\underline{x}_k \in \mathbb{R}^N, k\in \mathbb{N}_0$

Markov model (first order): $\underline{x}_{k+1}$ depends ONLY on $\underline{x}_{k}$ and $\underline{u}_{k}$

Often $\underline{u}_{k}$ is modeled as a known input plus zero-mean noise
$$ \underline{u}_{k}=\underline{\tilde{u}}_{k}+\underline{w}_{k} $$

$\underline{\tilde{u}}_{k}$ known

Random vector $\underline{w}_{k}$ with $E\{\underline{w}_k\} = \underline{0}, \operatorname{Cov}\{\underline{w}_k\} = C_k^w$

Measurement mapping

The state $\underline{x}_k$ is typically NOT directly available

The output $\underline{y}_{k}$ depends on $\underline{x}_k$ and possibly on $\underline{u}_k$

Linear measurement mapping
$$ \underline{y}_{k}=\mathbf{H}_{k} \cdot \underline{x}_{k}+\underline{v}_{k} $$

$\underline{v}_{k}$: additive zero-mean measurement noise ($E\{\underline{v}_k\} = \underline{0}, \operatorname{Cov}\{\underline{v}_k\} = C_k^v$)

The measurement mapping is time-invariant if $\mathbf{H}_{k} = \mathbf{H}$

Aside: system properties of discrete-time systems

For definitions of the system properties of discrete-time systems, see Signale und Systeme¹, pages 312-314.

Linearity

A discrete-time system $\mathcal{S}$ is called linear if, for any two input signals $y_{\mathrm{e} 1, n}$ and $y_{\mathrm{e} 2, n}$ and any two constants $c_1, c_2 \in \mathbb{R}$ or $\mathbb{C}$,
$$ \mathcal{S}\left\{c_{1} y_{\mathrm{e} 1, n}+c_{2} y_{\mathrm{e} 2, n}\right\}=c_{1} \mathcal{S}\left\{y_{\mathrm{e} 1, n}\right\}+c_{2} \mathcal{S}\left\{y_{\mathrm{e} 2, n}\right\} $$
holds.

Extension to $N$ input signals
$$ \mathcal{S}\left\{\sum_{i=1}^{N} c_{i} y_{\mathrm{e} i, n}\right\}=\sum_{i=1}^{N} c_{i} \mathcal{S}\left\{y_{\mathrm{e} i, n}\right\} $$

Extension to infinitely many input signals
$$ \mathcal{S}\left\{\sum_{i=-\infty}^{\infty} c_{i} y_{\mathrm{e} i, n}\right\}=\sum_{i=-\infty}^{\infty} c_{i} \mathcal{S}\left\{y_{\mathrm{e} i, n}\right\} $$

Time invariance

A discrete-time system $\mathcal{S}$ is called time-invariant if it responds to a time-shifted input signal $y_{\mathrm{e}, n-n_{0}}$ with the correspondingly time-shifted output signal $y_{\mathrm{a}, n-n_{0}}$:
$$ y_{\mathrm{a}, n}=\mathcal{S}\left\{y_{\mathrm{e}, n}\right\} \quad \Longrightarrow \quad y_{\mathrm{a}, n-n_{0}}=\mathcal{S}\left\{y_{\mathrm{e}, n-n_{0}}\right\}. $$
Otherwise the system is called time-variant.

Causality

A discrete-time system $\mathcal{S}$ is called causal if its response depends ONLY on present or past values of the input signal, not on future ones.

This means that for a system $\mathcal{S}$,
$$ y_{\mathrm{e} 1, n}=y_{\mathrm{e} 2, n} \quad \text { for } n \leq n_{1} $$
and
$$ y_{\mathrm{a} 1, n}=\mathcal{S}\left\{y_{\mathrm{e} 1, n}\right\}, \quad y_{\mathrm{a} 2, n}=\mathcal{S}\left\{y_{\mathrm{e} 2, n}\right\} $$
always imply
$$ y_{\mathrm{a} 1, n}=y_{\mathrm{a} 2, n} \quad \text { for } n \leq n_{1}. $$
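
The definitions of linearity and time invariance can be probed numerically. The following is a minimal sketch, assuming finite signals that are zero before $n = 0$; the three example systems (`fir`, `ramp_gain`, `squarer`) and the helper names are hypothetical and only serve as illustration. A numerical check on a few signals can disprove a property, but it cannot prove it.

```python
def shift(signal, n0):
    """Delay a finite signal by n0 >= 0 samples, padding with zeros."""
    return [0.0] * n0 + list(signal[:len(signal) - n0])

def is_linear(system, x1, x2, c1=2.0, c2=-3.0, tol=1e-9):
    """Compare S{c1*x1 + c2*x2} with c1*S{x1} + c2*S{x2} on the given signals."""
    combined = [c1 * a + c2 * b for a, b in zip(x1, x2)]
    lhs = system(combined)
    rhs = [c1 * a + c2 * b for a, b in zip(system(x1), system(x2))]
    return all(abs(l - r) < tol for l, r in zip(lhs, rhs))

def is_time_invariant(system, x, n0=2, tol=1e-9):
    """Compare S{shifted input} with the shifted output S{input}."""
    lhs = system(shift(x, n0))
    rhs = shift(system(x), n0)
    return all(abs(l - r) < tol for l, r in zip(lhs, rhs))

# Hypothetical example systems
def fir(x):        # y_n = x_n + 0.5 * x_{n-1}: linear and time-invariant
    return [x[n] + 0.5 * (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]

def ramp_gain(x):  # y_n = n * x_n: linear, but time-variant
    return [n * x[n] for n in range(len(x))]

def squarer(x):    # y_n = x_n^2: time-invariant, but nonlinear
    return [v * v for v in x]

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [0.5, -1.0, 2.0, 0.0, 1.0]
for name, system in [("fir", fir), ("ramp_gain", ramp_gain), ("squarer", squarer)]:
    print(name, is_linear(system, x1, x2), is_time_invariant(system, x1))
```

All three example systems are causal by construction, since each output sample uses only present or past input samples.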

Example

(Exercise sheet 5, problem 1)

A discrete-time, continuous-valued system $S$ is described by the difference equation
$$ y_{k}-2^{k} \cdot y_{k+1}+3 \cdot y_{k+2}^{2}=4 \cdot u_{k}-2 \cdot u_{k+1}. $$

Is the system $S$ linear?

No. The system $S$ is NOT linear because of the term $y_{k+2}^{2}$.

Is the system $S$ time-invariant?

No. The system $S$ is time-variant because of the time-dependent coefficient $2^k$ of $y_{k+1}$.

Is the system $S$ causal?

Yes. The system $S$ is causal, since $y_{k+2}$ depends only on past input values ($u_k$ and $u_{k+1}$).
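
To make the causality argument explicit, the difference equation can be rearranged for the most recent output term. This is only a rearrangement of the given equation, shown as a sketch:
$$ y_{k+2}^{2}=\frac{1}{3}\left(4 u_{k}-2 u_{k+1}-y_{k}+2^{k} y_{k+1}\right) $$
The right-hand side contains only inputs and outputs up to time step $k+1$. Recovering $y_{k+2}$ itself requires a square root, so its sign and the existence of a real solution are a separate question that does not affect the causality argument.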

F. P. León and H. Jäkel. Signale und Systeme. De Gruyter Oldenbourg, Berlin, Boston, 02 Sep. 2019. ISBN 978-3-11-062632-2. doi: https://doi.org/10.1515/9783110626322. URL https://www.degruyter.com/view/title/543041. ↩︎

State Estimation: Kalman Filter

Thu, 16 Jun 2022 00:00:00 +0000

For a detailed summary of the Kalman filter, see here.

Prediction

We want to perform a one-step prediction of the state: at time step $k$ ($k \geq m$, $m:= \text{\#measurements}$), predict the state $\underline{x}_{k+1}$.

Model:
$$ \underline{x}_{k+1}=\mathbf{A}_{k} \cdot \underline{x}_{k}+\mathbf{B}_{k} \cdot \underbrace{\left(\underline{\tilde{u}}_{k}+\underline{w}_{k}\right)}_{=\underline{u}_{k}} $$

Initial estimate at time step $k$:
$$ \underline{x}_{k|1:m} $$

based on the measurements $\underline{y}_{1}, \dots, \underline{y}_{m}$

and the inputs $\underline{\tilde{u}}_{0}, \dots, \underline{\tilde{u}}_{k-1}$

with mean $\underline{\hat{x}}_{k|1:m}$ and covariance matrix $\mathbf{C}_{k|1:m}^x$

Computation of the mean at time step $k+1$
$$ \begin{aligned} &E\left\{\underline{x}_{k+1}\right\}\\\\ =&E\left\{\mathbf{A}_{k} \cdot \underline{x}_{k}+\mathbf{B}_{k}\left(\underline{\tilde{u}}_{k}+\underline{w}_{k}\right)\right\}\\\\ =&E\left\{\mathbf{A}_{k} \cdot \underline{x}_{k}+\mathbf{B}_{k} \underline{\tilde{u}}_{k}+\mathbf{B}_{k} \underline{w}_{k}\right\}\\\\ =&\mathbf{A}_{k} \cdot E\left\{\underline{x}_{k}\right\}+\mathbf{B}_{k} \cdot \underbrace{E\left\{\underline{\tilde{u}}_{k}\right\}}_{=\underline{\tilde{u}}_{k} \text{ (since } \underline{\tilde{u}}_{k} \text{ is deterministic)}}+\mathbf{B}_{k} \cdot\underbrace{E\left\{\underline{w}_{k}\right\}}_{=\underline{0} \text{ (zero-mean)}}\\\\ =&\mathbf{A}_{k} \cdot \underline{\hat{x}}_{k|1: m}+\mathbf{B}_{k} \underline{\tilde{u}}_{k} \qquad (+) \end{aligned} $$

Computation of the covariance matrix $\mathbf{C}_{k+1|1:m}^x$
$$ \begin{aligned} \underline{x}_{k+1} &=\mathbf{A}_{k} \underline{x}_{k}+\mathbf{B}_{k} \underline{u}_{k} \\ &=\left[\begin{array}{ll} \mathbf{A}_{k} & \mathbf{B}_{k} \end{array}\right]\left[\begin{array}{c} \underline{x}_{k} \\ \underline{u}_{k} \end{array}\right] \end{aligned} $$ $$ \begin{aligned} \underline{x}_{k+1}-\hat{\underline{x}}_{k+1} &=\left[\begin{array}{ll} \mathbf{A}_{k} & \mathbf{B}_{k} \end{array}\right]\left[\begin{array}{c} \underline{x}_{k}-\hat{\underline{x}}_{k} \\ \underline{u}_{k}-\underline{\hat{u}}_{k} \end{array}\right] \\ &=\left[\begin{array}{ll} \mathbf{A}_{k} & \mathbf{B}_{k} \end{array}\right]\left[\begin{array}{c} \underline{x}_{k}-\underline{\hat{x}}_{k} \\ \underline{w}_{k} \end{array}\right] \end{aligned} $$
Assumption: the state and the system noise are uncorrelated
$$ \begin{aligned} \operatorname{Cov}\left\{\left[\begin{array}{c} \underline{x}_{k} \\ \underline{u}_{k} \end{array}\right]\right\} &=E\left\{\left[\begin{array}{c} \underline{x}_{k}-\underline{\hat{x}}_{k} \\ \underline{w}_{k} \end{array}\right]\left[\begin{array}{cc} \left(\underline{x}_{k}-\underline{\hat{x}}_{k}\right)^{\top} & \underline{w}_{k}^{\top} \end{array}\right]\right\} \\ &=\left[\begin{array}{cc} \mathbf{C}_{k \mid 1: m}^{x} & \mathbf{0} \\ \mathbf{0} & \mathbf{C}_{k}^{w} \end{array}\right] \end{aligned} $$ $$ \begin{aligned} \mathbf{C}_{k+1 \mid 1 : m}^{x} &=E\left\{\left(\underline{x}_{k+1}-\underline{\hat{x}}_{k+1}\right)\left(\underline{x}_{k+1} - \underline{\hat{x}}_{k+1}\right)^\top\right\} \\ &=\left[\begin{array}{ll} \mathbf{A}_{k} & \mathbf{B}_{k} \end{array}\right] \cdot E\left\{\left[\begin{array}{c} \underline{x}_{k}-\underline{\hat{x}}_{k} \\ \underline{w}_{k} \end{array}\right]\left[\begin{array}{c} \underline{x}_{k}-\underline{\hat{x}}_{k} \\ \underline{w}_{k} \end{array}\right]^\top\right\} \cdot\left[\begin{array}{l} \mathbf{A}_{k}^{\top} \\ \mathbf{B}_{k}^{\top} \end{array}\right] \\\\ &=\left[\begin{array}{ll} \mathbf{A}_{k} & \mathbf{B}_{k} \end{array}\right] \cdot\left[\begin{array}{cc} \mathbf{C}_{k \mid 1:m}^{x} & \mathbf{0} \\ \mathbf{0} & \mathbf{C}_{k}^{w} \end{array}\right] \cdot\left[\begin{array}{l} \mathbf{A}_{k}^{\top} \\ \mathbf{B}_{k}^{\top} \end{array}\right] \\ &=\mathbf{A}_{k} \mathbf{C}_{k \mid 1: m}^{x} \mathbf{A}_{k}^{\top}+\mathbf{B}_{k} \mathbf{C}_{k}^{w} \mathbf{B}_{k}^{\top} \qquad(++) \end{aligned} $$

Recursive prediction

Start with the mean $\underline{\hat{x}}_{m|1:m}$ and the covariance matrix $\mathbf{C}_{m|1:m}^x$

Applying $(+)$ and $(++)$ recursively yields the predictions for all time steps $k > m$

Examples: exercise sheet 5, problem 4
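
The recursion can be sketched in a few lines of code. This is a minimal, dependency-free sketch: the helper functions `matmul`, `transpose`, `add` and `predict` as well as all numerical values are assumptions made for illustration and do not come from the exercise sheet. Vectors are stored as single-column matrices.

```python
def matmul(X, Y):
    """Matrix product of two matrices given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def predict(x_hat, C, A, B, u_tilde, C_w):
    """One prediction step: (+) for the mean, (++) for the covariance."""
    x_next = add(matmul(A, x_hat), matmul(B, u_tilde))
    C_next = add(matmul(matmul(A, C), transpose(A)),
                 matmul(matmul(B, C_w), transpose(B)))
    return x_next, C_next

# Assumed example values: position/velocity state with sampling interval 1
A = [[1.0, 1.0], [0.0, 1.0]]
B = [[0.5], [1.0]]
C_w = [[0.1]]
x_hat = [[0.0], [1.0]]               # mean at time step m (column vector)
C = [[1.0, 0.0], [0.0, 1.0]]         # covariance at time step m

for k in range(3):                   # three prediction steps without new measurements
    x_hat, C = predict(x_hat, C, A, B, [[0.0]], C_w)
    print(k + 1, x_hat, C)
```

Because no measurement is processed between the steps, the diagonal entries of the covariance grow from step to step in this example.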

Filtering

Reminder

Structure of the dynamic system

Graphical representation of dynamic systems

Measurement mapping
$$ \underline{y}\_{k}=\mathbf{H}\_{k} \cdot \underline{x}\_{k}+\underline{v}\_{k} $$

Approach: linear estimator
$$ \underline{x}_{k \mid 1: k}=\mathbf{K}_{k}^{(1)} \underline{x}_{k \mid 1: k-1}+\mathbf{K}_{k}^{(2)} \underline{y}_{k} \qquad(\ast) $$
🎯 We are looking for the so-called BLUE filter (Best Linear Unbiased Estimator) 💪

An estimator is called unbiased if its expected value equals the true value of the parameter being estimated.

If an estimator is not unbiased, it is called biased. The amount by which its expected value deviates from the true value is called the bias. The bias expresses the systematic error of the estimator.

Source and example: Wiki

Unbiasedness
$$ \begin{aligned} E\left\{\underline{x}_{k \mid 1: k}\right\}&=E\left\{\mathbf{K}_{k}^{(1)} \underline{x}_{k \mid 1: k-1}+\mathbf{K}_{k}^{(2)} \underline{y}_{k}\right\} \\ E\left\{\underline{x}_{k \mid 1: k}\right\}&=\mathbf{K}_{k}^{(1)} E\left\{\underline{x}_{k \mid 1: k-1}\right\}+\mathbf{K}_{k}^{(2)} E\left\{\underline{y}_{k}\right\} \\ E\left\{\underline{x}_{k \mid 1: k}\right\}&=\mathbf{K}_{k}^{(1)} E\left\{\underline{x}_{k \mid 1: k-1}\right\}+\mathbf{K}_{k}^{(2)} E\left\{\mathbf{H}_{k} \cdot \underline{x}_{k}+\underline{v}_{k}\right\} \\ E\left\{\underline{x}_{k \mid 1: k}\right\}&=\mathbf{K}_{k}^{(1)} E\left\{\underline{x}_{k \mid 1: k-1}\right\}+\mathbf{K}_{k}^{(2)} \mathbf{H}_{k} E\left\{\underline{x}_{k}\right\} \quad \mid \text { unbiasedness } \\ \underline{\tilde{x}}&=\mathbf{K}_{k}^{(1)} \underline{\tilde{x}}+\mathbf{K}_{k}^{(2)} \mathbf{H}_{k} \cdot \underline{\tilde{x}} \\ \Rightarrow \mathbf{I} &=\mathbf{K}_{k}^{(1)}+\mathbf{K}_{k}^{(2)} \mathbf{H}_{k} \end{aligned} $$
Here $\underline{\tilde{x}}$ denotes the true value, which all three expected values must equal if the estimator is unbiased. The last line follows because the equality has to hold for every $\underline{\tilde{x}}$.

For example,
$$ \begin{aligned} \mathbf{K}_{k}^{(1)} &= \mathbf{I} - \mathbf{K}_{k}\mathbf{H}_{k} \\ \mathbf{K}_{k}^{(2)} &= \mathbf{K}_{k} \end{aligned} $$

Substituting into $(\ast)$:
$$ \underbrace{\underline{x}_{k \mid 1: k}}_{=: \underline{x}_{k}^{e}}=\left(\mathbf{I}-\mathbf{K}_{k}\mathbf{H}_{k} \right) \underbrace{\underline{x}_{k \mid 1: k-1}}_{=: \underline{x}_{k}^{p}}+\mathbf{K}_{k} \underline{y}_{k} \qquad(* *) $$
However, the estimator is not yet fully determined, because $\mathbf{K}_{k}$ has not been fixed.

$\Rightarrow$ We look for the $\mathbf{K}_{k}$ for which the resulting estimator has MINIMAL covariance (“minimum-variance estimator”).

Assume that the measurement noise is uncorrelated with the prior estimate. Then $(\ast\ast)$ gives
$$ \underbrace{\mathbf{C}_{k \mid 1: k}\left(\mathbf{K}_{k}\right)}_{=: \mathbf{C}_{k}^{e}\left(\mathbf{K}_{k}\right)}=\left(\mathbf{I}-\mathbf{K}_{k} \mathbf{H}_{k}\right) \underbrace{\mathbf{C}_{k \mid 1: k-1}^{x}}_{=: \mathbf{C}_{k}^{p}}\left(\mathbf{I}-\mathbf{K}_{k} \mathbf{H}_{k}\right)^{\top}+\mathbf{K}_{k} C_{k}^{v} \mathbf{K}_{k}^{\top} \qquad(\ast\ast\ast) $$
We now regard the filter covariance $\mathbf{C}_{k}^{e}$ as a function of $\mathbf{K}_{k}$, i.e. $\mathbf{C}_{k}^{e}(\mathbf{K}_k)$. The goal is to find the $\mathbf{K}_{k}$ for which the filter covariance is as small as possible.

Trick: reduce the problem to a scalar quality measure

That is, to compare covariance matrices in general, one uses functions that map an $n \times n$ matrix to $\mathbb{R}^1$. In other words, functions that assign a scalar to a covariance matrix, because only scalars can be compared directly.

E.g., projection onto an arbitrary unit vector $\underline{e}$
$$ P(\mathbf{K}) = \underline{e}^\top \cdot \mathbf{C}_e(\mathbf{K}) \cdot \underline{e} $$
MINIMAL covariance $\Leftrightarrow$ $P(\mathbf{K})$ is minimal for every unit vector $\underline{e}$.

Other possible scalar quality measures:

$\operatorname{tr}(\cdot)$ (trace, German: Spur): sum of the diagonal elements, e.g. for a $2 \times 2$ covariance matrix
$$ \operatorname{tr}(\mathbf{C})=\sigma_{x}^{2}+\sigma_{y}^{2} $$

$\operatorname{det}(\cdot)$: determinant, i.e. the product of the eigenvalues, e.g. for a diagonal $2 \times 2$ covariance matrix
$$ \operatorname{det}(\mathbf{C})=\sigma_{x}^{2} \cdot \sigma_{y}^{2} $$

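Example

Derivative using the matrix differentiation rules (the time index $k$ is omitted; $\mathbf{C}_{p}$ is the prior covariance $\mathbf{C}_{k}^{p}$ and $\mathbf{C}_{y}$ the measurement noise covariance $\mathbf{C}_{k}^{v}$):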
$$ \begin{aligned} \frac{\partial}{\partial \mathbf{K}} P(\mathbf{K}) &=\frac{\partial}{\partial \mathbf{K}}\left\{\underline{e}^{\top}\left[(\mathbf{I}-\mathbf{K} \mathbf{H}) \mathbf{C}_{p}(\mathbf{I}-\mathbf{K} \mathbf{H})^{\top}+\mathbf{K} \mathbf{C}_{y} \mathbf{K}^{\top}\right] \underline{e}\right\} \\ &=\frac{\partial}{\partial \mathbf{K}}\left\{\underline{e}^{\top}\left[\mathbf{C}_{p}-\mathbf{C}_{p} \mathbf{H}^{\top} \mathbf{K}^{\top}-\mathbf{K} \mathbf{H} \mathbf{C}_{p}+\mathbf{K} \mathbf{H} \mathbf{C}_{p} \mathbf{H}^{\top} \mathbf{K}^{\top}+\mathbf{K} \mathbf{C}_{y} \mathbf{K}^{\top}\right] \underline{e}\right\} \\ &=-\left[\mathbf{H} \mathbf{C}_{p} \underline{e} \underline{e}^{\top}\right]^{\top}-\underline{e} \underline{e}^{\top}\left(\mathbf{H} \mathbf{C}_{p}\right)^{\top}+2 \underline{e} \underline{e}^{\top} \mathbf{K} \mathbf{H} \mathbf{C}_{p} \mathbf{H}^{\top}+2 \underline{e} \underline{e}^{\top} \cdot \mathbf{K} \mathbf{C}_{y} \\ &\overset{!}{=} \mathbf{0} \end{aligned} $$
Since this has to hold for every unit vector $\underline{e}$, it follows that
$$ \begin{array}{l} -\mathbf{C}_{p} \mathbf{H}^{\top}-\mathbf{C}_{p} \mathbf{H}^{\top}+2 \mathbf{K} \mathbf{H} \mathbf{C}_{p} \mathbf{H}^{\top}+2 \mathbf{K} \mathbf{C}_{y} \stackrel{!}{=} \mathbf{0} \\ \mathbf{K}\left(\mathbf{C}_{y}+\mathbf{H} \mathbf{C}_{p} \mathbf{H}^{\top}\right)=\mathbf{C}_{p} \mathbf{H}^{\top} \\ \mathbf{K}=\mathbf{C}_{p} \mathbf{H}^{\top}\left(\mathbf{C}_{y}+\mathbf{H} \mathbf{C}_{p} \mathbf{H}^{\top}\right)^{-1} \quad \text { (Kalman gain) } \end{array} $$
Substituting $\mathbf{K}$ into $(\ast \ast)$ gives
$$ \begin{aligned} \underline{\hat{x}}_{e} &=(\mathbf{I}-\mathbf{K} \mathbf{H}) \underline{\hat{x}}_{p}+\mathbf{K} \cdot \underline{\hat{y}} \qquad \text { (combination form) } \\ &=\underline{\hat{x}}_{p}+\mathbf{K}\left(\underline{\hat{y}}-\mathbf{H} \cdot \underline{\hat{x}}_{p}\right) \qquad \text { (feedback form) } \\ &=\underline{\hat{x}}_{p}+\mathbf{C}_{p} \mathbf{H}^{\top}\left(\mathbf{C}_{y}+\mathbf{H} \mathbf{C}_{p} \mathbf{H}^{\top}\right)^{-1}\left(\underline{\hat{y}}-\mathbf{H} \cdot \underline{\hat{x}}_{p}\right) \end{aligned} $$
This is the Kalman filter.

Now substitute $\mathbf{K}$ into $(\ast \ast \ast)$ to compute the covariance matrix.
$$ \begin{aligned} \mathbf{C}_{e}=& {\left[\mathbf{I}-\mathbf{C}_{p} \mathbf{H}^{\top}\left(\mathbf{C}_{y}+\mathbf{H} \mathbf{C}_{p} \mathbf{H}^{\top}\right)^{-1} \mathbf{H}\right] \cdot \mathbf{C}_{p} } \\ & \cdot\left[\mathbf{I}-\mathbf{C}_{p} \mathbf{H}^{\top}\left(\mathbf{C}_{y}+\mathbf{H} \mathbf{C}_{p} \mathbf{H}^{\top}\right)^{-1} \mathbf{H}\right]^{\top} \\ &+\mathbf{C}_{p} \mathbf{H}^{\top}\left(\mathbf{C}_{y}+\mathbf{H} \mathbf{C}_{p} \mathbf{H}^{\top}\right)^{-1} \mathbf{C}_{y}\left(\mathbf{C}_{y}+\mathbf{H} \mathbf{C}_{p} \mathbf{H}^{\top}\right)^{-1} \mathbf{H} \mathbf{C}_{p} \\\\ =& \mathbf{C}_{p}-2 \mathbf{C}_{p} \mathbf{H}^{\top}\left(\mathbf{C}_{y}+\mathbf{H} \mathbf{C}_{p} \mathbf{H}^{\top}\right)^{-1} \mathbf{H} \mathbf{C}_{p} \\ &+\mathbf{C}_{p} \mathbf{H}^{\top}\left(\mathbf{C}_{y}+\mathbf{H} \mathbf{C}_{p} \mathbf{H}^{\top}\right)^{-1} \mathbf{H} \mathbf{C}_{p} \mathbf{H}^{\top}\left(\mathbf{C}_{y}+\mathbf{H} \mathbf{C}_{p} \mathbf{H}^{\top}\right)^{-1} \mathbf{H} \mathbf{C}_{p} \\ &+\mathbf{C}_{p} \mathbf{H}^{\top}(\underbrace{\mathbf{C}_{y}+\mathbf{H} \mathbf{C}_{p} \mathbf{H}^{\top}}_{=:\mathbf{D}})^{-1} \mathbf{C}_{y}\left(\mathbf{C}_{y}+\mathbf{H} \mathbf{C}_{p} \mathbf{H}^{\top}\right)^{-1} \mathbf{H} \mathbf{C}_{p}\\\\ =& \mathbf{C}_{p}-2 \mathbf{C}_{p} \mathbf{H}^{\top} \mathbf{D}^{-1} \mathbf{H} \mathbf{C}_{p}+\mathbf{C}_{p} \mathbf{H}^{\top} \mathbf{D}^{-1} \mathbf{D} \mathbf{D}^{-1} \mathbf{H} \mathbf{C}_{p} \\\\ =& \mathbf{C}_{p}-\mathbf{C}_{p} \mathbf{H}^{\top}\left(\mathbf{C}_{y}+\mathbf{H} \mathbf{C}_{p} \mathbf{H}^{\top}\right)^{-1} \mathbf{H} \mathbf{C}_{p} \end{aligned} $$
Step-by-step derivation: exercise sheet 6, problem 1 (very detailed and helpful! 👍)
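
For a scalar state and a scalar measurement, the filter step reduces to a few arithmetic operations. The following minimal sketch assumes this scalar case; the function names and the numbers are hypothetical. It also compares the simplified covariance with the general form $(\ast\ast\ast)$, which is valid for any gain.

```python
def kalman_update(x_p, C_p, y, h, C_v):
    """Scalar filter step: Kalman gain, feedback form, simplified covariance."""
    K = C_p * h / (C_v + h * C_p * h)
    x_e = x_p + K * (y - h * x_p)
    C_e = C_p - K * h * C_p
    return x_e, C_e, K

def covariance_for_gain(C_p, K, h, C_v):
    """Posterior covariance for an arbitrary gain K, i.e. the scalar form of (***)."""
    return (1.0 - K * h) * C_p * (1.0 - K * h) + K * C_v * K

# Assumed example values: uncertain prior, fairly accurate measurement
x_e, C_e, K = kalman_update(x_p=0.0, C_p=4.0, y=5.0, h=1.0, C_v=1.0)
print(x_e, C_e, K)
print(covariance_for_gain(4.0, K, 1.0, 1.0))    # same value as C_e
print(covariance_for_gain(4.0, 0.5, 1.0, 1.0))  # a different gain gives a larger covariance
```

With the Kalman gain both covariance expressions agree, while any other gain (here $0.5$) leads to a larger posterior variance, which is the minimum-variance property in the scalar case.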

Example

Complete Kalman filter: exercise sheet 5, problem 3

Prediction: exercise sheet 5, problem 4

Filtering: exercise sheet 6, problem 1

Continuous-Valued Nonlinear Systems

Thu, 30 Jun 2022 00:00:00 +0000

Static and Dynamic Systems

Thu, 30 Jun 2022 00:00:00 +0000

Static systems

Input/output: random vectors $\underline{u}_k$ and $\underline{y}_k$ ($k \in \mathbb{N}_0$ is the time step)

$\underline{u}_k \in \mathbb{R}^P$ and $\underline{y}_k \in \mathbb{R}^M$ are continuous-valued

$\underline{u}_k$ is mapped to $\underline{y}_k$ by a nonlinear function
$$ \underline{y}_{k}=\underline{a}_{k}\left(\underline{u}_{k}\right) \tag{Generative model} $$

The uncertainty in $\underline{u}_k$ and $\underline{y}_k$ is described by densities

$\underline{u}_k$ : $f_{k}^{u}\left(\underline{u}_{k}\right)$

$\underline{y}_k$ : $f_k^y(\underline{y}_k)$

Wanted: $f_k^y(\underline{y}_k)$ for a given $f_{k}^{u}\left(\underline{u}_{k}\right)$

Dynamic systems

System mapping

State $\underline{x}_k, k \in \mathbb{N}_0$ with $\underline{x}_k \in \mathbb{R}^N$

Nonlinear system (general)
$$ \underline{x}_{k+1}=\underline{a}_{k}\left(\underline{x}_{k}, \underline{\hat{u}}_{k}, \underline{w}_{k}\right) $$

$\underline{x}_k$ is described by the density $f_k^x(\underline{x}_k)$

Special noise structure: additive noise
$$ \underline{x}_{k+1}=\underline{a}_{k}\left(\underline{x}_{k}, \underline{\hat{u}}_{k}\right)+\underline{w}_{k} $$

The system noise $\underline{w}_k$ is described by the density $f_k^w(\underline{w}_k)$

Typical assumptions

$\underline{w}_k$ is Gaussian with known parameters

$\underline{w}_k$ is white noise

White noise: the noise values at different time steps are independent

Measurement mapping

Nonlinear mapping (general)
$$ \underline{y}_{k}=\underline{h}_{k}\left(\underline{x}_{k}, \underline{v}_{k}\right) $$

Special case: additive noise
$$ \underline{y}_{k}=\underline{h}_{k}\left(\underline{x}_{k}\right) + \underline{v}_{k} $$
The noise $\underline{v}_{k}$ is described by $f_k^v(\underline{v}_k)$

Overall system

Note: The system is encapsulated. From the outside we can only observe $\underline{\hat{u}}_{k}$ and $\underline{y}_k$.

Linear vs. Nonlinear Systems

| | Linear | Nonlinear |
|---|---|---|
| System mapping | $\underline{x}_{k+1} = \mathbf{A}_k \underline{x}_k + \mathbf{B}_k (\underline{u}_k + \underline{w}_k)$ | $\underline{x}_{k+1} = \underline{a}_k(\underline{x}_k, \underline{u}_k, \underline{w}_k)$ |
| Measurement mapping | $\underline{y}_{k} = \mathbf{H}_k \underline{x}_k + \underline{v}_k$ | $\underline{y}_k = \underline{h}_k (\underline{x}_k, \underline{v}_k)$ |

Nonlinear Estimation

Thu, 30 Jun 2022 00:00:00 +0000

Approximation by linearization

Idea

Linearize the nonlinear function

Apply the (standard/linear) Kalman filter

System model
$$ \underline{x}_{k+1}=\underline{a}_{k}\left(\underline{x}_{k}, \underline{u}_{k}\right) \tag{Systemmodell} $$
Linearization of the right-hand side of $\text{(Systemmodell)}$ by a Taylor expansion about $\underline{\overline{x}}_k, \underline{\overline{u}}_k$ :
$$ \begin{aligned} \underline{a}_{k}\left(\underline{x}_{k}, \underline{u}_{k}\right) = \underline{a}_{k}\left(\underline{\overline{x}}_k, \underline{\overline{u}}_k\right) &+ \overbrace{\left.\frac{\partial \underline{a}_{k}\left(\underline{x}_{k}, \underline{u}_{k}\right)}{\partial \underline{x}_{k}^{\top}}\right|_{\underline{x}_{k}=\underline{\overline{x}}_{k}, \underline{u}_{k}=\underline{\overline{u}}_{k}}}^{=\mathbf{A}_{k}} \cdot \overbrace{(\underline{x}_{k} - \underline{\overline{x}}_k)}^{=\Delta \underline{x}_{k}} \\\\ &+ \underbrace{\left.\frac{\partial \underline{a}_{k}\left(\underline{x}_{k}, \underline{u}_{k}\right)}{\partial \underline{u}_{k}^{\top}}\right|_{\underline{x}_{k}=\underline{\overline{x}}_{k}, \underline{u}_{k}=\underline{\overline{u}}_{k}}}_{=\mathbf{B}_{k}} \cdot \underbrace{(\underline{u}_{k} - \underline{\overline{u}}_k)}_{=\Delta \underline{u}_{k}} + \text{THO} \end{aligned} $$

$\text{THO}$: higher-order terms (Terme höherer Ordnung)

Jacobian matrices
$$ \begin{array}{l} \mathbf{A}_{k}=\left[\begin{array}{ccc} \frac{\partial a_{k, 1}}{\partial x_{k, 1}} & \cdots & \frac{\partial a_{k, 1}}{\partial x_{k, N}} \\ \vdots & & \vdots \\ \frac{\partial a_{k, N}}{\partial x_{k, 1}} & \cdots & \frac{\partial a_{k, N}}{\partial x_{k, N}} \end{array}\right]_{\underline{x}_{k}=\overline{\underline{x}}_{k}, \underline{u}_{k}= \bar{\underline{u}}_{k}} \\\\ \mathbf{B}_{k}=\left[\begin{array}{ccc} \frac{\partial a_{k, 1}}{\partial u_{k, 1}} & \cdots & \frac{\partial a_{k, 1}}{\partial u_{k, P}} \\ \vdots & & \vdots \\ \frac{\partial a_{k, N}}{\partial u_{k, 1}} & \cdots & \frac{\partial a_{k, N}}{\partial u_{k, P}} \end{array}\right]_{\underline{x}_{k}=\overline{\underline{x}}_{k}, \underline{u}_{k}= \bar{\underline{u}}_{k}} \end{array} $$
Here $\mathbf{A}_{k}$ is an $N \times N$ matrix and $\mathbf{B}_{k}$ an $N \times P$ matrix, where $P$ is the dimension of $\underline{u}_{k}$.

Assumptions

The derivatives exist

$\underline{a}_k(\cdot, \cdot)$ is sufficiently linear around $\underline{\overline{x}}_k, \underline{\overline{u}}_k$

Neglecting $\text{THO} \Rightarrow$
$$ \underline{a}_{k}\left(\underline{x}_{k}, \underline{u}_{k}\right) \approx \underline{a}_{k}\left(\underline{\overline{x}}_k, \underline{\overline{u}}_k\right)+\mathbf{A}_{k}\left(\underline{x}_k-\underline{\overline{x}}_k\right)+\mathbf{B}_{k}\left(\underline{u}_{k}-\underline{\overline{u}}_k\right) $$
For the left-hand side of $(\text{Systemmodell})$:
$$ \underline{x}_{k+1}= \underline{\overline{x}}_{k+1} + \Delta \underline{x}_{k+1} $$
For $\underline{u}_{k}$ we define
$$ \underline{u}_{k}:=\underline{\hat{u}}_{k}+\underline{w}_{k} $$
with $E\left\{\underline{w}_{k}\right\}=\underline{0}, \operatorname{Cov}\left\{\underline{w}_{k}\right\}=\mathbf{C}_{k}^{w}$

Linearization: $\overline{\underline{u}}_{k} \overset{!}{=} \hat{\underline{u}}_{k} \Rightarrow \Delta \underline{u}_{k}= \underline{u}_{k} -\overline{\underline{u}}_{k} = \underline{w}_{k}$ (i.e. the deviation $\underline{w}_k$ is noise)

Equivalent noise
$$ \underline{w}_{k}^{\prime}=\mathbf{B}_{k} \cdot \underline{w}_{k} \Rightarrow E\left\{\underline{w}_{k}^{\prime}\right\}=\underline{0}, \operatorname{Cov}\left\{\underline{w}_{k}^{\prime}\right\}=\mathbf{B}_{k} \cdot \mathbf{C}_{k}^{w} \cdot \mathbf{B}_{k}^{\top} $$

With the linearization of both sides above, the system model can be written as:
$$ \underline{\overline{x}}_{k+1} + \Delta \underline{x}_{k+1} \approx \underline{a}_{k}\left(\underline{\overline{x}}_k, \underline{\overline{u}}_k\right)+\mathbf{A}_{k}\Delta \underline{x}_k+\underline{w}_k^\prime $$

Nominal part
$$ \underline{\overline{x}}_{k+1} = \underline{a}_{k}\left(\underline{\overline{x}}_k, \underline{\overline{u}}_k\right) $$

Differential part
$$ \Delta \underline{x}_{k+1} \approx \mathbf{A}_{k}\Delta \underline{x}_k+\underline{w}_k^\prime $$
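
When the Jacobian is tedious to derive by hand, it can be checked against a finite-difference approximation. The following minimal sketch assumes a hypothetical two-dimensional system function `a` with the input held fixed; neither the function nor the helper `numerical_jacobian` comes from the lecture.

```python
import math

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference approximation of the Jacobian of f at the point x."""
    fx = f(x)
    J = [[0.0] * len(x) for _ in fx]
    for j in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus, f_minus = f(x_plus), f(x_minus)
        for i in range(len(fx)):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2.0 * eps)
    return J

def a(x):
    # Hypothetical nonlinear system function (input held fixed), for illustration only
    return [x[0] + math.sin(x[1]), x[0] * x[1]]

x_bar = [1.0, 0.5]                      # nominal point of the linearization
A_num = numerical_jacobian(a, x_bar)
A_exact = [[1.0, math.cos(x_bar[1])],   # analytic Jacobian of the function above
           [x_bar[1], x_bar[0]]]
print(A_num)
print(A_exact)
```

The two matrices agree up to the discretization error of the central difference, which gives a quick sanity check for a hand-derived $\mathbf{A}_k$.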

Measurement equation
$$ \underline{y}_{k}=\underline{h}_{k}\left(\underline{x}_{k}, \underline{v}_{k}\right) \tag{Messgleichung} $$
Linearization of the right-hand side about $\underline{\bar{x}}_{k}, \underline{\bar{v}}_{k}$ :
$$ \underline{h}_{k}\left(\underline{x}_{k}, \underline{v}_{k}\right) \approx \underline{h}_{k}\left(\underline{\bar{x}}_{k}, \underline{\bar{v}}_{k}\right)+\mathbf{H}_{k} \cdot \underbrace{\left(\underline{x}_{k}-\underline{\bar{x}}_{k}\right)}_{=\Delta \underline{x}_k}+\mathbf{L}_{k} \cdot\left(\underline{v}_{k}-\underline{\bar{v}}_{k}\right) $$
with the Jacobian matrices
$$ \mathbf{H}_{k}=\left.\frac{\partial \underline{h}_{k}\left(\underline{x}_{k}, \underline{v}_{k}\right)}{\partial \underline{x}_{k}^{\top}}\right|_{\underline{x}_{k}=\underline{\bar{x}}_{k}, \underline{v}_{k}=\underline{\bar{v}}_{k}} \qquad \mathbf{L}_{k}=\left.\frac{\partial \underline{h}_{k}\left(\underline{x}_{k}, \underline{v}_{k}\right)}{\partial \underline{v}_{k}^{\top}}\right|_{\underline{x}_{k}=\underline{\bar{x}}_{k}, \underline{v}_{k}=\underline{\bar{v}}_{k}} $$
Let $\underline{\bar{v}}_{k} = \underline{\hat{v}}_{k}$; for zero-mean $\underline{v}_{k}$ this gives $\underline{\hat{v}}_{k} = \underline{0}$

The effective noise is then
$$ \underline{v}_{k}^\prime = \mathbf{L}_{k} \cdot \underline{v}_{k} $$
with
$$ E\left\{\underline{v}_{k}^{\prime}\right\}=\underline{0}, \quad \operatorname{Cov}\left\{\underline{v}_{k}^{\prime}\right\}=\mathbf{L}_{k} \cdot \mathbf{C}_{k}^{v} \cdot \mathbf{L}_{k}^{\top} $$
The measurement equation can therefore be rewritten as:
$$ \underline{y}_{k}=\underline{\bar{y}}_{k}+\Delta \underline{y}_{k} \approx \underline{h}_{k}\left(\underline{\bar{x}}_{k}, \underline{\bar{v}}_{k}\right)+\mathbf{H}_{k} \Delta \underline{x}_{k}+\underline{v}_{k}^{\prime} $$

Nominal part
$$ \underline{\bar{y}}_{k} = \underline{h}_{k}\left(\underline{\bar{x}}_{k}, \underline{\bar{v}}_{k}\right) $$

Differential part
$$ \Delta \underline{y}_{k} \approx \mathbf{H}_{k} \Delta \underline{x}_{k}+\underline{v}_{k}^{\prime} $$

Extended Kalman filter (EKF)

💡 Linearize about the current best estimate

Prediction

Compute the mean via the nonlinear function
$$ \underline{\hat{x}}_{k+1}^{p}=\underline{a}_{k}\left(\underline{\hat{x}}_{k}^{e}, \hat{\underline{u}}_{k}\right) $$

Compute the covariance matrix via the linearization
$$ \mathbf{C}_{k+1}^{p} \approx \mathbf{A}_{k} \mathbf{C}_{k}^{e} \mathbf{A}_{k}^{\top}+\mathbf{C}_{k}^{w^{\prime}}=\mathbf{A}_{k} \mathbf{C}_{k}^{e} \mathbf{A}_{k}^{\top}+\mathbf{B}_{k} \mathbf{C}_{k}^{w} \mathbf{B}_{k}^{\top} $$
with
$$ \mathbf{A}_k = \left.\frac{\partial \underline{a}_{k}\left(\underline{x}_{k}, \underline{u}_{k}\right)}{\partial \underline{x}_{k}^{\top}}\right|_{\underline{x}_{k}=\underline{\hat{x}}_{k}^{e}, \underline{u}_{k}=\hat{\underline{u}}_{k}} \qquad \mathbf{B}_k = \left.\frac{\partial \underline{a}_{k}\left(\underline{x}_{k}, \underline{u}_{k}\right)}{\partial \underline{u}_{k}^{\top}}\right|_{\underline{x}_{k}=\underline{\hat{x}}_{k}^{e}, \underline{u}_{k}=\hat{\underline{u}}_{k}} $$

Filtering

Computation of $\underline{\bar{y}}_k$ (the measurement predicted from the prior estimate, i.e. from the prediction, used as the nominal value at the current time step)
$$ \underline{\bar{y}}_k = \underline{h}_k(\underline{\hat{x}}_k^p, \underline{\hat{v}}_k) $$

Computation of $\Delta \underline{y}_k$
$$ \Delta \underline{y}_{k}=\underline{\hat{y}}_{k}-\underline{\bar{y}}_{k} $$

$\underline{\hat{y}}_{k}$ : actual measurement

and
$$ \Delta \underline{y}_{k} \approx \mathbf{H}_{k} \cdot\left(\underline{x}_{k}^{e}-\underline{\hat{x}}_{k}^{p}\right)+\underline{v}_{k}^{\prime} $$
with
$$ \mathbf{H}_{k}=\left.\frac{\partial \underline{h}_{k}\left(\underline{x}_{k}, \underline{v}_{k}\right)}{\partial \underline{x}_{k}^{\top}}\right|_{\underline{x}_{k}=\underline{\hat{x}}_{k}^{p}, \underline{v}_{k}=\underline{\hat{v}}_{k}} \qquad \mathbf{L}_{k}=\left.\frac{\partial \underline{h}_{k}\left(\underline{x}_{k}, \underline{v}_{k}\right)}{\partial \underline{v}_{k}^{\top}}\right|_{\underline{x}_{k}=\underline{\hat{x}}_{k}^{p}, \underline{v}_{k}=\underline{\hat{v}}_{k}} $$

Filter step
$$ \begin{aligned} \mathbf{K}_{k}&=\mathbf{C}_{k}^{p} \mathbf{H}_{k}^{\top}\left(\mathbf{L}_{k} \mathbf{C}_{k}^{v} \mathbf{L}_{k}^{\top}+\mathbf{H}_{k} \mathbf{C}_{k}^{p} \mathbf{H}_{k}^{T}\right)^{-1} \\\\ \hat{\underline{x}}_{k}^{e}&=\hat{\underline{x}}_{k}^{p}+\mathbf{K}_{k}\left[\hat{\underline{y}}_{k}-\underline{h}_{k}\left(\hat{\underline{x}}_{k}^{p}, \hat{\underline{v}}_{k}\right)\right] \\\\ \mathbf{C}_{k}^{e}&=\mathbf{C}_{k}^{p}-\mathbf{K}_{k} \mathbf{H}_{k} \mathbf{C}_{k}^{p} = (\mathbf{I} - \mathbf{K}_{k} \mathbf{H}_{k})\mathbf{C}_{k}^{p} \end{aligned} $$
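
The EKF equations can be tried out on a scalar toy problem. The following minimal sketch assumes a hypothetical system function $a(x, u) = \sin(x) + u$ and a hypothetical measurement function $h(x, v) = x^2 + v$. Both functions, their hand-derived derivatives and all numbers are illustrative assumptions, not part of the lecture.

```python
import math

def a(x, u):
    return math.sin(x) + u           # hypothetical system function

def h(x, v):
    return x * x + v                 # hypothetical measurement function

def ekf_predict(x_e, C_e, u_hat, C_w):
    """Prediction: mean through a(.), covariance through the linearization."""
    A = math.cos(x_e)                # da/dx evaluated at (x_e, u_hat)
    B = 1.0                          # da/du
    x_p = a(x_e, u_hat)
    C_p = A * C_e * A + B * C_w * B
    return x_p, C_p

def ekf_update(x_p, C_p, y_meas, C_v):
    """Filter step: linearize h(.) about the prior estimate x_p and v = 0."""
    H = 2.0 * x_p                    # dh/dx evaluated at (x_p, 0)
    L = 1.0                          # dh/dv
    K = C_p * H / (L * C_v * L + H * C_p * H)
    x_e = x_p + K * (y_meas - h(x_p, 0.0))
    C_e = (1.0 - K * H) * C_p
    return x_e, C_e

# One predict/update cycle with assumed values
x_p, C_p = ekf_predict(x_e=0.0, C_e=1.0, u_hat=0.5, C_w=0.1)
x_e, C_e = ekf_update(x_p, C_p, y_meas=0.3, C_v=0.2)
print(x_p, C_p)
print(x_e, C_e)
```

Note that the Jacobians are re-evaluated at the current estimate in every call, so the linearized model changes from step to step even though `a` and `h` themselves do not depend on time.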

Problems with linearization

The posterior distribution is computed well only for “weak” nonlinearity

$\rightarrow$ Induced nonlinearity: the effective nonlinearity is induced by the uncertainty of the prior density

If the prior density is narrow (black curve at the bottom of the figure), linearization works well.

If the prior density is made broader (green curve at the bottom of the figure), a problem arises: the resulting density of $y$ is no longer symmetric.

Induced nonlinearity means that a function cannot be called particularly linear or particularly nonlinear in absolute terms. A kink has the potential to cause problems, but it causes none as long as the density stays entirely to the left or entirely to the right of the kink. Only when the density extends across the kink do problems arise. This is the induced nonlinearity, which is induced by the noise.

Linearization only about a single point

The linearized system is in general time-variant, even if the original system is time-invariant, because the linearization depends on the estimate.

Estimation in probabilistic form: nonlinear Kalman filter

Computing expected values

Given:

Function $\underline{y}=\underline{g}(\underline{x})$

$\underline{x} \sim f_x(\underline{x})$

Wanted: certain moments of $\underline{y}$

E.g., in the scalar case
$$ y = g(x) $$
we are looking for $E\left\{y^{j}\right\}, j \in \mathbb{N}$ .

We know
$$ E\left\{y^{j}\right\}=\int_{\mathbb{R}} y^{j} f_{y}(y) d y $$
But

$f_y(y)$, the posterior density, is often not easy to compute

Even if it can be computed, computing $f_y(y)$ is far too expensive when only moments are needed

Theorem (duality in computing expected values)
$$ E_{f_y}\left\{y^{j}\right\}=E_{f_{x}}\left\{[g(x)]^{j}\right\}=\int_{\mathbb{R}}[g(x)]^{j} f_{x}(x) d x $$

$f_y(y)$ therefore does not have to be computed.

Useful when

$f_y(y)$ is hard to compute

first-order nonlinear moments of $f_x(\cdot)$ are easy to compute

For a sample-based approximation of the prior density $f_x(x)$
$$ f_{x}(x)=\sum_{i=1}^{L} w_{i} \delta\left(x-x_{i}\right) $$
computing the posterior density $f_y(y)$ is trivial.
$$ f_y(y)=\sum_{i=1}^{L} w_{i} \delta\left(y-y_{i}\right), \quad y_{i}=g\left(x_{i}\right) $$
Hence
$$ E\left\{y^{j}\right\}=\int_{\mathbb{R}} y^{j} f_{y}(y) d y=\sum_{i=1}^{L} w_{i} y_{i}^{j}=\sum_{i=1}^{L} w_{i}\left[g\left(x_{i}\right)\right]^{j} $$
and
$$ E\left\{[g(x)]^{j}\right\}=\int_{\mathbb{R}}[g(x)]^{j} f_{x}(x) d x=\sum_{i=1}^{L} w_{i}\left[g\left(x_{i}\right)\right]^{j} $$
In this sample-based case the two computations are identical. In the general case this does NOT hold: the values still agree by the theorem, but the computations differ! 🤪
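
For a weighted sample set, both routes to the moments are a single sum. The following minimal sketch uses three hypothetical samples with hypothetical weights and $g(x) = x^2$; the function names are illustrative.

```python
def moment_via_x(g, samples, weights, j):
    """E{[g(x)]^j} computed directly on the prior samples."""
    return sum(w * g(x) ** j for x, w in zip(samples, weights))

def moment_via_y(g, samples, weights, j):
    """E{y^j} computed on the propagated samples y_i = g(x_i); weights are unchanged."""
    y_samples = [g(x) for x in samples]
    return sum(w * y ** j for y, w in zip(y_samples, weights))

def g(x):
    return x * x                      # hypothetical nonlinear function

samples = [-1.0, 0.0, 1.0]            # assumed sample positions x_i
weights = [0.25, 0.5, 0.25]           # assumed weights w_i, summing to one

for j in (1, 2):
    print(j, moment_via_x(g, samples, weights, j), moment_via_y(g, samples, weights, j))
```

Both functions evaluate the same weighted sum, which mirrors the observation that the two computations coincide for sample-based densities.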

Prediction in probabilistic form

System model
$$ \underline{x}_{k+1}=\underline{a}_{k}\left(\underline{x}_{k}, \underline{u}_{k}\right) $$

$\underline{x}$ : state

$\underline{u}$ : disturbance

For a Kalman filter we want the mean and the covariance matrix at the next time step.

Mean
$$ \underline{\hat{x}}_{k+1}^{p}=E\left\{\underline{a}_{k}\left(\underline{x}_{k}, \underline{u}_{k}\right)\right\}=\int_{\mathbb{R}^{N}} \int_{\mathbb{R}^{P}} \underline{a}_{k}\left(\underline{x}_{k}, \underline{u}_{k}\right) f_{k}^{x u}\left(\underline{x}_{k}, \underline{u}_{k}\right) d\underline{u}_{k} d\underline{x}_{k} $$
Usually $\underline{x}_{k}, \underline{u}_{k}$ are independent, so
$$ f_{k}^{x u}\left(\underline{x}_{k}, \underline{u}_{k}\right) = f_{k}^{e}\left(\underline{x}_{k}\right) \cdot f_{k}^{u}\left(\underline{u}_{k}\right). $$
Further assume that $\underline{x}_{k}, \underline{u}_{k}$ are normally distributed, i.e.
$$ \begin{aligned} f_{k}^{e}\left(\underline{x}_{k}\right) &= \mathcal{N}(\underline{x}_{k}, \underline{\hat{x}}_{k}^{e}, \mathbf{C}_{k}^{e}) \\\\ f_{k}^{u}\left(\underline{u}_{k}\right) &= \mathcal{N}\left(\underline{u}_{k}, \hat{\underline{u}}_{k}, \mathbf{C}_{k}^{w}\right) \end{aligned} $$
For additive noise
$$ \underline{x}_{k+1}=\underline{a}_{k}\left(\underline{x}_{k}\right)+\underline{u}_{k} \left(= \underline{a}_{k}\left(\underline{x}_{k}\right)+(\underline{\hat{u}}_{k} + \underline{w}_k)\right) $$
we obtain
$$ \underline{\hat{x}}_{k+1}^{p}=\int_{\mathbb{R}^{n}} \underline{a}_{k}\left(\underline{x}_{k}\right) \cdot f_{k}^{e}\left(\underline{x}_{k}\right) d \underline{x}_{k}+\underline{\hat{u}}_{k} $$
Then
$$ \begin{aligned} \underline{x}_{k+1} - \underline{\hat{x}}_{k+1}^{p} &= (\underline{a}_{k}(\underline{x}_{k})+(\underline{\hat{u}}_{k} + \underline{w}_k)) - \left(\int_{\mathbb{R}^{n}} \underline{a}_{k}\left(\underline{x}_{k}\right) \cdot f_{k}^{e}\left(\underline{x}_{k}\right) d \underline{x}_{k}+\underline{\hat{u}}_{k}\right) \\\\ &= \underbrace{\underline{a}_{k}(\underline{x}_{k}) - \int_{\mathbb{R}^{n}} \underline{a}_{k}\left(\underline{x}_{k}\right) \cdot f_{k}^{e}\left(\underline{x}_{k}\right) d \underline{x}_{k}}_{:= \underline{\bar{a}}_{k}(\underline{x}_{k})} + \underline{w}_k \end{aligned} $$
The covariance matrix is
$$ \begin{aligned} \mathbf{C}_{k+1}^{p} &= E\left\{\left(\underline{x}_{k+1}-\underline{\hat{x}}_{k+1}^{p}\right)(\underline{x}_{k+1}-\underline{\hat{x}}_{k+1}^{p})^{\top}\right\} \\\\ &= E\left\{(\underline{\bar{a}}_{k}(\underline{x}_{k}) + \underline{w}_k) (\underline{\bar{a}}_{k}(\underline{x}_{k}) + \underline{w}_k) ^ \top\right\} \\\\ &= E\left\{\underline{\bar{a}}_{k}(\underline{x}_k) \underline{\bar{a}}_{k}^\top(\underline{x}_k) + \underline{w}_k\underline{\bar{a}}_{k}^\top(\underline{x}_k) + \underline{\bar{a}}_{k}(\underline{x}_k)\underline{w}_k^\top + \underline{w}_k\underline{w}_k^\top\right\} \\\\ &= E\left\{\underline{\bar{a}}_{k}(\underline{x}_k) \underline{\bar{a}}_{k}^\top(\underline{x}_k)\right\} + \underbrace{E\left\{\underline{w}_k\underline{\bar{a}}_{k}^\top(\underline{x}_k)\right\}}_{=\mathbf{0}} + \underbrace{E\left\{\underline{\bar{a}}_{k}(\underline{x}_k)\underline{w}_k^\top\right\}}_{=\mathbf{0}} + E\left\{\underline{w}_k\underline{w}_k^\top\right\} \\\\ &= E\left\{\underline{\bar{a}}_{k}(\underline{x}_k) \underline{\bar{a}}_{k}^\top(\underline{x}_k)\right\} + E\left\{\underline{w}_k\underline{w}_k^\top\right\} \\\\ &= \int_{\mathbb{R}^{N}} \overline{\underline{a}}_{k}\left(\underline{x}_{k}\right) \overline{\underline{a}}_{k}^{\top}\left(\underline{x}_{k}\right) f_{k}^{e}\left(\underline{x}_{k}\right) d \underline{x}_{k}+\mathbf{C}_{k}^{w} \end{aligned} $$
The cross terms vanish because $\underline{w}_k$ is zero-mean and independent of $\underline{x}_k$.
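
If the prior density $f_k^e$ is replaced by a weighted sample set, both integrals for the additive-noise case turn into sums. The following minimal sketch assumes the scalar case and a hypothetical two-point sample set placed at $\hat{x}_k^e \pm \sigma$ with equal weights, which reproduces the prior mean and variance. How the integrals are evaluated is an assumption made here for illustration.

```python
def predict_with_samples(a, samples, weights, u_hat, C_w):
    """Scalar prediction for x_{k+1} = a(x_k) + u_hat + w_k with a sample-based prior."""
    mean_a = sum(w * a(x) for x, w in zip(samples, weights))    # approximates the integral of a * f_e
    x_p = mean_a + u_hat
    C_p = sum(w * (a(x) - mean_a) ** 2 for x, w in zip(samples, weights)) + C_w
    return x_p, C_p

# Assumed prior: mean 1.0, variance 0.25 (sigma = 0.5), represented by two samples
samples = [0.5, 1.5]
weights = [0.5, 0.5]

# Linear function: recovers the linear Kalman prediction 2*1.0 + 0.3 and 4*0.25 + 0.1
x_lin, C_lin = predict_with_samples(lambda x: 2.0 * x, samples, weights, u_hat=0.3, C_w=0.1)

# Nonlinear function: the predicted mean differs from a(1.0) + u_hat = 1.3
x_sq, C_sq = predict_with_samples(lambda x: x * x, samples, weights, u_hat=0.3, C_w=0.1)
print(x_lin, C_lin)
print(x_sq, C_sq)
```

For the quadratic function, the predicted mean is not obtained by simply pushing the prior mean through the function, which is the effect a pure linearization about one point cannot capture.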
Filtering in Probabilistic Form

Aside: Conditioning a Joint Gaussian Density

Random vector $\underline{z}=\left[\begin{array}{l}\underline{x} \\ \underline{y}\end{array}\right]$ with a joint Gaussian distribution:
$$ f(\underline{z})=\mathcal{N}\left(\underline{z}, \underline{\hat{z}}, \mathbf{C}_{z}\right), \quad \underline{\hat{z}}=\left[\begin{array}{l} \underline{\hat{x}} \\ \underline{\hat{y}} \end{array}\right], \quad \mathbf{C}_{z}=\left[\begin{array}{ll} C_{x x} & C_{x y} \\ C_{y x} & C_{y y} \end{array}\right] $$
Given: measurement $\underline{y}^\ast$

Conditional distribution:
$$ \begin{equation} f\left(\underline{x} \mid \underline{y}^{\ast}\right)= \mathcal{N}\left(\underline{x}, \underline{\hat{x}}^{*}, \mathbf{C}_{x}^{\ast}\right) \end{equation} $$
Then
$$ \begin{array}{l} \underline{\hat{x}}^{\ast}=\underline{\hat{x}}+C_{x y} C_{y y}^{-1}\left(\underline{y}^{*}-\underline{\hat{y}}\right) \\ \mathbf{C}_{x}^{\ast}=C_{x x}-C_{x y} C_{y y}^{-1} C_{y x} \end{array} \tag{*} $$
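The conditioning formulas $(*)$ translate almost line by line into code. The following is a minimal NumPy sketch; the function name `condition_gaussian` and the numerical values are illustrative assumptions, not part of the lecture material.

```python
import numpy as np

def condition_gaussian(x_hat, y_hat, C_xx, C_xy, C_yy, y_star):
    """Condition a joint Gaussian over [x; y] on the observation y = y_star."""
    # gain = C_xy C_yy^{-1}; solve a linear system instead of inverting C_yy
    # (valid because C_yy is symmetric).
    gain = np.linalg.solve(C_yy, C_xy.T).T
    x_cond = x_hat + gain @ (y_star - y_hat)
    C_cond = C_xx - gain @ C_xy.T          # C_yx = C_xy^T
    return x_cond, C_cond

# Illustrative numbers (assumed): 2D state, scalar measurement
x_hat = np.array([1.0, 2.0])
y_hat = np.array([0.5])
C_xx = np.array([[2.0, 0.3], [0.3, 1.0]])
C_xy = np.array([[0.8], [0.2]])
C_yy = np.array([[1.5]])

x_cond, C_cond = condition_gaussian(x_hat, y_hat, C_xx, C_xy, C_yy, np.array([1.1]))
```

Note that the conditional covariance does not depend on the measured value $\underline{y}^\ast$, only on the joint covariance.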
Alternative Derivation of the Kalman Filter

Measurement model
$$ \underline{y}=\mathbf{H} \cdot \underline{x}+\underline{v} $$
Given (with $\underline{x}_{p}$ and $\underline{v}$ independent):
$$ \underline{x}_{p} \sim \mathcal{N}\left(\hat{\underline{x}}_{p}, \mathbf{C}_{p}\right), \quad \underline{v}\sim \mathcal{N}\left(\underline{0}, \mathbf{C}_{v}\right) $$
We define $\underline{z}$ as the joint vector of $\underline{x}_{p}$ and $\underline{y}$:
$$ \underline{z}:=\left[\begin{array}{l} \underline{x}_{p} \\ \underline{y} \end{array}\right]=\left[\begin{array}{l} \underline{x}_{p} \\ \mathbf{H} \cdot \underline{x}_{p}+\underline{v} \end{array}\right]=\left[\begin{array}{ll} \mathbf{I} & 0 \\ \mathbf{H} & \mathbf{I} \end{array}\right]\left[\begin{array}{c} \underline{x}_{p} \\ \underline{v} \end{array}\right] $$
The expected value is then
$$ \underline{\hat{z}}=\left[\begin{array}{c} \underline{\hat{x}}_{p} \\ \mathbf{H} \cdot \underline{\hat{x}}_{p} \end{array}\right] $$
The covariance matrix of $\underline{z}$:
$$ \begin{aligned} \operatorname{Cov}\{\underline{z}\}&=E\left\{\left[\begin{array}{ll} \mathbf{I} & 0 \\ \mathbf{H} & \mathbf{I} \end{array}\right]\left[\begin{array}{c} \underline{x}_{p}-\hat{\underline{x}}_{p} \\ \underline{v} \end{array}\right]\left[\begin{array}{c} \underline{x}_{p}-\underline{\hat{x}}_{p} \\ \underline{v} \end{array}\right]^{\top}\left[\begin{array}{cc} \mathbf{I} & \mathbf{H}^{\top} \\ 0 & \mathbf{I} \end{array}\right]\right\}\\\\ &=\left[\begin{array}{ll} \mathbf{I} & 0 \\ \mathbf{H} & \mathbf{I} \end{array}\right] E\left\{\left[\begin{array}{cc} \underbrace{\left(\underline{x}_{p}-\underline{\hat{x}}_{p}\right)(\underline{x}_{p}-\underline{\hat{x}}_{p})^{\top}}_{=\mathbf{C}_p} & \underbrace{\left(\underline{x}_{p}-\underline{\hat{x}}_{p}\right) \underline{v}^{\top}}_{=0} \\ \underbrace{\underline{v}\left(\underline{x}_{p}-\underline{\hat{x}}_{p}\right)^{\top}}_{=0} & \underbrace{\underline{v} \underline{v}^{\top}}_{=\mathbf{C}_v} \end{array}\right]\right\}\left[\begin{array}{cc} \mathbf{I} & \mathbf{H}^{\top} \\ 0 & \mathbf{I} \end{array}\right]\\\\ &=\left[\begin{array}{ll} \mathbf{I} & 0 \\ \mathbf{H} & \mathbf{I} \end{array}\right]\left[\begin{array}{cc} \mathbf{C}_p & 0 \\ 0 & \mathbf{C}_v \end{array}\right]\left[\begin{array}{cc} \mathbf{I} & \mathbf{H}^{\top} \\ 0 & \mathbf{I} \end{array}\right]=\left[\begin{array}{cc} \mathbf{C}_{p} & \mathbf{C}_{p} \mathbf{H}^{\top} \\ \mathbf{H} \mathbf{C}_{p} & \mathbf{C}_{v}+\mathbf{H} \mathbf{C}_{p} \mathbf{H}^{\top} \end{array}\right] \end{aligned} $$
Setting
$$ \mathbf{C}_{x x}=\mathbf{C}_{p} \quad \mathbf{C}_{x y}=\mathbf{C}_{p} \mathbf{H}^{\top} \quad \mathbf{C}_{y x}=\mathbf{H} \mathbf{C}_{p} \quad \mathbf{C}_{y y}=\mathbf{C}_{v}+\mathbf{H} \mathbf{C}_{p} \mathbf{H}^{\top} $$
and substituting into $(*)$ yields the Kalman filter.
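As a quick sanity check of this derivation, the block assignments can be plugged into the conditioning formulas numerically. This is a small sketch with assumed example values; `kalman_update_via_conditioning` is a hypothetical helper name.

```python
import numpy as np

def kalman_update_via_conditioning(x_p, C_p, H, C_v, y_meas):
    """Kalman filter step obtained by conditioning the joint Gaussian of (x_p, y)."""
    C_xy = C_p @ H.T                       # cross-covariance block
    C_yy = H @ C_p @ H.T + C_v             # measurement covariance block
    K = np.linalg.solve(C_yy, C_xy.T).T    # K = C_xy C_yy^{-1} (Kalman gain)
    x_e = x_p + K @ (y_meas - H @ x_p)
    C_e = C_p - K @ H @ C_p                # C_yx = H C_p
    return x_e, C_e, K

# Assumed example: 2D state, only the first component is measured
x_p = np.array([0.0, 1.0])
C_p = np.array([[1.0, 0.2], [0.2, 0.5]])
H = np.array([[1.0, 0.0]])
C_v = np.array([[0.25]])

x_e, C_e, K = kalman_update_via_conditioning(x_p, C_p, H, C_v, np.array([0.5]))
```

The gain `K` that falls out of the conditioning is exactly the familiar $\mathbf{C}_p \mathbf{H}^\top(\mathbf{H}\mathbf{C}_p\mathbf{H}^\top + \mathbf{C}_v)^{-1}$.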

Filtering in Probabilistic Form

Measurement equation
$$ \underline{y}=\underline{h}(\underline{x}, \underline{v}) $$
Define
$$ \underline{z}=\left[\begin{array}{l} \underline{x} \\ \underline{y} \end{array}\right]=\left[\begin{array}{c} \underline{x} \\ \underline{h}(\underline{x}, \underline{v}) \end{array}\right] \Rightarrow E\{\underline{z}\}=\left[\begin{array}{c} \underline{\hat{x}}_{p} \\ E\{\underline{h}(\underline{x}, \underline{v})\} \end{array}\right] $$
For additive noise
$$ \underline{y}=\underline{h}(\underline{x})+\underline{v} $$
we have
$$ E\{\underline{z}\}=\left[\begin{array}{c} \underline{\hat{x}}_{p} \\ E\{\underline{h}(\underline{x})\} \end{array}\right], \quad E\{\underline{h}(\underline{x})\}=\int_{\mathbb{R}^{N}} \underline{h}(\underline{x}) \underbrace{f_{p}(\underline{x})}_{= \mathcal{N}(\underline{x}, \underline{\hat{x}}_{p}, \mathbf{C}_p)} d \underline{x} \in \mathbb{R}^{M} $$
Covariance matrix, using the centered measurement function $\underline{\bar{h}}(\underline{x}) := \underline{h}(\underline{x}) - E\{\underline{h}(\underline{x})\}$:

$$ C_{x y} = E\left\{\left(\underline{x}-\underline{\hat{x}}_{p}\right) \underline{\bar{h}}^{\top}(\underline{x})\right\} = \int_{\mathbb{R}^N} (\underline{x}-\underline{\hat{x}}_{p}) \underline{\bar{h}}^{\top}(\underline{x}) f_p(\underline{x}) d\underline{x} \in \mathbb{R}^{N \times M} $$ $$ C_{y y} = E\left\{\underline{\bar{h}}(\underline{x}) \underline{\bar{h}}^{\top}(\underline{x})\right\} + \mathbf{C}_v = \int_{\mathbb{R}^N} \underline{\bar{h}}(\underline{x}) \underline{\bar{h}}^{\top}(\underline{x}) f_p(\underline{x}) d\underline{x} + \mathbf{C}_v \in \mathbb{R}^{M \times M} $$ $$ \operatorname{Cov}\{\underline{z}\}=\left[\begin{array}{ll} \overbrace{C_{x x}}^{\mathbb{R}^{N \times N}} & \overbrace{C_{x y}}^{\mathbb{R}^{N \times M}}\\ \underbrace{C_{y x}}_{\mathbb{R}^{M \times N}} & \underbrace{C_{y y}}_{\mathbb{R}^{M \times M}} \end{array}\right] \in \mathbb{R}^{(N+M) \times (N+M)} $$
Substituting into $(\ast)$ yields the nonlinear Kalman filter:
$$ \begin{array}{l} \underline{\hat{x}}_{e}=\underline{\hat{x}}_{p}+\mathbf{C}_{x y} \mathbf{C}_{y y}^{-1}(\underline{\hat{y}}-E\{\underline{h}(\underline{x})\}) \\ \mathbf{C}_{e}=\mathbf{C}_{p}-\mathbf{C}_{x y} \mathbf{C}_{y y}^{-1} \mathbf{C}_{y x} \end{array} $$

Computing the Moments: Unscented Kalman Filter (UKF)

Mon, 11 Jul 2022 00:00:00 +0000

Analytic Moments

Seemingly the best method, since it is fast and has a fixed runtime 👍

But

The derivation is laborious

The formulas quickly become unwieldy

Example: Cubic Sensor Problem (Scalar)

The output $y$ depends nonlinearly on the state $x$:
$$ y=h(x)+v=x^{3}+v $$
Given

Prior estimate $x_p \sim \mathcal{N}(\hat{x}_p, \sigma_p^2)$

Measurement $\hat{y}$

Noise $v$ is Gaussian with $E\{v\}=0, \operatorname{Cov}\{v\}=\sigma_{v}^{2}$

Define
$$ z := \left[\begin{array}{l} x \\ y \end{array}\right] \Rightarrow E\{\underline{z}\}=\left[\begin{array}{c} \hat{x}_{p} \\ E\{h(x)\} \end{array}\right] $$
with
$$ E\{h(x)\}=\int_{\mathbb{R}} h(x) f_{p}(x) d x=\int_{\mathbb{R}} x^{3} f_{p}(x) d x=\hat{x}_{p}^{3}+3 \hat{x}_{p} \sigma_{p}^{2}=:E_{3} $$
Define
$$ \bar{h}(x)=h(x)-E\{h(x)\} $$
Then
$$ \operatorname{Cov}\{\underline{z}\}=\left[\begin{array}{ll} \mathbf{C}_{x x} & \mathbf{C}_{x y} \\ \mathbf{C}_{y x} & \mathbf{C}_{y y} \end{array}\right]=\left[\begin{array}{cc} \sigma_{p}^{2} & E\left\{\left(x-\hat{x}_{p}\right) \bar{h}(x)\right\} \\ E\left\{\left(x-\hat{x}_{p}\right) \bar{h}(x)\right\} & E\left\{\overline{h}^{2}(x)\right\}+\sigma_{v}^{2} \end{array}\right] $$ $$ \begin{aligned} E\left\{\left(x-\hat{x}_{p}\right)\bar{h}(x)\right\} &= E\left\{\left(x-\hat{x}_{p}\right)\left(x^{3}-E_{3}\right)\right\} \\ &= E\left\{x^{4}-\hat{x}_{p} x^{3}-E_{3} x+\hat{x}_{p} E_{3}\right\} \\ &= E_4 - \hat{x}_p E_3 - E_3 \hat{x}_p + \hat{x}_p E_3 \\ &= E_4 - \hat{x}_p E_3 \end{aligned} $$
with
$$ \begin{aligned} E_{4}&=E\{x^{4}\}=\hat{x}_{p}^{4}+6 \hat{x}_{p}^{2} \sigma_{p}^{2}+3\sigma_{p}^{4} \\\\ \mathbf{C}_{xy} = E_4 - \hat{x}_p E_3 &=\hat{x}_{p}^{4}+6 \hat{x}_{p}^{2} \sigma_{p}^{2}+3\sigma_{p}^{4}-\hat{x}_{p}^{4}-3 \hat{x}_{p}^{2} \sigma_{p}^{2} \\\\ &=3 \sigma_{p}^{4}+3 \hat{x}_{p}^{2} \sigma_{p}^{2} \\\\ &=3\sigma_{p}^{2}\left(\hat{x}_{p}^{2}+\sigma_{p}^{2}\right) \end{aligned} $$
and
$$ E\left\{\bar{h}^{2}(x)\right\}=9 \hat{x}_{p}^{4} \sigma_{p}^{2}+36 \hat{x}_{p}^{2} \sigma_{p}^{4}+15\sigma_{p}^{6} $$
Substituting into the Kalman filter update equations yields
$$ \begin{array}{l} \hat{x}_{e}=\hat{x}_{p}+\mathbf{C}_{xy}\mathbf{C}_{yy}^{-1}(\hat{y}-E\{h(x)\}) \overset{\text{scalar}}{=} \hat{x}_{p}+\frac{\mathbf{C}_{x y}}{\mathbf{C}_{y y}}(\hat{y}-E\{h(x)\}) \\ \sigma_{e}^{2}= \sigma_{p}^{2}-\mathbf{C}_{xy}\mathbf{C}_{yy}^{-1}\mathbf{C}_{yx} \overset{\text{scalar}}{=} \sigma_{p}^{2}-\frac{\mathbf{C}_{x y}^{2}}{\mathbf{C}_{y y}} \end{array} $$
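The closed-form moments of the cubic sensor fit into a few lines of code. The sketch below assumes the formulas $E_3 = \hat{x}_p^3 + 3\hat{x}_p\sigma_p^2$, $\mathbf{C}_{xy} = 3\sigma_p^2(\hat{x}_p^2 + \sigma_p^2)$ and $\mathbf{C}_{yy} = E\{\bar{h}^2(x)\} + \sigma_v^2$; the function name and the numbers in the call are made up for illustration.

```python
def cubic_sensor_update(x_p, var_p, var_v, y_meas):
    """Analytic-moment filter step for y = x^3 + v with a Gaussian prior."""
    E3 = x_p**3 + 3.0 * x_p * var_p                       # E{h(x)}
    C_xy = 3.0 * var_p * (x_p**2 + var_p)                 # E{(x - x_p) h_bar(x)}
    C_yy = (9.0 * x_p**4 * var_p + 36.0 * x_p**2 * var_p**2
            + 15.0 * var_p**3 + var_v)                    # E{h_bar^2} + sigma_v^2
    x_e = x_p + C_xy / C_yy * (y_meas - E3)
    var_e = var_p - C_xy**2 / C_yy
    return x_e, var_e

# Assumed values: prior N(1, 1), noise variance 4, measurement equal to E3 = 4
x_e, var_e = cubic_sensor_update(x_p=1.0, var_p=1.0, var_v=4.0, y_meas=4.0)
```

Because the measurement equals the predicted measurement in this call, the mean stays at the prior mean while the variance still shrinks.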
Aside: Moments of a Gaussian Density

Theorem

The central moments of a Gaussian density are given by
$$ C\_{i}=E\_{f}\left\\{(\boldsymbol{x}-\hat{x})^{i}\right\\}=\left\\{\begin{array}{ll} \displaystyle\prod\_{j=1, j\text{ odd}}^{i-1} j \sigma^{i}=1 \cdot 3 \cdot 5 \cdots(i-1) \sigma^{i} & i \text { even } \\\\ 0 & i \text { odd } \end{array}\right. $$
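The theorem can be turned into a tiny helper, which is handy for checking hand-derived moments such as those in the cubic sensor example. The function name is an assumption chosen for this sketch.

```python
def gaussian_central_moment(i, sigma):
    """i-th central moment of a scalar Gaussian with standard deviation sigma."""
    if i % 2 == 1:
        return 0.0                     # odd central moments vanish by symmetry
    product = 1.0
    for j in range(1, i, 2):           # j = 1, 3, 5, ..., i-1
        product *= j
    return product * sigma**i

fourth = gaussian_central_moment(4, 1.0)   # 1 * 3
sixth = gaussian_central_moment(6, 1.0)    # 1 * 3 * 5
```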

Numerical Moments

Use of standard numerical integration methods

👍 Advantages

Fast implementations are available

Adjustable accuracy

Adaptive integration

👎 Disadvantages

Not tailored to the specific problem of moment computation

Based on Samples of the Prior Density

Approximation of the prior Gaussian density by samples

Various methods with differing complexity, efficiency, and accuracy

Random sampling with a random number generator $\rightarrow$ independent samples

Deterministic sampling (e.g. on an equidistant grid)

Minimal approximation on the principal axes

Uses $2N$ or $2N + 1$ samples ($N$: number of dimensions)

Accurate approximation on the principal axes

General sample approximation $\rightarrow$ systematic approximation by minimizing a quality measure

Aside: Dirac Delta Function

Consider the limiting case of a Gaussian density
$$ f(x, m, \sigma)=\frac{1}{\sqrt{2 \pi} \sigma} \exp \left\{-\frac{1}{2} \frac{(x-m)^{2}}{\sigma^{2}}\right\} $$
for $\sigma \rightarrow 0$

Figure: plots of several Gaussian densities for $m=0$.

Dirac delta function
$$ \delta(x-m)=\lim _{\sigma \rightarrow 0} f(x, m, \sigma) $$
As the width goes to 0 ($\sigma \to 0$), the height goes to infinity, while the area stays 1:
$$ \int_{-\infty}^{\infty} \delta(x-m) d x=1 $$

Definition: Dirac delta function
$$ \delta(x)=\left\\{\begin{array}{cc} \text{not defined} & x=0 \\\\ 0 & \text { otherwise } \end{array}\right. $$ $$ \int_{-\infty}^{\infty} \delta(x) d x=\int_{-\varepsilon}^{\varepsilon} \delta(x) d x=1, \varepsilon>0 $$

By definition, the Dirac delta function has all the properties of a density

Important properties

$f(x) \cdot \delta(x-m)=f(m) \delta(x-m)$

$\int_{\mathbb{R}} f(x) \delta(x-m) d x=f(m)$

Heaviside Function (Unit Step Function)

Cumulative distribution function of the Gaussian density
$$ F(x)=P(\boldsymbol{x} \leq x)=\int_{-\infty}^{x} f(t) d t=\frac{1}{2}\left\{1+\operatorname{erf}\left(\frac{x-m}{\sqrt{2} \sigma}\right)\right\} $$
It holds that
$$ f(x)=\frac{d}{d x} F(x) $$

Definition: Heaviside function
$$ H(x-m)=\lim\_{\sigma \to 0} F(x)=\left\\{\begin{array}{ll} 1 & x>m \\\\ \frac{1}{2} & x=m \\\\ 0 & x<m \end{array}\right. $$

The cumulative distribution function of $\delta(x)$ is $H(x)$, with
$$ \begin{array}{l} H(x)=\displaystyle\int_{-\infty}^{x} \delta(t) d t \\\\ \delta(x)=\frac{d}{d x} H(x) \end{array} $$
Multivariate Dirac Delta Function

Dirac mixture density (Dirac mixture)
$$ f(x)=\sum_{i=1}^{L} \omega_{i} \delta \left(x-x_{i}\right) $$
Multivariate Dirac density
$$ \delta(\underline{x})=\delta\left(x_{1}\right) \cdot \delta\left(x_{2}\right) \cdot \ldots, \quad \underline{x}=\left[x_{1}, x_{2}, \ldots\right]^{\top} $$
Multivariate Dirac mixture density
$$ f(\underline{x})=\sum_{i=1}^{L} \omega_{i} \delta\left(\underline{x}-\underline{x}_{i}\right) $$
Conversion SNV $\rightarrow$ General Gaussian Density

(SNV = standard normal distribution $\mathcal{N}(0, 1)$)

A natural solution to the problem

Various options with differing complexity and efficiency

Assumption: we have an approximation method that can approximate a standard normal distribution in multiple (higher) dimensions.

Given: Gaussian density with $\underline{\hat{x}}=\underline{0}$ and $\mathbf{C}_x = \mathbf{I}_N$ ($N$-dimensional identity matrix)

Wanted: density with arbitrary mean $\underline{\hat{y}}$ and covariance matrix $\mathbf{C}_y$

We use the Cholesky decomposition
$$ \mathbf{C}_{y}=\mathcal{C}_{y} \cdot \mathcal{C}_{y}^{\top} $$
where $\mathcal{C}_y$ is a lower triangular matrix.

Conversion
$$ \underline{y}=\mathcal{C}_{y} \cdot \underline{x}+\underline{\hat{y}} $$
Proof:
$$ E\{\underline{y}\}=E\left\{\mathcal{C}_{y} \cdot \underline{x}+\underline{\hat{y}}\right\}=\mathcal{C}_{y} \underbrace{E\{\underline{x}\}}_{=\underline{0}}+\underbrace{E\{\underline{\hat{y}}\}}_{=\underline{\hat{y}}}=\underline{\hat{y}} $$ $$ \begin{aligned} \operatorname{Cov}\{\underline{y}\} &=E\left\{(\underline{y}-E\{\underline{y}\})(\underline{y}-E\{\underline{y}\})^{\top}\right\} \\ &=E\left\{(\underline{y}-\underline{\hat{y}})(\underline{y}-\underline{\hat{y}})^{\top}\right\} \\ &=E\left\{\mathcal{C}_{y} \cdot \underline{x} \cdot \underline{x}^{\top} \mathcal{C}_{y}^{\top}\right\}\\ &=\mathcal{C}_{y} \cdot \underbrace{E\left\{\underline{x}\underline{x}^{\top}\right\}}_{=\mathbf{C}_{x}=\mathbf{I}_{N}} \cdot \mathcal{C}_{y}^{\top} \\ &=\mathcal{C}_{y} \cdot \mathbf{I}_{N} \cdot \mathcal{C}_{y}^{\top}=\mathcal{C}_{y} \cdot \mathcal{C}_{y}^{\top} = \mathbf{C}_{y} \end{aligned} $$
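The conversion can be tried out on any equally weighted sample set with zero mean and unit covariance. The sketch below assumes such a set (four axis-aligned points in 2D) and an arbitrary target mean and covariance; `transform_standard_samples` is a hypothetical helper name.

```python
import numpy as np

def transform_standard_samples(samples, y_hat, C_y):
    """Map samples of N(0, I) (columns of `samples`) to mean y_hat, covariance C_y."""
    chol = np.linalg.cholesky(C_y)            # lower triangular, C_y = chol @ chol.T
    return chol @ samples + y_hat[:, None]    # y = chol * x + y_hat, column by column

# Assumed standard-normal sample set: zero mean, unit covariance, equal weights
samples = np.sqrt(2.0) * np.array([[1.0, -1.0, 0.0, 0.0],
                                   [0.0, 0.0, 1.0, -1.0]])
y_hat = np.array([1.0, -2.0])
C_y = np.array([[4.0, 1.2], [1.2, 1.0]])

Y = transform_standard_samples(samples, y_hat, C_y)
mean = Y.mean(axis=1)
cov = (Y - mean[:, None]) @ (Y - mean[:, None]).T / Y.shape[1]
```

The sample mean and sample covariance of the transformed set reproduce `y_hat` and `C_y` exactly, mirroring the proof above.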
Minimal Approximation of the SNV on the Principal Axes

1D case

Let the true density $\tilde{f}(x)$ be a 1D standard normal distribution (SNV). We want to represent it by a Dirac mixture
$$ f(x)=w_{1} \delta\left(x-x_{1}\right)+w_{2} \delta\left(x-x_{2}\right) \qquad w_{1}, w_{2} \geqslant 0 $$
The Gaussian density is symmetric $\Rightarrow$
$$ w_{1}=w_{2}=w, \quad x_{1}=-x_{2} $$
The integral must equal 1.
$$ \int_{\mathbb{R}} f(x) d x=w_{1}+w_{2}=2 w \stackrel{!}{=} 1 \Rightarrow w=\frac{1}{2} $$
Expected value:
$$ E_{f}\{x\}=0=E_{\tilde{f}}\{x\} $$
Variance:
$$ E_{f}\left\{x^{2}\right\}=\int_{\mathbb{R}} x^{2} f(x) d x=w x_{1}^{2}+w x_{2}^{2}=2 w x_{1}^{2} \stackrel{!}{=} 1 \Rightarrow x_{1}^{2}=1 \Rightarrow x_1 = -1, x_2 = 1 $$
2D case
$$ \begin{aligned} f(x, y)=& w_{1} \delta\left(x-x_{1}\right) \delta(y)+w_{2} \delta\left(x-x_{2}\right) \delta(y) & w_{1}, w_{2} \geqslant 0 \\ &+v_{1} \delta(x) \delta\left(y-y_{1}\right)+v_{2} \delta(x) \delta\left(y-y_{2}\right) & v_{1}, v_{2} \geqslant 0 \end{aligned} $$
Symmetry $\Rightarrow$
$$ w_{1}=w_{2}=v_{1}=v_{2}=w, \quad x_{1}=-x_{2}, \quad y_{1}=-y_{2} $$
Integral = 1
$$ \int_{\mathbb{R}^{2}} f(x, y) d x d y=w\left\{\int_{\mathbb{R}} \delta\left(x-x_{1}\right) d x \int_{\mathbb{R}} \delta(y) d y+\ldots\right\}=4 w \stackrel{!}{=} 1 \Rightarrow w=\frac{1}{4} $$
Variance
$$ \iint_{\mathbb{R}^{2}} x^{2} f(x, y) d x d y=w x_{1}^{2}+w x_{2}^{2}=2 w x_{1}^{2} \stackrel{!}{=} 1 \Rightarrow x_{1}^{2}=2 \Rightarrow x_1 = -\sqrt{2}, x_2 = \sqrt{2} $$
$x, y$ are uncorrelated but not independent:
$$ f(x, y) \neq f(x) \cdot f(y), \quad E\{x \cdot y\}=0 $$
N-dimensional case
$$ w=\frac{1}{2 N}, \quad \underline{x}=\left[x^{(1)}, x^{(2)}, \ldots, x^{(N)}\right]^{\top} \quad \Rightarrow \quad x_{1}^{(i)}=-\sqrt{N}, \quad x_{2}^{(i)}=+\sqrt{N}, \quad i=1, \ldots, N $$
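A short sketch of the $N$-dimensional construction: $2N$ points at $\pm\sqrt{N}$ on each coordinate axis, each with weight $\frac{1}{2N}$. The function name is an assumption; the moment check at the end confirms that mean and covariance of the standard normal are matched.

```python
import numpy as np

def axis_samples_standard_normal(n_dim):
    """2N equally weighted samples on the principal axes of N(0, I)."""
    scale = np.sqrt(n_dim)
    # Columns are the sample points: -sqrt(N) e_i and +sqrt(N) e_i
    points = np.hstack([-scale * np.eye(n_dim), scale * np.eye(n_dim)])
    weights = np.full(2 * n_dim, 1.0 / (2 * n_dim))
    return points, weights

points, weights = axis_samples_standard_normal(3)
mean = points @ weights                       # weighted sample mean
cov = (points * weights) @ points.T           # weighted second moment (mean is zero)
```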
Filter Procedure with Sampling of the Prior Density

Measurement function (example)
$$ y = x^3 + v $$
Prior estimate: Gaussian density $\tilde{f}_{p}(x)=\mathcal{N}\left(x, \hat{x}_{p}, \sigma_{p}^{2}\right)$

Noise: $v \sim \tilde{f}_v(v) = \mathcal{N}(v, 0, \sigma_v^2)$

Approximation
$$ f_{p}(x)=\frac{1}{2} \delta\left(x-x_{1}\right)+\frac{1}{2} \delta\left(x-x_{2}\right) $$
where
$$ x_1 = \hat{x}_p - \sigma_p, \quad x_2 = \hat{x}_p + \sigma_p $$ $$ f_v(v)=\frac{1}{2} \delta(v-v_{1})+\frac{1}{2} \delta(v-v_{2}), \quad v_{1}=-\sigma_{v}, \quad v_{2}=+\sigma_{v} $$
Then
$$ y_{i j}=x_{i}^{3}+v_{j} \qquad i=1,2 , j=1,2 $$
We take 2 samples each for $x$ and $v$. This gives 4 pairs $(x, y)$: $(x_1, y_{11}), (x_1, y_{12}), (x_2, y_{21}), (x_2, y_{22})$ (the 4 purple points in the original figure).

We assume that $x, y$ are jointly Gaussian. We then compute the mean and covariance from these 4 points and fit a Gaussian density to them (moment matching).

We also have the measurement $\hat{y}$, which cuts through this approximated Gaussian density. With $\hat{y}$ we can now carry out the probabilistic Kalman filter step.
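The procedure above can be sketched in a few lines: build the four $(x, y)$ pairs, match first and second moments, and apply the scalar conditioning formula. Function name and numbers are illustrative assumptions; equal weights of $\frac{1}{4}$ per pair are assumed.

```python
import numpy as np

def sampled_cubic_update(x_p, sigma_p, sigma_v, y_meas):
    """Sample-based filter step for y = x^3 + v using 2 x 2 Dirac components."""
    xs = np.array([x_p - sigma_p, x_p + sigma_p])      # samples of the prior
    vs = np.array([-sigma_v, sigma_v])                 # samples of the noise
    X, V = np.meshgrid(xs, vs, indexing="ij")          # all four combinations
    x_pts = X.ravel()
    y_pts = (X**3 + V).ravel()                         # y_ij = x_i^3 + v_j
    # Moment matching with equal weights 1/4
    x_mean, y_mean = x_pts.mean(), y_pts.mean()
    C_xx = np.mean((x_pts - x_mean) ** 2)
    C_xy = np.mean((x_pts - x_mean) * (y_pts - y_mean))
    C_yy = np.mean((y_pts - y_mean) ** 2)
    # Scalar Kalman filter step on the fitted joint Gaussian
    x_e = x_mean + C_xy / C_yy * (y_meas - y_mean)
    var_e = C_xx - C_xy**2 / C_yy
    return x_e, var_e

# Assumed values: prior N(1, 1), noise standard deviation 2, measurement 9
x_e, var_e = sampled_cubic_update(x_p=1.0, sigma_p=1.0, sigma_v=2.0, y_meas=9.0)
```

With only two samples per variable the fitted cross-covariance differs from the analytic value, so the result is an approximation of the analytic-moment filter rather than a replacement for it.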

Ensemble Kalman Filter (EnKF)

Fri, 15 Jul 2022 00:00:00 +0000

Motivation

Prediction step of the nonlinear Kalman filter (NLKF) $\rightarrow$ specifically the sample-based variant

Re-approximation by a Gaussian density $\rightarrow$ additional information is lost

If no measurements are available and several prediction steps follow one another $\rightarrow$ the approximation can be skipped temporarily

Filter step of the NLKF
$$ \begin{array}{l} \underline{\hat{x}}_{e}=\underline{\hat{x}}_{p}+\mathbf{C}_{x y} \mathbf{C}_{y y}^{-1}(\underline{\hat{y}}-E\{\underline{h}(\underline{x})\}) \\ \mathbf{C}_{e}=\mathbf{C}_{p}-\mathbf{C}_{x y} \mathbf{C}_{y y}^{-1} \mathbf{C}_{y x} \end{array} $$
where
$$ \begin{array}{ll} \mathbf{C}_{x x}=\mathbf{C}_{p} \in \mathbb{R}^{N \times N}\quad &\mathbf{C}_{x y} \in \mathbb{R}^{N \times M} \\ \mathbf{C}_{y x} \in \mathbb{R}^{M \times N}\quad &\mathbf{C}_{yy} \in \mathbb{R}^{M \times M} \end{array} $$
Regardless of the chosen form of moment computation $\rightarrow$ high cost for computing and storing the covariance matrices 🤪

Idea

Keep the samples after the prediction step $\rightarrow$ no re-approximation by a Gaussian

This preserves the shape information, and the uncertainty is stored in the samples.

Memory complexity

Kalman filter (KF)

Expected value: $N$

Covariance matrix: $\frac{N(N+1)}{2}$

$\Rightarrow$ In total $\frac{N^2 + 3N}{2}$

EnKF

One sample: $N$

$L$ samples: $L \cdot N$ (e.g. with sampling on the principal axes, $L = 2N \rightarrow 2N^2$)

But: saves the cost of computing the covariance matrix

🎯 Goal: recursive computation of the prediction step

Challenges

Given

$L$ Samples $\underline{x}_{k, i}, i = 1, \dots, L$

System mapping
$$ \underline{x}_{k+1} = \underline{a}_k(\underline{x}_k, \underline{w}_k) $$

Wanted: $L^\prime$ samples $\underline{x}_{k+1, i}, i = 1, \dots, L^\prime$

We need samples of $\underline{w}_k$: $\underline{w}_{k, j}, j = 1, \dots, Q$

‼️ Problem: mapping every combination of samples $\Rightarrow$ Cartesian product $\Rightarrow$ the number of samples grows exponentially under recursive prediction!

Proposed solution: limit the number of samples

Goal: adjustable number of samples $\rightarrow$ to keep the complexity under control

Simple case: constant number of samples over time

Approach 1: via reduction

Prior

Posterior (i.e. reduction of $\underline{x}_{k+1, i}$ )

requires $L \cdot Q$ mappings

but the result is often better

Approach 2: select a limited number of pairs with Latin hypercube sampling (LHS)

Each row and each column may contain ONLY one element

Optimal choice is difficult

The discrete quality measure is usually quite complicated

Trivial practical implementation: draw a (constant) number of samples from $\underline{w}_k$ for each $\underline{x}_{k, i}$ (but poor for few samples)

Arrangement
$$ \mathcal{X}_{k}=[\underbrace{\underline{x}_{k, 1}}_{\mathbb{R}^N}, \underline{x}_{k, 2}, \ldots, \underline{x}_{k, L}] \in \mathbb{R}^{N \times L}, \quad \mathcal{W}_{k}=\left[\underline{w}_{k, 1}, \underline{w}_{k, 2}, \ldots, \underline{w}_{k, L}\right] \in \mathbb{R}^{N \times L} $$

Each $\underline{x}_{k, i}$ and $\underline{w}_{k, j}$ is a vector.

Overload $\underline{a}_k$ to act on sample matrices:
$$ \mathcal{X}_{k+1} = \underline{a}_k(\mathcal{X}_{k}, \mathcal{W}_{k}) $$
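A minimal sketch of the overloaded prediction: each state sample is paired with exactly one noise sample, so the number of samples stays constant at $L$. The system map `a_k` below is a made-up nonlinear example, and the random draws stand in for whatever sampling scheme is actually used.

```python
import numpy as np

def enkf_predict(X_k, W_k, a_k):
    """Propagate column i of X_k together with column i of W_k through a_k."""
    return np.column_stack([a_k(x, w) for x, w in zip(X_k.T, W_k.T)])

def a_k(x, w):
    # Hypothetical nonlinear system map, for illustration only
    return np.array([x[0] + 0.1 * x[1], 0.9 * np.sin(x[1])]) + w

rng = np.random.default_rng(0)
X_k = rng.normal(size=(2, 5))            # L = 5 state samples, N = 2
W_k = 0.1 * rng.normal(size=(2, 5))      # one noise sample per state sample

X_next = enkf_predict(X_k, W_k, a_k)     # again N x L, no Gaussian re-approximation
```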

Filter Step

🎯 Goal

Perform the filter step ONLY with samples

Directly transform the prior samples into posterior samples

Avoid the update formulas for the covariance matrix

Represent the uncertainties purely by samples

Linear measurement mapping
$$ \underline{y}=\mathbf{H} \cdot \underline{x}+\underline{v} $$
For a given measurement $\underline{\hat{y}}$:
$$ \underbrace{\underline{\hat{y}}-\underline{v}}_{=:\hat{\mathcal{Y}}}=\mathbf{H} \cdot \underline{x} $$
Measurement sample set ($\underline{\mathbf{1}}$: all-ones vector of length $L$):
$$ \hat{\mathcal{Y}}=\underline{\hat{y}} \cdot \underline{\mathbf{1}}^{\top}-\mathcal{V} \qquad \mathcal{V}=\left[\underline{v}_{1}, \underline{v}_{2}, \ldots, \underline{v}_{L}\right] $$

The state update in “combination form” is then
$$ \mathcal{X}_{e}=(\mathbf{I}-\mathbf{K} \mathbf{H}) \mathcal{X}_{p}+\mathbf{K} \mathcal{\hat{Y}} $$

$\mathcal{X}$ and $\mathcal{Y}$ are matrices

This form is restricted to additive noise; the feedback form below, by contrast, carries over directly to a nonlinear measurement mapping $\underline{y}=\underline{h}(\underline{x}, \underline{v})$.

Alternative derivation

Predicted measurement samples based on the prior samples and noise samples:
$$ \mathcal{Y} = \mathbf{H} \cdot \mathcal{X}_p + \mathcal{V} $$

State update in “feedback form”
$$ \begin{aligned} \mathcal{X}_e &= \mathcal{X}_p + \mathbf{K}(\underbrace{\underline{\hat{y}} \cdot \underline{\mathbf{1}}^\top}_{\text{measured}} - \underbrace{\mathcal{Y}}_{\text{prediction}}) \\\\ &= \mathcal{X}_p + \mathbf{K}(\underline{\hat{y}} \cdot \underline{\mathbf{1}}^\top - \mathbf{H} \mathcal{X}_p - \mathcal{V})\\\\ &= (\mathbf{I} - \mathbf{K}\mathbf{H})\mathcal{X}_p + \mathbf{K}(\underbrace{\underline{\hat{y}} \cdot \underline{\mathbf{1}}^\top - \mathcal{V}}_{=\hat{\mathcal{Y}}}) \end{aligned} $$
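The equivalence of the two forms is easy to confirm numerically. The notes do not state how $\mathbf{K}$ is obtained in the sample-based setting; the sketch below assumes it is computed from sample covariances of the prior and predicted measurement samples, which is one common choice. All names and numbers are illustrative.

```python
import numpy as np

def enkf_update(X_p, H, V, y_meas):
    """Feedback-form EnKF update; columns of X_p and V are samples."""
    L = X_p.shape[1]
    Y = H @ X_p + V                                   # predicted measurement samples
    X_c = X_p - X_p.mean(axis=1, keepdims=True)       # centered state samples
    Y_c = Y - Y.mean(axis=1, keepdims=True)           # centered measurement samples
    # Assumption: gain from sample covariances, K = C_xy C_yy^{-1}
    C_xy = X_c @ Y_c.T / (L - 1)
    C_yy = Y_c @ Y_c.T / (L - 1)
    K = np.linalg.solve(C_yy, C_xy.T).T
    X_e = X_p + K @ (y_meas[:, None] - Y)             # feedback form
    return X_e, K

rng = np.random.default_rng(1)
X_p = rng.normal(size=(2, 50))                        # prior samples (N = 2, L = 50)
H = np.array([[1.0, 0.5]])
V = 0.3 * rng.normal(size=(1, 50))                    # measurement noise samples
y_meas = np.array([0.7])

X_e, K = enkf_update(X_p, H, V, y_meas)
Y_hat = y_meas[:, None] - V                           # measurement sample set
X_comb = (np.eye(2) - K @ H) @ X_p + K @ Y_hat        # combination form
```

Both forms produce the same posterior samples for any fixed gain, since the second is just an algebraic rearrangement of the first.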

General Systems

Sun, 17 Jul 2022 00:00:00 +0000

Motivation

Sun, 24 Jul 2022 00:00:00 +0000

So far: systems were always represented by Gaussian densities.

System equation
$$ \underline{x}_{k+1} = \underline{a}_k (\underline{x}_k, \underline{w}_k) $$
can be described by the transition density $f(\underline{x}_{k+1} | \underline{x}_k)$.

Measurement equation
$$ \underline{y}_k = \underline{h}_k (\underline{x}_k, \underline{v}_k) $$
can be described by the likelihood $f(\underline{y}_k | \underline{x}_k)$.

In general, for both equations:
$$ \underline{z} = \underline{h}(\underline{x}, \underline{v}) $$
For linear systems, a representation by a Gaussian density $\mathcal{N}(x, \mu, \sigma)$ is fine
$$ z = Hx + v $$
Expected value
$$ E(z | x) = Hx $$
Covariance
$$ \operatorname{Cov}(z \mid x)=E\left(\left[z-E(z|x)\right]^{2} \mid x\right)=\sigma_{v}^{2} $$
Hence
$$ \begin{aligned} f(z \mid x) &=\mathcal{N}\left(z, H \cdot x, \sigma_{v}\right) \\\\ & \propto \exp \left(-\frac{1}{2} \frac{(z-H \cdot x)^{2}}{\sigma_{v}^{2}}\right) \quad | \text { Gaussian in } z \\\\ & \propto \exp \left\{-\frac{1}{2} \frac{\left(x-\frac{z}{H}\right)^{2}}{\left(\sigma_{v} / H\right)^{2}}\right\} \quad | \text { Gaussian in } x \\\\ & \propto \mathcal{N}\left(x, \frac{z}{H}, \frac{\sigma_v}{|H|}\right) \end{aligned} $$
But in general, for
$$ z = h(x) + v \tag{additive noise} $$
$f(z|x)$ is NOT Gaussian in $x$!

We need a method for computing $f(z|x)$ in the general case. 💪

Dirac Delta Function

Sun, 24 Jul 2022 00:00:00 +0000

For more on the Dirac delta function, see:

Properties

Symmetry
$$ \delta (x) = \delta (-x) $$
Scaling
$$ \delta (ax) = \frac{1}{|a|}\delta (x) $$
Complicated arguments
$$ \delta (g(x)) = \sum_{i=1}^N \frac{1}{|g^\prime(x_i)|}\delta (x - x_i) $$
where

$g(x_i) = 0$ (i.e. the $x_i$ are the roots of $g$, $i = 1, 2, \dots, N$)

$g^\prime(x_i) \neq 0$

Derivative of the Heaviside step function
$$ \delta(x) = \frac{d}{dx} H(x) $$
where $H(x)$ is the Heaviside step function
$$ H(x):= \begin{cases}1, & x>0 \\ 0, & x \leq 0\end{cases} $$

Functions of Random Variables

Sun, 24 Jul 2022 00:00:00 +0000

Mapping
$$ y = h(x) $$

Given: $x \sim f_x(x)$

Wanted: $y \sim f_y(y)$

Joint density
$$ f_{xy}(x, y) = f(y | x) \cdot f_x(x) $$
$f(y|x)$ can be regarded as a probabilistic description of the mapping.

Density of $y$
$$ f_{y}(y)=\int_{\mathbb{R}} f_{x y}(x, y) d x=\int_{\mathbb{R}} f(y \mid x) \cdot f_{x}(x) d x $$
Probabilistic mapping:
$$ f(y|x) = \delta(y - h(x)) $$
It follows that
$$ f_y(y)=\int_{\mathbb{R}} \delta(\underbrace{y-h(x)}_{g(x)}) f_x(x) d x $$
Examples

Example 1

Given
$$ y = \frac{1}{x} \qquad x \sim f_x(x) $$
Probabilistic mapping:
$$ f(y|x) = \delta(\underbrace{y - \frac{1}{x}}_{=g(x)}) $$ $$ g(x_1) = 0 \Rightarrow x_1 = \frac{1}{y} \qquad g^\prime(x) = \frac{1}{x^2} $$
According to
$$ \delta (g(x)) = \sum_{i=1}^N \frac{1}{|g^\prime(x_i)|}\delta (x - x_i) $$
we have
$$ f(y|x) = \delta(y - \frac{1}{x}) = x_1^2 \delta(x - \frac{1}{y}) = \frac{1}{y^2} \delta(x - \frac{1}{y}) $$ $$ \begin{aligned} f_y(y) &= \int_{\mathbb{R}} f(y | x) \cdot f_{x}(x) d x \\ &= \int_{\mathbb{R}} \frac{1}{y^2} \delta(x - \frac{1}{y}) f_{x}(x) d x \\ &= \frac{1}{y^2} f_{x}(\frac{1}{y}) \end{aligned} $$
For example, if $x$ is Gaussian with $f_x(x) = \frac{1}{\sqrt{\pi}} e^{-x^2}$ (zero mean, variance $\frac{1}{2}$), the density of $y$ follows immediately:
$$ f_y(y) = \frac{1}{\sqrt{\pi}\, y^2} e^{-\frac{1}{y^2}} $$
Example 2: Quadratic Function
$$ \begin{aligned} &\delta(g(x))=\delta\left(y-a x^{2}\right), \quad a>0 \\ &\Rightarrow g(x)=y-a x^{2} \\ &\Rightarrow g^{\prime}(x)=-2 a x \end{aligned} $$
Case distinction
$$ \begin{aligned} &g\left(x_{i}\right)=0 \\ &y \geq 0: N=2, \quad x_{1}=\sqrt{\frac{y}{a}}, \quad x_{2}=-\sqrt{\frac{y}{a}} \\ &y<0: N=0, \quad \text { no roots. } \end{aligned} $$ $$ \begin{aligned} f(y|x) &= \delta\left(y-a x^{2}\right) \\\\ &= \begin{cases}\frac{1}{\left|g^{\prime}\left(x_{1}\right)\right|} \delta\left(x-x_{1}\right)+\frac{1}{\left|g^{\prime}\left(x_{2}\right)\right|} \delta\left(x-x_{2}\right) & , y \geq 0 \\ 0 & , y<0\end{cases} \\\\ &= \begin{cases}\frac{1}{2 \cdot \sqrt{a y}}\left(\delta\left(x-\sqrt{\frac{y}{a}}\right)+\delta\left(x+\sqrt{\frac{y}{a}}\right)\right) & , y \geq 0 \\ 0 & , y<0\end{cases} \\\\ \end{aligned} $$ $$ \begin{aligned} f_y(y) &= \int_{\mathbb{R}} f(y | x) \cdot f_{x}(x) d x \\ &= \frac{1}{2 \sqrt{a y}}\left\{f_{x}\left(-\sqrt{\frac{y}{a}}\right)+f_x\left(\sqrt{\frac{y}{a}}\right)\right\} \cdot u(y) \qquad u(y)= \begin{cases}1, & y \geqslant 0 \\ 0, & \text { sonst }\end{cases} \end{aligned} $$
For $f_x(x) = \mathcal{N}(x, 0, \sigma)$:
$$ f_{y}(y)=\frac{1}{\sqrt{2 \pi a y}\, \sigma} \exp \left\{-\frac{1}{2} \frac{y}{a \sigma^{2}}\right\} u(y) $$
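The general formula for $f_y(y)$ under $y = a x^2$ can be evaluated for any prior density. The sketch below uses hypothetical helper names and treats only $y > 0$ as the support, which sidesteps the division by zero at $y = 0$. For $a = 1$ and a standard normal prior it should reproduce the chi-square density with one degree of freedom.

```python
import numpy as np

def density_of_scaled_square(y, a, f_x):
    """Density of y = a * x^2 (a > 0) given the density f_x of x."""
    y = np.asarray(y, dtype=float)
    out = np.zeros_like(y)
    pos = y > 0                                    # no roots for y < 0
    root = np.sqrt(y[pos] / a)                     # roots x = +/- sqrt(y / a)
    out[pos] = (f_x(root) + f_x(-root)) / (2.0 * np.sqrt(a * y[pos]))
    return out

def std_normal(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

vals = density_of_scaled_square(np.array([-1.0, 1.0]), a=1.0, f_x=std_normal)
closed_form = np.exp(-0.5) / np.sqrt(2.0 * np.pi)  # 1/(sqrt(2 pi a y) sigma) exp(-y/(2 a sigma^2)) at y = 1
```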

Probabilistic System Models

Sun, 24 Jul 2022 00:00:00 +0000

With Additive Noise

In general:
$$ \underline{z} = \underline{a}(\underline{x}) + \underline{v} $$
$\Rightarrow$
$$ f(\underline{z} \mid \underline{x})=f_v(\underline{z}-\underline{a}(\underline{x})) $$
Example:
$$ z = x^2 + v \qquad v \sim f_v(v) $$
Wanted: $f(z|x)$
$$ f(z \mid x, v)=\delta\left(z-x^{2}-v\right), \quad f(z, v \mid x)=f(z \mid x, v) \cdot f_v(v) $$ $$ \begin{aligned} f(z \mid x) &\overset{\text{Marginalisierung}}{=}\int_{\mathbb{R}} f(z, v \mid x) d v\\ &=\int_{\mathbb{R}} f(z \mid x, v) \cdot f_v(v) d v \\ &=\int_{\mathbb{R}} \delta\left(z-x^{2}-v\right) \cdot f_v(v) d v \\ &=f_{v}\left(z-x^{2}\right) \end{aligned} $$
In the case
$$ z = x_{k + 1} \quad x = x_{k}, $$
the density
$$ f(z \mid x) = f(x_{k+1} \mid x_k) = f_v(x_{k+1} - a(x_k)) \tag{additive} $$
is called the transition density.

With Multiplicative Noise

Mapping
$$ z = x \cdot v \quad v \sim \mathcal{N}(v, 0, \sigma_v) $$
Assumption: $z, x, v$ are positive.

Wanted: $f(z \mid x)$

Reduction to the additive case using $\log(\cdot)$:
$$ \underbrace{\log (z)}_{\bar{z}}=\log (x \cdot v)=\underbrace{\log (x)}_{\bar{x}}+\underbrace{\log (v)}_{\bar{v}} \Leftrightarrow \bar{z}=\bar{x}+\bar{v} $$
Density of $\bar{v} = \log(v)$:
$$ f(\bar{v} \mid v) = \delta(\bar{v} - \log(v)) = \exp(\bar{v})\delta(v - \exp(\bar{v})) $$ $$ \begin{aligned} f_\bar{v}(\bar{v}) &= \int_{\mathbb{R}} f(\bar{v} \mid v) f_v(v) dv \\\\ &= \int_{\mathbb{R}} \exp(\bar{v})\delta(v - \exp(\bar{v})) f_v(v) dv \\\\ &= \exp(\bar{v}) f_v(\exp(\bar{v})) \\\\ &= \frac{1}{\sqrt{2 \pi} \sigma_{v}} \exp (\bar{v}) \exp\left\{-\frac{1}{2} \frac{[\exp(\bar{v})]^{2}}{\sigma_{v}^{2}}\right\} \end{aligned} $$
Then
$$ \begin{aligned} f(\bar{z} \mid \bar{x}) &= f_\bar{v}(\bar{z} - \bar{x}) \\ &= \frac{1}{\sqrt{2 \pi} \sigma_{v}} \exp \{\bar{z} - \bar{x}\} \exp\left\{-\frac{1}{2} \frac{[\exp(\bar{z} - \bar{x})]^{2}}{\sigma_{v}^{2}}\right\} \end{aligned} $$ $$ \begin{aligned} z = \exp\{\bar{z}\} &\Rightarrow g(\bar{z}) = z - \exp(\bar{z}) \\ &\Rightarrow g^{\prime}(\bar{z}) = -\exp(\bar{z}) \quad \text{Nullstelle}: \bar{z} = \log(z) \end{aligned} $$ $$ f(z \mid \bar{x}) = \frac{1}{|z|} f(\log(z) \mid \bar{x}) $$
$x = \exp(\bar{x}) \Rightarrow$
$$ f(z \mid x)=\frac{1}{\sqrt{2 \pi} \sigma_{v}} \frac{1}{|x|} \exp \left\{-\frac{1}{2} \frac{z^{2}}{\sigma_{v}^{2} x^{2}}\right\} $$
Direct solution:
$$ f(z \mid x, v) = \delta(z - x \cdot v) $$ $$ f(z, v \mid x) = f(z \mid x, v) \cdot f_v(v) = \delta(z - x \cdot v) f_v(v) $$ $$ f(z \mid x) = \int_{\mathbb{R}} f(z, v \mid x) dv = \int_{\mathbb{R}}\delta(z - x \cdot v) f_v(v) dv $$
Set
$$ \begin{aligned} g(v) := z - xv &\Rightarrow g^\prime(v) = -x, \quad \text{root } v = \frac{z}{x} \end{aligned} $$
Hence
$$ \begin{aligned} f(z \mid x)&=\int_{\mathbb{R}} \frac{1}{|x|} \delta\left(v-\frac{z}{x}\right) \cdot f_v(v) d v \\ &=\frac{1}{|x|} \cdot f_v\left(\frac{z}{x}\right) \qquad \qquad (\text{multiplicative}) \end{aligned} $$
Mixed Additive and Multiplicative Noise (Script Chp. 9.2.2)

System equation
$$ x_{k+1} = x_k v_k + w_k $$
with additive noise $w_k$ and multiplicative noise $v_k$. The noise terms are jointly distributed according to $f_{k}^{vw}(v_k, w_k)$.

The joint density of the state at time step $k+1$ is
$$ f\left(x_{k+1}, v_{k}, w_{k} \mid x_{k}\right)=f\left(x_{k+1} \mid x_{k}, v_{k}, w_{k}\right) f_{k}^{v w}\left(v_{k}, w_{k}\right), $$
where according to the system equation the density of the state at time step $k + 1$ conditioned on the state at time step $k$ and the noise terms $v_k$ and $w_k$ is
$$ f(x_{k+1} \mid x_{k}, v_{k}, w_{k}) = \delta(x_{k+1} - x_{k}v_{k} - w_{k}). $$
The desired transition density is now given by
$$ \begin{aligned} f\left(x_{k+1} \mid x_{k}\right) &=\int_{\mathbb{R}} \int_{\mathbb{R}} f\left(x_{k+1}, v_{k}, w_{k} \mid x_{k}\right) \mathrm{d} w_{k} \mathrm{~d} v_{k} \\ &=\int_{\mathbb{R}} \int_{\mathbb{R}} \delta\left(x_{k+1}-x_{k} v_{k}-w_{k}\right) f_{k}^{v w}\left(v_{k}, w_{k}\right) \mathrm{d} w_{k} \mathrm{~d} v_{k}\\ &=\int_{\mathbb{R}} f_{k}^{v w}\left(v_{k}, x_{k+1}-x_{k} v_{k}\right) \mathrm{d} v_{k} \\ &=\int_{\mathbb{R}} f_{k}^{v}\left(v_{k}\right) f_{k}^{w}\left(x_{k+1}-x_{k} v_{k}\right) \mathrm{d} v_{k} \qquad (v_k, w_k \text{ independent}) \end{aligned} $$
These expressions cannot in general be solved analytically.
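When no closed form is available, the remaining one-dimensional integral over $v_k$ can be evaluated numerically. The sketch below assumes independent Gaussian noises, a case chosen only because it does have a closed form to compare against: $x_k v_k + w_k$ is then Gaussian with mean $x_k \mu_v$ and variance $x_k^2\sigma_v^2 + \sigma_w^2$. Names, grid size, and numbers are illustrative assumptions.

```python
import numpy as np

def gauss(x, mean, sigma):
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def transition_density_mixed(x_next, x_k, mu_v, sigma_v, sigma_w, n_grid=4001):
    """Numerically evaluate f(x_{k+1} | x_k) = int f_v(v) f_w(x_{k+1} - x_k v) dv."""
    v = np.linspace(mu_v - 10.0 * sigma_v, mu_v + 10.0 * sigma_v, n_grid)
    dv = v[1] - v[0]
    integrand = gauss(v, mu_v, sigma_v) * gauss(x_next - x_k * v, 0.0, sigma_w)
    return np.sum(integrand) * dv                  # simple Riemann sum

numeric = transition_density_mixed(x_next=1.3, x_k=2.0, mu_v=1.0, sigma_v=0.2, sigma_w=0.5)
# Closed form for this special case: N(x_k * mu_v, x_k^2 sigma_v^2 + sigma_w^2)
closed = gauss(1.3, 2.0 * 1.0, np.sqrt((2.0 * 0.2) ** 2 + 0.5**2))
```

For non-Gaussian noise densities only the two `gauss` factors in the integrand need to be swapped out; the comparison value is then no longer available.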

Abstraction

Wed, 27 Jul 2022 00:00:00 +0000

Script 10.1, 10.2

Abstract System Description & Properties

All components of a system can be described by a conditional density $f(\underline{b} \mid \underline{a})$ ($\underline{a} \in \mathbb{R}^A, \underline{b}\in \mathbb{R}^B$).

Causality: $\underline{a}$ (cause) produces $\underline{b}$ (effect).

For $\underline{a}$ given, $f(\underline{b} \mid \cdot)$ is called the transition density.

For $\underline{b}$ given, $f(\cdot \mid \underline{a})$ is called the likelihood.

Properties of the probabilistic system description

In general,
$$ \int_{\mathbb{R}^{B}} f(\underline{b} \mid \underline{a}) d \underline{b}=1 \quad \forall \underline{a} $$
However, in general
$$ \int_{\mathbb{R}^{A}} f(\underline{b} \mid \underline{a}) d \underline{a} \neq 1, $$
and the integral may not even be defined.

Forward / Backward Inference

Forward inference

“Given information about $\underline{a}$, we desire information about $\underline{b}$.”

Given: a value $\underline{\hat{a}}$ or a density $f(\underline{a})$

Wanted: $f(\underline{b})$

Backward inference

“Information about the output $\underline{b}$ is given and we desire to reconstruct an appropriate description of $\underline{a}$.”

Given: a value $\underline{\hat{b}}$ or a density $f(\underline{b})$

Wanted: $f(\underline{a})$

Forward Inference

Exercise sheet, problem 9.1

Assumption: NO prior knowledge about $f(\underline{b})$

Consider a simple generative system mapping:
$$ \underline{b} = \underline{g}(\underline{a}) \quad \underline{a} \in \mathbb{R}^A, \underline{b} \in \mathbb{R}^B $$
Probabilistic system mapping:
$$ f(\underline{b} \mid \underline{a}) = \delta(\underline{b} - \underline{g}(\underline{a})) $$
Marginalization yields:
$$ \begin{aligned} f(\underline{b}) &= \int_{\mathbb{R}^A} f(\underline{a}, \underline{b}) d\underline{a} \\\\ &= \int_{\mathbb{R}^A} f(\underline{b} \mid \underline{a}) f(\underline{a}) d\underline{a} \\\\ &= \int_{\mathbb{R}^A} \delta(\underline{b} - \underline{g}(\underline{a})) f(\underline{a}) d\underline{a} \end{aligned} $$
Further simplification is possible ONLY for a concrete $\underline{g}(\cdot)$.

For the special case that a single value $\underline{\hat{a}}$ is given, we obtain
$$ f(\underline{a}) = \delta(\underline{a} - \underline{\hat{a}}) $$
Hence
$$ \begin{aligned} f(\underline{b}) &= \int_{\mathbb{R}^A} \delta(\underline{b} - \underline{g}(\underline{a})) f(\underline{a}) d\underline{a} \\\\ &= \int_{\mathbb{R}^A} \delta(\underline{b} - \underline{g}(\underline{a})) \delta(\underline{a} - \underline{\hat{a}}) d\underline{a} \\\\ &= \delta(\underline{b} - \underbrace{\underline{g}(\underline{\hat{a}})}_{=: \underline{\hat{b}}}) \end{aligned} $$
This is the expected result
$$ f(\underline{b}) = \delta(\underline{b} - \underline{\hat{b}}) $$
with $\underline{\hat{b}} = \underline{g}(\underline{\hat{a}})$.

Probabilistisches nichtlineares Systemmodell

Allgemeines Systemmodell
$$ \underline{x}_{k+1} = \underline{a}_k(\underline{x}_k, \underline{w}_k) $$
in Form $f(\underline{b} \mid \underline{a})$ bringen:
$$ \underline{a}=\left[\begin{array}{c} \underline{x}_{k} \\ \underline{w}_{k} \end{array}\right], \quad \underline{b}=\underline{x}_{k+1} $$ $$ f(\underline{b} \mid \underline{a}) = \delta \left(\underline{x}_{k+1} - \underline{a}_k(\underline{x}_k, \underline{w}_k)\right) $$
Mit anderen Systemgrenzen:

$$ \begin{aligned} f(\underline{b} \mid \underline{a}^\prime) &= f(\underline{x}_{k+1} \mid \underline{x}_k) \\\\ &= \int_{\mathbb{R}^N} \underbrace{f(\underline{x}_{k+1} \mid \underline{x}_k, \underline{w}_k)}_{f(\underline{b} \mid \underline{a})} \cdot f(\underline{w}_k) d\underline{w}_k \end{aligned} $$
In this case $f(\underline{b} \mid \underline{a}^\prime)$ contains the system noise $\rightarrow$ it can no longer be described by a $\delta$ function.

Prädiktion nichtlinearer Systeme

Wed, 27 Jul 2022 00:00:00 +0000

Skript 10.2, 10.3

Chapman-Kolmogorov-Gleichung

Übungsblatt Aufg. 10.1

Joint density
$$ f\left(\underline{x}_{k+1}, \underline{x}_{k}\right)=f\left(\underline{x}_{k+1} \mid \underline{x}_{k}\right) \cdot f\left(\underline{x}_{k}\right) $$
Marginalization
$$ f\left(\underline{x}_{k+1}\right)=\int_{\mathbb{R}^{N}} f\left(\underline{x}_{k+1} \mid \underline{x}_{k}\right) \cdot f\left(\underline{x}_{k}\right) d \underline{x}_{k} $$
Definition

Estimated density at time step $k$, including the latest measurement
$$ f_{k}^{e}\left(\underline{x}_{k}\right)=f\left(\underline{x}_{k} \mid \underline{\hat{y}}_{k}, \underline{\hat{y}}_{k-1}, \ldots, \underline{\hat{y}}_{1}, \underline{\hat{u}}_{k-1}, \underline{\hat{u}}_{k-2}, \ldots, \underline{\hat{u}}_{0}\right) $$

Predicted density at time step $k+1$ (measurement at $k+1$ not included)
$$ f_{k+1}^{p}\left(\underline{x}_{k+1}\right)=f\left(\underline{x}_{k+1} \mid \underline{\hat{y}}_{k}, \underline{\hat{y}}_{k-1}, \ldots, \underline{\hat{y}}_{1}, \underline{\hat{u}}_{k}, \underline{\hat{u}}_{k-1}, \ldots, \underline{\hat{u}}_{0}\right) $$

Prediction for dynamic systems (Chapman-Kolmogorov equation)
$$ f_{k+1}^{p}\left(\underline{x}_{k+1}\right)=\int_{\mathbb{R}^{N}} \underbrace{f\left(\underline{x}_{k+1} \mid \underline{x}_{k}\right)}_{\text{transition density}} f_{k}^{e}\left(\underline{x}_{k}\right) \mathrm{d} \underline{x}_{k} $$
Explanation

Exercise A10.1

The Chapman-Kolmogorov equation computes the density of $\underline{x}_{k+1}$ from a given density $f_{k}^{e}\left(\underline{x}_{k}\right)$ of $\underline{x}_{k}$, whereas the probabilistic system description $f\left(\underline{x}_{k+1} \mid \underline{x}_{k}\right)$ yields the density of $\underline{x}_{k+1}$ for one concrete value of $\underline{x}_{k}$.

$f\left(\underline{x}_{k+1} \mid \underline{x}_{k}\right)$

The probabilistic system model, which returns a probability density for the next state $\underline{x}_{k+1}$ given the current state $\underline{x}_{k}$.

This transition density $f\left(\underline{x}_{k+1} \mid \underline{x}_{k}\right)$ can be computed from the given system model $\underline{x}_{k+1} = \underline{a}(\underline{x}_{k}, \underline{u}_{k}, \underline{v}_{k})$; it is simply its probabilistic representation
$$ f\left(\underline{x}_{k+1} \mid \underline{x}_{k}\right) = \int_{\mathbb{R}^N} \delta(\underline{x}_{k+1} - \underline{a}(\underline{x}_{k}, \underline{u}_{k}, \underline{v}_{k})) \cdot f_k^v(\underline{v}_k) d \underline{v}_k $$

$f_{k}^{e}\left(\underline{x}_{k}\right)$
The best estimate we have of the system state at time $k$, given as a probability density.

$f_{k+1}^{p}\left(\underline{x}_{k+1}\right)$

The best prediction of the state at time $(k+1)$, computed from the knowledge about the state $f_{k}^{e}\left(\underline{x}_{k}\right)$ and the system model $\underline{x}_{k+1} = \underline{a}(\underline{x}_{k}, \underline{u}_{k}, \underline{v}_{k})$ (generative representation) or $f\left(\underline{x}_{k+1} \mid \underline{x}_{k}\right)$ (probabilistic representation).

A prediction generally increases the (relative) uncertainty.

Problem

‼️ This is a parameter integral!

The integrand depends on $\underline{x}_{k+1}$ (which in general cannot be pulled out of the integral)

A closed-form result is possible only if the integral has an analytical solution

Otherwise the integral has to be solved (numerically) for every $\underline{x}_{k+1}$

Another useful form of the CK equation
$$ \begin{aligned} f(\underline{x}_{k + 2}, \underline{x}_{k}) &= \int_{\mathbb{R}^N} f(\underline{x}_{k+2}, \underline{x}_{k+1}, \underline{x}_{k}) d\underline{x}_{k+1} \\\\ f(\underline{x}_{k + 2} \mid \underline{x}_{k}) f(\underline{x}_{k}) &= \int_{\mathbb{R}^N} f(\underline{x}_{k+2} \mid \underline{x}_{k+1}, \underline{x}_{k}) f(\underline{x}_{k+1}, \underline{x}_{k}) d\underline{x}_{k+1} \quad | \quad \text{Markov} \\\\ f(\underline{x}_{k + 2} \mid \underline{x}_{k}) f(\underline{x}_{k}) &= \int_{\mathbb{R}^N} f(\underline{x}_{k+2} \mid \underline{x}_{k+1}) f(\underline{x}_{k+1}, \underline{x}_{k}) d\underline{x}_{k+1} \\\\ f(\underline{x}_{k + 2} \mid \underline{x}_{k}) f(\underline{x}_{k}) &= \int_{\mathbb{R}^N} f(\underline{x}_{k+2} \mid \underline{x}_{k+1}) f(\underline{x}_{k+1} \mid \underline{x}_{k}) f(\underline{x}_{k})d\underline{x}_{k+1} \\\\ f(\underline{x}_{k + 2} \mid \underline{x}_{k}) &= \int_{\mathbb{R}^N} f(\underline{x}_{k+2} \mid \underline{x}_{k+1}) f(\underline{x}_{k+1} \mid \underline{x}_{k}) d\underline{x}_{k+1} \end{aligned} $$
Prediction with the CK equation: solution approaches

In the general case the CK equation CANNOT be solved exactly 🤪

Exceptions (examples)

The system is linear and $f_{k}^{e}(\cdot)$ can be described by its first two moments

$f_{k}^{e}(\underline{x}_k)$ is represented by samples.
$$ \begin{aligned} & f_{k}^{e}\left(\underline{x}_{k}\right)=\sum_{i=1}^{L} w_{i} \delta\left(\underline{x}_{k}-\hat{\underline{x}}_{k, i}\right) \qquad w_i \geq 0, \sum_i w_i = 1\\ \Rightarrow \qquad & f_{k+1}^{p}\left(\underline{x}_{k+1}\right)=\sum_{i=1}^{L} w_{i} f\left(\underline{x}_{k+1} \mid \hat{\underline{x}}_{k, i}\right) \end{aligned} $$
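The sample-based case can be sketched numerically. The following is a minimal illustration, assuming a scalar state, additive zero-mean Gaussian noise, and a made-up system function (`np.sin`); the names `gauss` and `predict_from_samples` are hypothetical helpers, not part of the lecture material.

```python
import numpy as np

def gauss(z, mean, var):
    """Scalar Gaussian density N(z; mean, var)."""
    return np.exp(-0.5 * (z - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def predict_from_samples(z, samples, weights, a, noise_var):
    """Evaluate f^p(z) = sum_i w_i f(z | x_i), with f(z | x) = N(z; a(x), noise_var)."""
    z = np.atleast_1d(z)[:, None]                   # shape (len(z), 1)
    means = a(np.asarray(samples))[None, :]         # shape (1, L)
    w = np.asarray(weights)[None, :]
    return (w * gauss(z, means, noise_var)).sum(axis=1)

# Assumed example system: x_{k+1} = sin(x_k) + w_k, w_k ~ N(0, 0.1)
samples = np.array([-1.0, 0.0, 2.0])
weights = np.array([0.2, 0.5, 0.3])
z_grid = np.linspace(-3.0, 3.0, 2001)
f_pred = predict_from_samples(z_grid, samples, weights, np.sin, 0.1)

# Riemann-sum check that the predicted density integrates to (about) one
mass = f_pred.sum() * (z_grid[1] - z_grid[0])
```

The integral of the CK equation disappears here because each Dirac component simply selects one value of the transition density.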

Simplified Prediction

System model with additive noise

We start with additive noise.

Generative model
$$ \underline{x}_{k+1} = \underline{a}_{k}(\underline{x}_{k}) + \underline{w}_{k} \qquad \underline{x}_{k+1}, \underline{x}_{k}, \underline{w}_{k} \in \mathbb{R}^N $$
Simplified notation
$$ \underline{z} = \underline{a}(\underline{x}) + \underline{w} $$
Probabilistic model (incl. noise)
$$ f(\underline{z} \mid \underline{x}) = f_w(\underline{z} - \underline{a}(\underline{x})) $$
Simplification: split into discrete “strips”:
$$ f\left(\underline{z} \mid \underline{\hat{x}}_{i}\right)=f_w\left(\underline{z}-\underline{a}\left(\underline{\hat{x}}_{i}\right)\right) \qquad i \in \mathbb{Z} $$
In the “gaps” between the strips, however, $\int f(\underline{z} \mid \underline{x}) d\underline{z} = 1$ does NOT hold. We therefore define a “filling function” $f_i(\underline{x})$:
$$ f_i(\underline{x}) = \mathcal{N}(\underline{x}, \underline{\hat{x}}_i, C_i) \qquad i \in \mathbb{Z} $$
with
$$ f(\underline{x}) = \sum_{i \in \mathbb{Z}} w_if_i(\underline{x}) \approx 1 $$

E.g. scalar case
$$ f(x)=\sum_{i \in \mathbb{Z}} w_{i} f_i(x), \quad f_i(x)=\exp \left(-\frac{1}{2} \frac{\left(x-\hat{x}_{i}\right)^{2}}{\sigma^{2}}\right) $$
with a suitable $\sigma$.

Consider one component $i$ at a time
$$ f_i(\underline{z} \mid \underline{x}) = f(\underline{z} \mid \underline{\hat{x}}_i) \cdot f_i(\underline{x}) $$
The overall density is
$$ f(\underline{z} \mid \underline{x}) \approx \sum_{i \in \mathbb{Z}} w_i f(\underline{z} \mid \underline{\hat{x}}_i) \cdot f_i(\underline{x}) $$
It holds that
$$ \begin{aligned} \int_{\mathbb{R}^{N}} f(\underline{z} \mid \underline{x}) d \underline{z}&\approx\sum_{i \in \mathbb{Z}} w_{i} f_{i}(\underline{x}) \underbrace{\int_{\mathbb{R}^N}f(\underline{z} \mid \underline{\hat{x}}_i) d\underline{z}}_{=1}\\ &=\sum_{i \in \mathbb{Z}} w_{i} f_{i}(\underline{x}) \approx 1 \end{aligned} $$
If the noise $\underline{w}_k$ is Gaussian, $f(\underline{z} \mid \underline{x} )$ is a Gaussian mixture
$$ f(\underline{z} \mid \underline{x}) = \sum_{i \in \mathbb{Z}} w_i \underbrace{f_i^z(\underline{z})}_{f_w(\underline{z} - \underline{a}(\underline{\hat{x}}_i))} \cdot f_i^x(\underline{x}) $$
General system models
$$ \underline{x}_{k+1} = \underline{a}_k (\underline{x}_k, \underline{w}_k) $$
Simplified notation:
$$ \underline{z} = \underline{a}(\underline{x}, \underline{w}) $$
This yields a general transition density $f(\underline{z} | \underline{x})$, which can also be approximated by a mixture
$$ f(\underline{z} | \underline{x}) = \sum_{i \in \mathbb{Z}} w_i f_i^z(\underline{z}) \cdot f_i^x(\underline{x}) $$
The key point is that the individual components are decoupled: each one is a product of a factor in $\underline{z}$ and a factor in $\underline{x}$. 👏

Example 1

Assumption: $f(\underline{x})$ is a Gaussian density.

Inserting into the CK equation:
$$ \begin{aligned} f(\underline{z})&=\int_{\mathbb{R}^{N}}\left(\sum_{i \in \mathbb{Z}} w_{i} f_{i}^{z}(\underline{z}) \cdot f_{i}^{x}(\underline{x})\right) f(\underline{x}) d \underline{x}\\ &=\sum_{i \in \mathbb{Z}} w_{i} f_{i}^{z}(\underline{z}) \underbrace{\int_{\mathbb{R}^{N}} f_{i}^{x}(\underline{x}) \cdot f(\underline{x}) d \underline{x}}_{\text{constant } c_{i}}\\ &= \sum_{i \in \mathbb{Z}} \underbrace{w_{i} c_i}_{=: \bar{w}_i} f_{i}^{z}(\underline{z})\\ &=\sum_{i \in \mathbb{Z}} \bar{w}_{i} f_{i}^{z}(\underline{z}) \end{aligned} $$

Here we see that $f_i^z(\underline{z})$ can simply be pulled out of the integral, because the remaining integrand depends only on $\underline{x}$.

Special case: $f(\underline{x}) = \delta(\underline{x} - \underline{\hat{x}}) \Rightarrow$
$$ c_i = \int_{\mathbb{R}^{N}} f_{i}^{x}(\underline{x}) \delta \left(\underline{x}-\underline{\hat{x}}\right) d \underline{x}=f_i^{x}(\underline{\hat{x}}) $$
Example 2

Assumption: Gaussian mixture
$$ f(\underline{x}) = \sum_{j=1}^L v_j f_j^*(\underline{x}) $$
Inserting into the CK equation:
$$ \begin{aligned} f(\underline{z})&=\int_{\mathbb{R}^{N}}\left\{\sum_{i \in \mathbb{Z}} w_{i} f_{i}^{z}(\underline{z}) f_{i}^{x}(\underline{x})\right\} \cdot \left\{\sum_{j=1}^{L} v_{j} f_{j}^{*}(\underline{x})\right\} d \underline{x}\\ &=\sum_{i \in \mathbb{Z}} w_{i} f_{i}^{z}(\underline{z}) \underbrace{\sum_{j=1}^{L} v_{j} \underbrace{\int_{\mathbb{R}^N} f_{i}^{x}(\underline{x}) \cdot f_{j}^{*}(\underline{x}) d \underline{x}}_{\text{constant}}}_{\text {constant } C_{i}} \\ &=\sum_{i \in \mathbb{Z}} \underbrace{w_{i}C_i}_{=: \bar{w}_i} f_{i}^{z}(\underline{z}) \\ &=\sum_{i \in \mathbb{Z}} \bar{w}_{i} f_{i}^{z}(\underline{z}) \end{aligned} $$

Filterschritt für nichtlineare Systeme

Wed, 03 Aug 2022 00:00:00 +0000

Skript 10.4

Backward inference: inference against the direction of the modeled dependency, given prior knowledge

Two cases

A concrete value for the output (measurement) is given

A density for the output is given

Backward inference with a concrete measurement

Script 10.4.1

Exercise sheet, Problems 9.2, 9.3

Stochastic mapping from $\underline{a} \in \mathbb{R}^N$ to $\underline{b} \in \mathbb{R}^M$

Probabilistic model $f(\underline{b} \mid \underline{a})$ (graphical)

For a concrete $\underline{\hat{b}}$, we look for $f(\underline{a} \mid \underline{\hat{b}})$ 💪
$$ \begin{aligned} &f(\underline{a} \mid \underline{\hat{b}}) f(\underline{\hat{b}})=f(\underline{\hat{b}} \mid \underline{a}) \cdot f(\underline{a}) \\ &\Rightarrow \underbrace{ f(\underline{a} \mid \underline{\hat{b}})}_{\text{Posterior}}=\underbrace{\frac{1}{f(\underline{\hat{b}})}}_{\text{Normalization constant}} \cdot \underbrace{f(\underline{\hat{b}} \mid \underline{a})}_{\text{Likelihood}} \cdot \underbrace{f(\underline{a})}_{\text{Prior knowledge}} \end{aligned} $$
For the measurement model

Likelihood: $f(\underline{\hat{y}} \mid \underline{x})$, where $\underline{\hat{y}}$ is the measurement

$f^p(\underline{x})$: given prior distribution of the state (i.e. the prediction)

$\Rightarrow$ Posterior distribution:
$$ f^e(\underline{x}) = f(\underline{x} \mid \underline{\hat{y}}) \propto f(\underline{\hat{y}} \mid \underline{x}) \cdot f^p(\underline{x}) $$
Rückwärtsinferenz mit Dichte

Skript 10.4.2

Übungsblatt Aufg. 9.4

Spezialfall: Additives Rauschen

Skript 10.4.3
$$ \underline{y} = \underline{g}(\underline{x}) + \underline{v} = \underline{t} + \underline{v} $$
Generative model

Given

Prior knowledge about the state $\underline{x}$ in the form of $f_x(\underline{x})$

Measurement $\underline{\hat{y}}$

Characteristics of the measurement noise $\underline{v}$, given by $f_v(\underline{v})$

Wanted: $f(\underline{x} \mid \underline{\hat{y}})$

Probabilistic model: factorized description of the joint density
$$ \begin{aligned} f(\underline{t}, \underline{v}, \underline{x}, \underline{y}) &= f(\underline{y} \mid \underline{t}, \underline{v}, \underline{x}) \cdot f(\underline{t}, \underline{v}, \underline{x}) \quad | \quad \underline{y} \text{ is indep. of } \underline{x} \text{ given } \underline{t}, \underline{v} \\\\ &= f(\underline{y} \mid \underline{t}, \underline{v}) \cdot f(\underline{t} \mid \underline{v}, \underline{x}) \cdot f(\underline{v}, \underline{x}) \quad | \quad \underline{t} \text{ is indep. of } \underline{v} \text{ given } \underline{x} \\\\ &= f(\underline{y} \mid \underline{t}, \underline{v}) \cdot f(\underline{t} \mid \underline{x}) \cdot f(\underline{v}, \underline{x}) \quad | \quad \underline{v}, \underline{x} \text{ are indep.}\\\\ &= \delta(\underline{y} - \underline{t} - \underline{v}) \cdot \delta(\underline{t} - \underline{g}(\underline{x})) \cdot f_v(\underline{v}) \cdot f_x(\underline{x}) \end{aligned} $$
Grafisches Modell

View 1: Direct marginalization
$$ \begin{aligned} f(\underline{x} \mid \underline{\hat{y}}) &= \frac{f(\underline{x}, \underline{\hat{y}})}{f(\underline{\hat{y}})} \\ &= \frac{1}{f(\underline{\hat{y}})} \int_{\mathbb{R}^M} \int_{\mathbb{R}^M} f(\underline{t}, \underline{v}, \underline{x}, \underline{\hat{y}}) d\underline{v} d\underline{t} \\ &= \frac{1}{f(\underline{\hat{y}})} \int_{\mathbb{R}^M} \int_{\mathbb{R}^M} \delta(\underline{\hat{y}} - \underline{t} - \underline{v}) \cdot \delta(\underline{t} - \underline{g}(\underline{x})) \cdot f_v(\underline{v}) \cdot f_x(\underline{x}) d\underline{v} d\underline{t} \\ &= \frac{1}{f(\underline{\hat{y}})} f_x(\underline{x}) \int_{\mathbb{R}^M} \delta(\underline{t} - \underline{g}(\underline{x})) f_v(\underline{\hat{y}} - \underline{t}) d\underline{t} \\ &= \frac{1}{f(\underline{\hat{y}})} f_x(\underline{x})f_v(\underline{\hat{y}} - \underline{g}(\underline{x})) \end{aligned} $$
View 2: Uncertain system and deterministic measurement

We regard $f(\underline{y} \mid \underline{x})$ as a substitute system.
$$ \begin{aligned} f(\underline{y} \mid \underline{x}) &= \frac{1}{f_x(\underline{x})} f(\underline{x}, \underline{y}) \\\\ &= \int_{\mathbb{R}^M} \int_{\mathbb{R}^M} \delta(\underline{t} - \underline{g}(\underline{x})) \delta(\underline{y} - \underline{t} - \underline{v}) f_v(\underline{v}) d\underline{v} d\underline{t} \\\\ &= f_v(\underline{y} - \underline{g}(\underline{x})) \end{aligned} $$
This gives the simplified system

Desired posterior density
$$ \begin{aligned} f(\underline{x} \mid \underline{\hat{y}}) &= \frac{f(\underline{x}, \underline{\hat{y}})}{f(\underline{\hat{y}})} \\\\ &= \frac{1}{f(\underline{\hat{y}})} \cdot f(\underline{\hat{y}} \mid \underline{x}) \cdot f(\underline{x}) \\\\ &= \frac{1}{f(\underline{\hat{y}})} \cdot f_v(\underline{\hat{y}} - \underline{g}(\underline{x})) \cdot f_x(\underline{x}) \end{aligned} $$
View 3: Deterministic system and uncertain measurement
$$ \begin{aligned} f(\underline{\hat{y}} \mid \underline{t}) &= \frac{f(\underline{\hat{y}}, \underline{t})}{f(\underline{t})} \\\\ &= \frac{1}{f(\underline{t})} \int_{\mathbb{R}^M} \underbrace{f(\underline{v}, \underline{t}, \underline{\hat{y}})}_{= f(\underline{\hat{y}} \mid \underline{v}, \underline{t}) f(\underline{v}, \underline{t}) = f(\underline{\hat{y}} \mid \underline{v}, \underline{t}) f(\underline{v}) f(\underline{t})} d\underline{v} \\\\ &= \frac{1}{f(\underline{t})} f(\underline{t}) \int_{\mathbb{R}^M} f_v(\underline{v}) \delta(\underline{\hat{y}} - \underline{t} - \underline{v}) d\underline{v} \\\\ &= f_v(\underline{\hat{y}} - \underline{t}) \end{aligned} $$
Desired posterior density. Note that $\underline{t} = \underline{g}(\underline{x})$ is a deterministic function of $\underline{x}$, so $f(\underline{x}, \underline{t}) = \delta(\underline{t} - \underline{g}(\underline{x})) f(\underline{x})$ rather than a product of marginals.
$$ \begin{aligned} f(\underline{x} \mid \underline{\hat{y}}) &= \frac{f(\underline{x}, \underline{\hat{y}})}{f(\underline{\hat{y}})} \\\\ &= \frac{1}{f(\underline{\hat{y}})} \int_{\mathbb{R}^M} f(\underline{x}, \underline{t}, \underline{\hat{y}}) d\underline{t} \\\\ &= \frac{1}{f(\underline{\hat{y}})} \int_{\mathbb{R}^M} \underbrace{f(\underline{\hat{y}} \mid \underline{x}, \underline{t})}_{=f(\underline{\hat{y}} \mid \underline{t})} f(\underline{x}, \underline{t}) d\underline{t} \quad \mid f(\underline{x}, \underline{t}) = f(\underline{t} \mid \underline{x}) f(\underline{x}) \\\\ &= \frac{1}{f(\underline{\hat{y}})} \int_{\mathbb{R}^M} f_v(\underline{\hat{y}} - \underline{t}) \cdot \delta(\underline{t} - \underline{g}(\underline{x})) \cdot f(\underline{x}) d\underline{t} \\\\ &= \frac{1}{f(\underline{\hat{y}})} \cdot f_v(\underline{\hat{y}} - \underline{g}(\underline{x})) \cdot f(\underline{x}) \end{aligned} $$
Difficulties of the filter step

Problem 1: The type of density describing the estimate changes.

Example:

Prior
$$ f^p(x) \propto \exp \left[-\frac{1}{2} \frac{(x - x^p)^2}{\sigma_p^2}\right] $$

Measurement mapping
$$ y = x^2 + v \quad v \sim f^v(v) $$
e.g. $f^v(v)$ is a zero-mean Gaussian with variance $1$
$$ f^L(y \mid x) = f^v(y - x^2) \propto \exp \left[-\frac{1}{2} (y - x^2)^2\right] $$

Posterior
$$ \begin{aligned} f^{e}(x) & \propto f^{p}(x) \cdot f^{L}(\hat{y} \mid x)\\ & \propto \exp \left[-\frac{1}{2}\left(\frac{x-x^{p}}{\sigma_{p}}\right)^{2}\right] \cdot \exp \left[-\frac{1}{2}\left(\hat{y}-x^{2}\right)^{2}\right] \\ & \propto \exp \left[a x^{4}+b x^{3}+c x^{2}+d x+e\right] \end{aligned} $$
which is no longer Gaussian! 🤪
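To make the loss of Gaussianity visible, the posterior can be evaluated on a grid. This is only a sketch with assumed numbers ($x^p = 0$, $\sigma_p = 2$, $\hat{y} = 4$, unit noise variance); none of these values come from the lecture.

```python
import numpy as np

# Evaluation grid (assumed range, wide enough for the chosen numbers)
x = np.linspace(-4.0, 4.0, 4001)
dx = x[1] - x[0]

x_p, sigma_p = 0.0, 2.0   # assumed prior mean and standard deviation
y_hat = 4.0               # assumed measurement of y = x^2 + v

prior = np.exp(-0.5 * ((x - x_p) / sigma_p) ** 2)
likelihood = np.exp(-0.5 * (y_hat - x ** 2) ** 2)

# Bayes: posterior is proportional to prior times likelihood, then normalize
posterior = prior * likelihood
posterior /= posterior.sum() * dx

# Locate strict local maxima of the posterior on the grid
inner = posterior[1:-1]
is_peak = (inner > posterior[:-2]) & (inner > posterior[2:])
modes = x[1:-1][is_peak]
```

With these numbers the posterior has two symmetric modes close to $\pm\sqrt{\hat{y}}$, which a single Gaussian cannot represent.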

Problem 2: The density becomes more complex with every step

Example

The prior is a mixture with 2 components
$$ f^p(x) = \sum_{i=1}^2 f^{p, i}(x) $$

Measurement mapping
$$ y = x + v \quad v \sim f^v(v) = \sum_{j=1}^2 f^{v, j}(v) $$

Posterior
$$ \begin{aligned} f^e(x) & \propto f^{p}(x) \cdot f^{v}(\hat{y}-x) \\ &=\left(\sum_{i=1}^{2} f^{p, i}(x)\right) \cdot\left(\sum_{j=1}^{2} f^{v, j}(\hat{y}-x)\right) \\ &=\sum_{i=1}^{4} f^{e, i}(x) \end{aligned} $$
$\Rightarrow$ Overall, approximation is unavoidable! 🤪

Faktorgraphen und Message Passing

Wed, 03 Aug 2022 00:00:00 +0000

Faktorgraphen

Regeln

Beispiel

Message Passing

Define a message on an edge

A cut splits the system into 2 parts

Consider a block with one input and one output

Given: $R_x$ and $L_y$
$$ \begin{aligned} &R_{y}(y)=\int f(y \mid x) \cdot R_{x}(x) d x \\ &L_{x}(x)=\int f(y \mid x) \cdot L_{y}(y) d y \end{aligned} $$
Special case: linear system (scalar $H \neq 0$)
$$ \begin{aligned} y &= Hx\\ \Rightarrow f(y \mid x) &= \delta(y - Hx) \end{aligned} $$ $$ \begin{aligned} R_y(y) &= \int \delta(y-Hx) R_x(x) dx \quad \mid g(x):=y-Hx, g^\prime(x) = -H, x_1 = \frac{y}{H}\\ &= \int \frac{1}{|H|} \delta(x - \frac{y}{H}) R_x(x) dx \\ &= \frac{1}{|H|} R_x(\frac{y}{H}) \end{aligned} $$ $$ \begin{aligned} L_{x}(x) &=\int f(y \mid x) L_{y}(y) d y \\ &=\int \delta(y-H x) \cdot L_{y}(y) d y \\ &=L_{y}(H \cdot x) \end{aligned} $$
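The two closed-form messages for the scalar linear block can be checked numerically. The sketch below assumes Gaussian input messages and $H = 2$; the helper `gauss` and all numeric values are illustrative assumptions.

```python
import numpy as np

def gauss(x, mean, var):
    """Scalar Gaussian density N(x; mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

H = 2.0  # assumed scalar gain of the block y = H x

def R_x(x):
    """Assumed forward message entering the block."""
    return gauss(x, 1.0, 0.25)

def L_y(y):
    """Assumed backward message entering the block."""
    return gauss(y, 3.0, 1.0)

def R_y(y):
    """Forward message after the block: R_y(y) = R_x(y / H) / |H|."""
    return R_x(y / H) / abs(H)

def L_x(x):
    """Backward message before the block: L_x(x) = L_y(H x)."""
    return L_y(H * x)

grid = np.linspace(-10.0, 10.0, 20001)
step = grid[1] - grid[0]
mass_R_y = R_y(grid).sum() * step   # forward message stays normalized
mass_L_x = L_x(grid).sum() * step   # backward message is not a density in x
```

The forward message remains a valid density, while the backward message integrates to $1/|H|$ here, matching the earlier remark that a likelihood need not integrate to one.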
Beispiel

Gegeben: $\underline{\hat{x}}_4$

Gesucht: $f(\underline{x}_2 \mid \underline{\hat{x}}_4)$

Ziel: Rekursive Berechnung der Nachrichten

Direkt gegeben:
$$ R_{1}\left(\underline{x}_{1}\right)=f_{1}\left(\underline{x}_{1}\right) \quad L_{3}\left(\underline{x}_{3}\right)=f\left(\underline{\hat{x}}_{4} \mid \underline{x}_{3}\right) $$

Benötigt: $L_2(\underline{x}_2)$ und $R_2(\underline{x}_2)$
$$ \begin{aligned} &R_{2}\left(\underline{x}_{2}\right)=\int f\left(\underline{x}_{2} \mid \underline{x}_{1}\right) R_{1}\left(\underline{x}_{1}\right) d \underline{x}_{1} \\ &L_{2}\left(\underline{x}_{2}\right)=\int f\left(\underline{x}_{3} \mid \underline{x}_{2}\right) L_{3}\left(\underline{x}_{3}\right) d \underline{x}_{3} \end{aligned} $$

$\Rightarrow$ Fusionsergebnis:
$$ f\left(\underline{x}_{2} \mid \underline{\hat{x}}_{4}\right) \propto L_{2}\left(\underline{x}_{2}\right) \cdot R_{2}\left(\underline{x}_{2}\right) $$

Vereinfachte Filterung

Wed, 03 Aug 2022 00:00:00 +0000

Approximation der Likelihood

Simplification of the likelihood $f(\underline{y} \mid \underline{x})$

Analogous to the simplified prediction

Approximate representation by a Gaussian mixture

Important: decoupled components

$$ f(\underline{y} \mid \underline{x}) = \sum_{i \in \mathbb{Z}} f_i^y(\underline{y}) f_i^x(\underline{x}) $$

Resulting simplified filter step

Likelihood for a concrete measurement $\underline{\hat{y}}$:
$$ f^{L}(\underline{x})=f(\underline{\hat{y}} \mid \underline{x})=\sum_{i \in \mathbb{Z}} f_{i}^{y}(\underline{\hat{y}}) \cdot f_{i}^{x}(\underline{x}) $$
Prior Gaussian mixture:
$$ f^{p}(\underline{x})=\sum_{j=1}^{L} f_{j}^{p}(\underline{x}) $$
$\Rightarrow$ Posterior:
$$ \begin{aligned} f^{e}(\underline{x}) & \propto f^{p}(\underline{x}) \cdot f^{L}(\underline{x}) \\ &= \left(\sum_{j=1}^{L} f_{j}^{p}(\underline{x})\right) \cdot \left(\sum_{i \in \mathbb{Z}} f_{i}^{y}(\underline{\hat{y}}) \cdot f_{i}^{x}(\underline{x})\right) \\ &= \sum_{i \in \mathbb{Z}} f_{i}^{y}(\underline{\hat{y}}) \sum_{j=1}^{L} f_{j}^{p}(\underline{x}) \cdot f_{i}^{x}(\underline{x}) \end{aligned} $$
But the number of components grows! 🤪

Einfache Filter für stark nichtlineare Systeme

Tue, 09 Aug 2022 00:00:00 +0000

Using “simple” filters for strongly nonlinear systems

2 variants

Approximate the state densities by Gaussian mixtures $\rightarrow$ bank of nonlinear Kalman filters for prediction and filtering

Approximate all densities by a discrete-valued representation $\rightarrow$ discrete-valued (grid-based) filter

Gaussian Mixture Filter

Motivation

Approximation der Zustandsschätzung durch Gaussian Mixture
$$ f(\underline{x})=\sum_{i=1}^{L} w_{i} \mathcal{N}\left(\underline{x}-\underline{\hat{x}}_{i}, C_{i}\right) $$
mit
$$ \begin{aligned} &w_{i} \geqslant 0, \quad i \in\{1, \ldots,L\} \\ &\sum_{i=1}^{L} w_{i}=1 \end{aligned} $$
(Damit ist Gaussian Mixture für beliebige $L$ eine gültige Dichte)

Parameter

Gewichtsvektor $\underline{w} = [w_1, \dots, w_L]^T$

Mittelwerte $\underline{\hat{x}}_1, \dots, \underline{\hat{x}}_L$

Kovarianzmatrizen $C_1, \dots, C_L$

Gaussian mixtures are universal approximators: if $L$ is large enough, any density can be approximated to arbitrary accuracy.

Procedure

Goal: carry over what we know about the Kalman filter for weakly nonlinear systems $\rightarrow$ strongly nonlinear case

Therefore: process each component $i$ individually (i.e. neglect the overlap between components)

This yields a bank of nonlinear Kalman filters that run in parallel.

Works particularly well if

the overlap between components is small

the individual components are narrow (so the nonlinearity seen by each component is weak)

Prediction step

System model
$$ \underline{x}_{k+1} = \underline{a}_k(\underline{x}_k) + \underline{w}_k $$
Simplified notation:
$$ \underline{z} = \underline{a}(\underline{x}) + \underline{w} \quad \underline{w} \sim \text{Gaussian} $$
💡Key idea: split up the Chapman-Kolmogorov equation
$$ \begin{aligned} f^{p}(\underline{z})&=\int_{\mathbb{R}^{N}} f^{w}(\underline{z}-\underline{a}(\underline{x})) \cdot f^{e}(\underline{x}) d \underline{x}\\ &=\int_{\mathbb{R}^{N}} f^{w}(\underline{z}-\underline{a}(\underline{x})) \cdot\left[\sum_{i=1}^{L} w_{i} \mathcal{N} \left(\underline{x}-\underline{\hat{x}}_{i}^{e}, C_{i}^{e}\right)\right] d \underline{x}\\ &=\sum_{i=1}^{L} w_{i} \underbrace{\int_{\mathbb{R}^{N}} f^{w}(\underline{z}-\underline{a}(\underline{x})) \mathcal{N}\left(\underline{x}-\underline{\hat{x}}_{i}^{e}, C_{i}^{e}\right) d \underline{x}}_{\approx \mathcal{N}(\underline{z} - \underline{z}_{i}^p, C_{i}^p)} \end{aligned} $$
So for each $i$ we simply approximate the integral by a local predicted density, which is again taken to be Gaussian because the component is so narrow.

$\underline{z}_{i}^p, C_{i}^p$ are obtained by applying a nonlinear Kalman filter to component $i$
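A minimal sketch of this per-component prediction, assuming a scalar state and using first-order linearization (an EKF-style prediction) as the nonlinear Kalman filter. The notes do not fix a particular filter, so this choice, the function name `gm_predict_ekf`, and the example system are assumptions.

```python
import numpy as np

def gm_predict_ekf(weights, means, variances, a, a_prime, noise_var):
    """Predict every Gaussian component separately by linearizing a() at its mean."""
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    slope = a_prime(means)                           # local slope A_i = a'(x_i^e)
    pred_means = a(means)                            # z_i^p = a(x_i^e)
    pred_vars = slope ** 2 * variances + noise_var   # C_i^p = A_i C_i^e A_i + C^w
    return weights, pred_means, pred_vars            # weights stay unchanged

# Assumed example system: z = x + sin(x) + w, w ~ N(0, 0.01)
w, m, v = gm_predict_ekf(
    weights=[0.3, 0.7],
    means=[0.0, np.pi],
    variances=[0.04, 0.04],
    a=lambda x: x + np.sin(x),
    a_prime=lambda x: 1.0 + np.cos(x),
    noise_var=0.01,
)
```

The component at $x = 0$ is stretched by the local slope 2, whereas the component at $x = \pi$ sees slope 0 and keeps only the noise variance. This shows why narrow components are needed for the linearization to be trustworthy.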

Filter step

Measurement model:
$$ \underline{y}_k=\underline{h}_{k}\left(\underline{x}_{k}\right)+\underline{v}_{k} $$
Simplified notation:
$$ \underline{y}=\underline{h}(\underline{x})+\underline{v} \quad \underline{v} \sim \text{Gaussian} $$
Filter step
$$ \begin{aligned} f^{e}(\underline{x}) &= \underbrace{c^{e}}_{\text{normalization constant}} f^{v}\left(\underline{\hat{y}}-\underline{h}(\underline{x})\right) \cdot \sum_{i=1}^{L} w_{i} \mathcal{N}\left(\underline{x}-\hat{\underline{x}}_{i}^{p}, C_{i}^{p}\right) \\ &=c^{e} \sum_{i=1}^{L} w_{i} \underbrace{f^{v}(\underline{\hat{y}}-\underline{h}(\underline{x})) \cdot \mathcal{N} \left(\underline{x}-\underline{\hat{x}}_{i}^{p}, C_{i}^{p}\right)}_{\approx k_i \mathcal{N} \left(\underline{x}-\underline{\hat{x}}_{i}^{e}, C_{i}^{e}\right)} \\ &= c^e \sum_{i=1}^{L} w_{i} k_i \mathcal{N} \left(\underline{x}-\underline{\hat{x}}_{i}^{e}, C_{i}^{e}\right) \end{aligned} $$
$\underline{\hat{x}}_{i}^{e}, C_{i}^{e}$ are determined by a nonlinear Kalman filter.

Rasterbasierte Filter

Rasterbasierte Repräsentation von Dichten

First: scalar case

Given: density $f(x), x \in \mathbb{R}$

Wanted: discrete-valued representation
$$ \underline{\eta} \in \mathbb{R}_{+}^{L}, \quad \underline{\mathbf{1}}^{\top} \cdot \underline{\eta}=1 \text{ (normalization)} $$

Grid-based filter and prediction step
$$ \underline{\eta}=\left[\begin{array}{c} \eta _{1} \\ \eta _{2} \\ \vdots \\ \eta _{L} \end{array}\right] $$
Assumption: represent $\eta_i$ by a Dirac delta function at the center of each interval

Criterion: the integral values should be equal.
$$ \begin{aligned} &\int_{x_{i-1}}^{x_i}f(x)dx \overset{!}{=} \underbrace{\int_{x_{i-1}}^{x_i} \eta_i \cdot \delta(x - \frac{x_i + x_{i-1}}{2}) dx}_{=\eta_i} \\ &\Rightarrow \eta_i \propto \int_{x_{i-1}}^{x_i}f(x)dx \quad i \in \{1, \dots, L\} \end{aligned} $$
Normalization is required:
$$ \eta_{i}:=\frac{\eta_{i}}{\sum_{j} \eta_{j}} \quad i \in\left\{1, \dots, L\right\} $$
In many cases the integral over $f(x)$ has no analytical solution, so the integration is too expensive.

Alternative: piecewise constant approximation of $f(x)$

But: an optimal fit also requires integration

Therefore: use the density value at the interval midpoint
$$ h_i = f(\frac{x_i + x_{i-1}}{2}) $$
This gives
$$ \begin{aligned} &\int_{x_{i-1}}^{x_i} f(x) dx \approx \int_{x_{i-1}}^{x_i} h_i dx = h_i(\underbrace{x_i - x_{i-1}}_{=\Delta}) \\ & \Rightarrow \eta_i \propto \Delta \cdot h_i = \Delta \cdot f(\frac{x_i + x_{i-1}}{2}) \end{aligned} $$
followed by normalization
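The midpoint rule above can be sketched as follows, assuming an equidistant grid that starts at $x_0$ with cell width $\Delta$. The function name `quantize_density` and the standard normal test density are illustrative choices.

```python
import numpy as np

def quantize_density(f, x0, delta, L):
    """Midpoint quantization: eta_i is proportional to delta * f((x_i + x_{i-1}) / 2)."""
    i = np.arange(1, L + 1)
    centers = x0 + (2 * i - 1) / 2 * delta   # cell midpoints
    eta = delta * f(centers)
    return centers, eta / eta.sum()          # normalize so that 1^T eta = 1

def std_normal(x):
    """Standard normal density, used here only as an example input."""
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

centers, eta = quantize_density(std_normal, x0=-5.0, delta=0.1, L=100)
```

Because of the final normalization, the factor $\Delta$ (and any unknown normalization constant of $f$) cancels on an equidistant grid; it is kept in the code only to mirror the formula.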

Grid-based filter step

Generative model
$$ y = h(x, v) $$
Convert it into the probabilistic model $f(y \mid x)$

The measurement $\hat{y}$ is not discrete-valued $\rightarrow$ quantize $f(\hat{y} \mid x) = f^L(x)$ over $x$

Since $f(\hat{y} \mid x)$ usually cannot be integrated analytically $\rightarrow$
$$ \eta_{i}^{L} \propto \Delta f^{L}\left(\frac{x_{i}+x_{i-1}}{2}\right) \quad i \in\{1, \dots,L\} $$
followed by normalization.

For a given prior $\underline{\eta}^{p}=\left[\eta_{1}^{p}, \eta_{2}^{p}, \ldots, \eta_{L}^{p}\right]^{\top}$
$$ \underline{\eta}^{e} \propto \underline{\eta}^{L} \odot \underline{\eta}^{p} \tag{posterior distribution} $$
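A minimal sketch of this element-wise update, assuming the measurement model $y = x + v$ with unit-variance Gaussian noise and a standard normal prior, so that the exact posterior $\mathcal{N}(1, 0.5)$ for $\hat{y} = 2$ is available for comparison. The function name and all numbers are assumptions.

```python
import numpy as np

def grid_filter_step(eta_p, centers, likelihood):
    """Posterior weights: eta_e is proportional to eta_L (element-wise) times eta_p."""
    eta_L = likelihood(centers)    # the factor delta drops out in the normalization
    eta_e = eta_L * eta_p
    total = eta_e.sum()
    if total == 0.0:
        raise ValueError("prior and likelihood have no common support on the grid")
    return eta_e / total

delta = 0.05
centers = -6.0 + (np.arange(240) + 0.5) * delta   # midpoints of 240 cells on [-6, 6]

eta_p = np.exp(-0.5 * centers ** 2)               # quantized N(0, 1) prior
eta_p /= eta_p.sum()

y_hat = 2.0                                       # assumed measurement
eta_e = grid_filter_step(eta_p, centers, lambda x: np.exp(-0.5 * (y_hat - x) ** 2))

post_mean = np.sum(eta_e * centers)
post_var = np.sum(eta_e * (centers - post_mean) ** 2)
```

The filter step costs only $O(L)$ operations, since it is a single Hadamard product followed by a normalization.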
Grid-based prediction step

Generative model
$$ x_{k+1} = a_k(x_k, w_k) $$
Simplified notation
$$ z = a(x, w) $$
Probabilistic model: $f(z \mid x)$

Here a 2D density has to be quantized, even for scalar states.

$\Rightarrow$ The result is a matrix
$$ A_{i j} \propto f\left(\frac{z_{j}+z_{j-1}}{2} \mid \frac{x_{i}+x_{i-1}}{2}\right) $$
Normalization

$A$ is a transition matrix

Stochastic matrix, each row sums to 1 (for a fixed $x_i$, the weights over all $z_j$ sum to one)

$A_{i j}:=\displaystyle\frac{A_{i j}}{\sum_{l=1}^{L} A_{i l}}, i \in\{1, \ldots,L\}$

Given:

Transition matrix $A \in \mathbb{R}_{+}^{L \times L}$

Estimate from the last filter step $\underline{\eta}^e \in \mathbb{R}_{+}^{L}$

Result of the prediction step:
$$ \underline{\eta}^p = A^\top \underline{\eta}^e $$
More expensive than the filter step 🤪
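The prediction step can be sketched as below, assuming the same fixed grid for $x$ and $z$ and a made-up transition density for $z = 0.5x + w$ with $w \sim \mathcal{N}(0, 0.25)$. The helper names are hypothetical.

```python
import numpy as np

def transition_matrix(f_trans, x_centers, z_centers):
    """A[i, j] is proportional to f(z_j | x_i); every row is normalized to sum to one."""
    A = f_trans(z_centers[None, :], x_centers[:, None])
    return A / A.sum(axis=1, keepdims=True)

def grid_predict_step(A, eta_e):
    """Discrete Chapman-Kolmogorov equation: eta_p = A^T eta_e."""
    return A.T @ eta_e

delta = 0.1
centers = -5.0 + (np.arange(100) + 0.5) * delta   # midpoints of 100 cells on [-5, 5]

def f_trans(z, x):
    """Assumed (unnormalized) transition density of z = 0.5 x + w, w ~ N(0, 0.25)."""
    return np.exp(-0.5 * (z - 0.5 * x) ** 2 / 0.25)

A = transition_matrix(f_trans, centers, centers)

eta_e = np.zeros(100)
eta_e[70] = 1.0                                   # all mass in the cell centered at 2.05
eta_p = grid_predict_step(A, eta_e)
pred_mean = np.sum(eta_p * centers)
```

Building $A$ and applying it costs $O(L^2)$, compared with $O(L)$ for the filter step, which is the extra effort noted above.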

Extension of the prediction step

Assumed so far: the grid for $z$ (i.e. $x_{k+1}$) is already known/fixed

Unfortunately this is not practical, because the range of values results from the mapping.

Special case: linear system with additive noise (the general case is harder)
$$ z = \underbrace{x + u}_{z^\prime} + w \quad w \sim f^w(w) $$

Intermediate quantity $z^\prime$: use the input $\hat{u}$ to shift the grid (moving grid)
$$ z_i^\prime = x_i + \hat{u} \quad i \in \{1, \dots, L\} $$
We set $z_i = z_i^\prime$

Then convolve with $f^w(w)$:
$$ f\left(z \mid z^{\prime}\right)=f^{w}\left(z-z^{\prime}\right) $$
Then quantize $f(z \mid z^\prime) \Rightarrow$
$$ \begin{aligned} A_{i j} &=f\left(\frac{z_{j}+z_{j - 1}}{2} \mid \frac{z_{i}^{\prime}+z_{i - 1}^\prime}{2}\right) \\ A_{i j}&=f^{w}\left(\frac{1}{2}\left[z_{j}+z_{j-1}-\left(z_{i}^{\prime}+z_{i-1}^{\prime}\right)\right]\right) \end{aligned} $$
We know
$$ \begin{aligned} \frac{z_{j}+z_{j-1}}{2}&=\frac{2 j-1}{2} \Delta+z_{0} \\ \frac{z_{i}^{\prime}+z_{i-1}^{\prime}}{2}&=\frac{2 i-1}{2} \Delta+z_{0}^{\prime} \\ \Rightarrow A_{ij} &= f^w(\Delta[j - i]), \text{ if } z_0 = z_0^\prime \Rightarrow j - i \in \{-(L-1), \dots, -1, 0, 1, \dots, L - 1\} \end{aligned} $$
Discretize $f^w(\cdot)$ in advance

Enter the values into the transition matrix $A$ with $A_{ij} = f^w(\Delta(j-i))$

Then compute the predicted weights as before.

Rekonstruktion kontinuierlicher Dichten

The result of prediction and filtering is in discrete-valued form $\underline{\eta} \in \mathbb{R}_+^L$

Computing characteristic quantities is easy, but requires the grid positions

Expected value
$$ \begin{aligned} \hat{x} &=\int_{\mathbb{R}} x \sum_{i=1}^{L} \eta_{i} \delta\left(x-\frac{x_{i}+x_{i-1}}{2}\right) d x \\ &=\sum_{i=1}^{L} \eta_{i} \frac{x_{i}+x_{i-1}}{2} \quad \mid \frac{x_{i}+x_{i-1}}{2}=\frac{2 i-1}{2} \Delta+x_{0} \\ &= \sum_{i=1}^{L} \eta_{i} (\frac{2i-1}{2} \Delta + x_0) \end{aligned} $$
The variance is obtained analogously.
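As a small sketch, mean and variance of the Dirac-mixture representation can be computed directly from $\underline{\eta}$, $x_0$ and $\Delta$. The function name `grid_moments` and the three-cell example are assumptions for illustration.

```python
import numpy as np

def grid_moments(eta, x0, delta):
    """Mean and variance of the Dirac mixture sum_i eta_i * delta(x - center_i)."""
    eta = np.asarray(eta, dtype=float)
    i = np.arange(1, len(eta) + 1)
    centers = (2 * i - 1) / 2 * delta + x0        # midpoints (x_i + x_{i-1}) / 2
    mean = np.sum(eta * centers)
    var = np.sum(eta * (centers - mean) ** 2)
    return mean, var

# Three cells of width 1 starting at x0 = 0, i.e. centers 0.5, 1.5, 2.5
mean, var = grid_moments([0.25, 0.5, 0.25], x0=0.0, delta=1.0)
```

For this example the mean is 1.5 and the variance is 0.5. The variance of a continuous reconstruction would additionally contain the within-cell spread, which the Dirac mixture ignores.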

Wanted: continuous reconstruction $f(x)$ from $\underline{\eta}$

As a Dirac mixture
$$ f(x) \approx \sum_{i=1}^{L} \eta_{i} \delta\left(x-\frac{x_{i}+x_{i-1}}{2}\right) $$
Different interpolation options

Piecewise constant interpolation
$$ \int_{x_{i-1}}^{x_{i}} h_{i} d x \overset{!}{=} \int_{x_{i-1}}^{x_{i}} \eta_{i} \delta\left(x-\frac{x_{i}+x_{i-1}}{2}\right) d x \Rightarrow h_{i}=\frac{\eta_{i}}{\Delta} $$

Continuous, piecewise linear interpolation
$$ (t_i + t_{i-1}) \frac{\Delta}{2} = \eta_i $$
with the additional condition
$$ t_0 = t_1 $$

Erweiterungen

Multidimensional case: $\underline{x} \in \mathbb{R}^N$

Filter step is analogous

Prediction step: $f(\underline{z} \mid \underline{x})$ now maps from $\mathbb{R}^N$ to $\mathbb{R}^N$ $\Rightarrow$ $A$ discretizes a density over $\mathbb{R}^{2N}$

Remedies

Moving grid for nonlinear system models

Adaptive resolution of an equidistant / homogeneous grid

Inhomogeneous grids $\rightarrow$ variable resolution

Efficient implementation, e.g. sparse matrices ($0$ entries are not stored explicitly)

Zusammenfassung

Sun, 07 Aug 2022 00:00:00 +0000

Forward Inference

Given

$f_a(a)$

$g(a)$

Wanted: $f_b(b)$

Steps:

Rewrite $f(b \mid a) = \delta(b - g(a))$ using
$$ \delta (g(x)) = \sum_{i=1}^N \frac{1}{|g^\prime(x_i)|}\delta (x - x_i) $$
where

$g(x_i) = 0$ (i.e. the $x_i$ are the roots, $i = 1, 2, \dots, N$)

$g^\prime(x_i) \neq 0$

Compute $f_b(b)$ using the Chapman-Kolmogorov equation
$$ f_b(b) = \int f(b \mid a) f_a(a) da $$
and insert the rewritten form of $f(b \mid a)$ from step 1. This yields the desired density $f_b(b)$ as a function of $f_a(a)$.

Example: Exercise 9.1
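
The two steps can be sketched in code. Inserting the rewritten Dirac into the Chapman-Kolmogorov integral gives $f_b(b) = \sum_i f_a(a_i) / |g^\prime(a_i)|$, where the $a_i$ solve $g(a_i) = b$. The following is a minimal sketch under assumptions that are not from the notes: the example system $b = a^2$, a standard Gaussian prior, and the hypothetical helper names `forward_density`, `square_roots`, and `square_derivative`.

```python
import math

def gauss_pdf(a, mean=0.0, std=1.0):
    return math.exp(-0.5 * ((a - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def forward_density(b, f_a, roots, dg):
    """f_b(b) = sum_i f_a(a_i) / |g'(a_i)| over all a_i with g(a_i) = b."""
    return sum(f_a(a_i) / abs(dg(a_i)) for a_i in roots(b))

# Assumed example system (illustration only): b = g(a) = a^2
def square_roots(b):
    return [math.sqrt(b), -math.sqrt(b)] if b > 0 else []

def square_derivative(a):
    return 2.0 * a

f_b_at_2 = forward_density(2.0, gauss_pdf, square_roots, square_derivative)
```

For this example the result agrees with the known chi-square density with one degree of freedom, $e^{-b/2} / \sqrt{2 \pi b}$.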

Backward Inference

Concrete measurement

Rewrite $f_b(b \mid a) = \delta(b - g(a))$ using
$$ \delta (g(x)) = \sum_{i=1}^N \frac{1}{|g^\prime(x_i)|}\delta (x - x_i) $$
where

$g(x_i) = 0$ (i.e. the $x_i$ are the roots, $i = 1, 2, \dots, N$)

$g^\prime(x_i) \neq 0$

Compute $f_b(b)$
$$ f_b(b) = \int f_{a, b}(a, b) da = \int f_{b}(b \mid a) f_a(a) da $$
by inserting the rewritten form of $f_b(b \mid a)$ from step 1

Compute $f_a(a \mid b)$ using Bayes' rule
$$ f_a(a \mid b) = \frac{f_b(b \mid a) f_a(a)}{f_b(b)} = \frac{\overbrace{\delta(b - g(a))}^{\text{Step 1}} f_a(a)}{\underbrace{f_b(b)}_{\text{Step 2}}} $$

Example: Exercises 9.2, 9.3
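
For a concrete measurement $b$, the posterior from the three steps is a Dirac mixture located at the solutions $a_i$ of $g(a_i) = b$, with weights proportional to $f_a(a_i) / |g^\prime(a_i)|$. The following is a minimal sketch; the example system $b = a^2$, the prior $\mathcal{N}(1, 1)$, and the function name `posterior_dirac_weights` are assumptions for illustration, not taken from the exercises.

```python
import math

def gauss_pdf(a, mean, std):
    return math.exp(-0.5 * ((a - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def posterior_dirac_weights(b, f_a, roots, dg):
    """Locations a_i and normalized weights of f_a(a | b) for b = g(a)."""
    locations = roots(b)
    raw = [f_a(a_i) / abs(dg(a_i)) for a_i in locations]  # step 1 inserted into Bayes' rule
    f_b = sum(raw)                                        # step 2: marginal f_b(b)
    if f_b == 0:
        return locations, []
    return locations, [r / f_b for r in raw]              # step 3: normalization

# Assumed example: prior a ~ N(1, 1), measurement model b = a^2, measured b = 4
locations, weights = posterior_dirac_weights(
    4.0,
    lambda a: gauss_pdf(a, 1.0, 1.0),
    lambda b: [math.sqrt(b), -math.sqrt(b)],
    lambda a: 2.0 * a,
)
```

Both roots $a = \pm 2$ explain the measurement equally well, so the prior decides: the root closer to the prior mean receives almost all of the weight.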

Uncertain measurement

Steps:

Extend the system by an additional stochastic mapping and a fixed output $\hat{z}$

Determine $f(\hat{z} \mid y)$
$$ \begin{aligned} f(\hat{z} \mid y) &= \frac{f(y \mid \hat{z})f(\hat{z})}{f(y)} \\ &= \frac{f(y \mid \hat{z})f(\hat{z})}{\int f(y, x) dx} \\ &= \frac{f(y \mid \hat{z})f(\hat{z})}{\int f(y \mid x)f(x) dx} \\ &= \frac{f(y \mid \hat{z})f(\hat{z})}{\int \delta(y - g(x)) f(x) dx} \end{aligned} $$
Then insert the rewritten form of $\delta(y - g(x))$, obtained from
$$ \delta (g(x)) = \sum_{i=1}^N \frac{1}{|g^\prime(x_i)|}\delta (x - x_i) $$
where

$g(x_i) = 0$ (i.e. the $x_i$ are the roots, $i = 1, 2, \dots, N$)

$g^\prime(x_i) \neq 0$

Compute the backward inference $f(x \mid \hat{z})$
$$ \begin{aligned} f(x \mid \hat{z}) &=\frac{1}{f(\hat{z})} \cdot f(x, \hat{z}) \quad \mid \text{marginalization over } y\\ &=\frac{1}{f(\hat{z})} \int f(x, y, \hat{z}) d y \\ &=\frac{1}{f(\hat{z})} \int f(\hat{z} \mid y, x) \cdot f(y , x) d y \quad \mid \hat{z} \text{ and } x \text{ are conditionally independent given } y\\ &=\frac{1}{f(\hat{z})} \int f(\hat{z} \mid y) \cdot f(y \mid x) \cdot f(x) d y \\ &=\frac{1}{f(\hat{z})} \int \underbrace{f(\hat{z} \mid y)}_{\text{computed in the previous step}} \cdot \underbrace{f(y \mid x)}_{\text{system model}} \cdot f(x) d y \end{aligned} $$

Example: Exercise 9.4

Sample-based Filters

Wed, 10 Aug 2022 00:00:00 +0000

Stochastische Informationsverarbeitung

Thu, 26 May 2022 00:00:00 +0000

Meta Information

Lecture website: Stochastische Informationsverarbeitung

Semester: WS21/22

Language: German

Lecturer:

Prof. Dr.-Ing. Uwe Hanebeck

Daniel Frisch

Exam type: Oral

Description

Content

The course (SI) teaches the fundamental and formal foundations of state estimation, centered on prediction and filtering.

Models and state estimators for discrete-valued and continuous-valued linear systems as well as for general systems are covered

For discrete-valued and continuous-valued linear systems

Prediction and filtering (HMM, Kalman filter)

Smoothing for discrete-valued systems (in addition)

Modeling of general static and dynamic systems

Deriving a probabilistic system description from a generative one

Different types of noise influence (additive, multiplicative) as well as different density representations are examined.

Fundamental methods of state estimation for general systems

Challenges in implementing generic estimators

Goals

Review of the basics of probability theory

Intuition for systems theory and the treatment of uncertainties

Understanding of

System modeling and system identification

fundamental estimation, fusion, filtering, and prediction methods

Awareness of difficulties and challenges

Derivation and application of exact estimators for

discrete-valued systems

linear continuous-valued systems

Derivation and application of approximate estimators for

weakly nonlinear systems

Structure

Discrete-valued systems

Static systems

Dynamic systems: Markov chain, measurement model

State estimation in the hidden Markov model

Continuous-valued linear systems

Static systems

Dynamic systems: system model with Markov property, measurement model

State estimation: Kalman filter

Continuous-valued and weakly nonlinear systems

Static systems

Dynamic systems

Nonlinear estimation via linearization (EKF)

Nonlinear estimation via the Kalman filter in probabilistic form

Computation of the moments: analytically, numerically, based on samples (UKF)

Ensemble Kalman filter (EnKF)

General systems

Dirac delta function

Functions of random variables

Probabilistic system models, abstraction

Prediction for nonlinear systems

Filter step for nonlinear systems

Factor graphs and message passing

Simple filters for strongly nonlinear systems

Sample-based filters

Reapproximation of continuous densities with samples

Particle filter

Progressive filtering

Empirical Moments of Random and Deterministic Samples

Wed, 10 Aug 2022 00:00:00 +0000

Generating samples
$$ y_i = m + \sigma \cdot w_i \quad w_i \sim \mathcal{N}(0, 1) \quad i = 1, \dots , L $$

$w_i$: base samples

$m$: mean

Check:
$$ \begin{aligned} &E\left\{y_{i}\right\}=E\left\{m+\sigma \cdot w_{i}\right\}=E\{m\}+\sigma \cdot E\left\{w_{i}\right\}=m \\ &E\left\{\left(y_{i}-m\right)^{2}\right\}=E\left\{\left(\sigma \cdot w_{i}\right)^{2}\right\}=\sigma^{2} E\left\{w_{i}^{2}\right\}=\sigma^{2} \end{aligned} $$
Empirical estimators
$$ \hat{m}=\frac{1}{L} \sum_{i=1}^{L} y_{i} $$ $$ \begin{aligned} \hat{c}=\hat{\sigma}^{2} &=\frac{1}{L} \sum_{i=1}^{L}\left(y_{i}-\hat{m}\right)^{2} \\ &=\frac{1}{L} \sum_{i=1}^{L}\left(y_{i}^{2}-2 \hat{m} y_{i}+\hat{m}^{2}\right) \\ &=\frac{1}{L} \sum_{i=1}^{L} y_{i}^{2}-\left(\frac{1}{L} \sum_{i=1}^{L} y_{i}\right)^{2} \\ &=\frac{1}{L} \sum_{i=1}^{L} y_{i}^{2}-\frac{1}{L^{2}} \sum_{i=1}^{L} \sum_{j=1}^{L} y_{i} y_{j} \end{aligned} $$
Verification

Mean
$$ \begin{aligned} E\{\hat{m}\} &=E\left\{\frac{1}{L} \sum_{i=1}^{L}\left(m+\sigma w_{i}\right)\right\} \\ &=\frac{1}{L} \sum_{i=1}^{L} E\left\{m+\sigma w_{i}\right\} \\ &=m \quad ✅ \end{aligned} $$

Variance
$$ \begin{aligned} E\{\hat{c}\}&=E\left\{\frac{1}{L} \sum_{i=1}^{L}\left(m+\sigma w_{i}\right)^{2}-\frac{1}{L^{2}} \sum_{i=1}^{L} \sum_{j=1}^{L}\left(m+\sigma w_{i}\right)\left(m+\sigma w_{j}\right)\right\}\\ &=\frac{1}{L} \sum_{i=1}^{L} E\left\{m^{2}+2 m \sigma w_{i}+\sigma^{2} w_{i}^{2}\right\}- \\ & \qquad\frac{1}{L^{2}} \sum_{i=1}^{L} \sum_{j=1}^{L} E\left\{m^{2}+m \sigma w_{i}+m \sigma w_{j}+\sigma^{2} w_{i} w_{j}\right\}\\ &=\frac{1}{L} \sum_{i=1}^{L}\left(m^{2}+\sigma^{2}\right)-\frac{1}{L^{2}} \sum_{i=1}^{L} \sum_{j=1}^{L}\left(m^{2}+\sigma^{2} E\left\{w_{i} w_{j}\right\}\right)\\ &=m^{2}+\sigma^{2}-m^{2}-\frac{1}{L^{2}} \cdot L \cdot \sigma^{2} \quad \mid E\{w_i w_j\} = 1 \text{ for } i = j, \text{ else } 0\\ &=\sigma^{2}-\frac{1}{L} \sigma^{2}\\ &=\frac{L-1}{L} \cdot \sigma^{2} \end{aligned} $$
The empirical variance estimator is therefore biased by the factor $\frac{L-1}{L}$ for random samples.

For deterministic samples (e.g.)
$$ \begin{aligned} &y_{1}=m-\sigma \\ &y_{2}=m+\sigma \end{aligned} $$ $$ \begin{aligned} \hat{m} &=\frac{1}{2}(m-\sigma+m+\sigma)=m \quad ✅ \\ \hat{\sigma}^{2} &=\frac{1}{2}\left[(m-\sigma)^{2}+(m+\sigma)^{2}\right]-\frac{1}{4}(m-\sigma+m+\sigma)^{2} \\ &=\frac{1}{2}\left[m^{2}-2 m \sigma+\sigma^{2}+m^{2}+2 m \sigma+\sigma^{2}\right]-m^{2} \\ &=\frac{1}{2}\left[2 m^{2}+2\sigma^{2}\right]-m^{2} \\ &=\sigma^{2} \end{aligned} $$
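
The difference between random and deterministic samples can be checked numerically. This is a minimal sketch, assuming NumPy; the function names and the choice of $m = 3$, $\sigma = 2$, $L = 4$ are illustrative only. It averages the empirical variance over many random sample sets and compares the two deterministic samples $m \pm \sigma$.

```python
import numpy as np

def empirical_moments(y):
    """Empirical mean and (uncorrected, 1/L) variance of samples y."""
    y = np.asarray(y, dtype=float)
    m_hat = y.mean()
    c_hat = np.mean(y ** 2) - m_hat ** 2
    return m_hat, c_hat

def average_variance_estimate(m, sigma, L, trials, seed=0):
    """Monte Carlo average of c_hat over many sets of L random samples."""
    rng = np.random.default_rng(seed)
    y = m + sigma * rng.standard_normal((trials, L))   # y_i = m + sigma * w_i
    return float(np.mean(np.var(y, axis=1)))           # np.var uses 1/L by default

m_det, c_det = empirical_moments([3.0 - 2.0, 3.0 + 2.0])   # deterministic samples
c_rand = average_variance_estimate(3.0, 2.0, L=4, trials=200_000)
```

The deterministic pair reproduces $m$ and $\sigma^2$ exactly, while the random average should lie close to $\frac{L-1}{L} \sigma^2 = 3$ rather than $\sigma^2 = 4$.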

Reapproximation of Densities

Wed, 10 Aug 2022 00:00:00 +0000

4 cases

Examples for reapproximation

Continuous → continuous: Gaussian mixture reduction

Continuous → discrete: deterministic sampling, i.e., replacing a continuous density with Dirac mixture

Discrete → continuous: density estimation, i.e., finding a continuous density representing a set of given samples

Discrete → discrete: Dirac mixture reduction

Challenge: Three cases involving discrete densities

Continuous → continuous case: Use standard distance measures, e.g. integral squared distance (ISD)

Discrete densities prohibit use of standard distance measures

Here we focus on continuous → discrete Reapproximation

Given: Continuous density $\tilde{f}(\underline{x})$

Deterministic sampling, i.e., approximation with Dirac mixture

Definition of Dirac mixture with $L$ components
$$ f(\underline{x})=\sum_{i=1}^{L} w_{i} \cdot \delta\left(\underline{x}-\underline{\hat{x}}_{i}\right) $$

Weights $w_{i}>0, \displaystyle \sum_{i=1}^{L} w_{i}=1$

$\underline{\hat{x}}_i$: locations / samples

🎯 Goal: Systematic approximation of given continuous density

Application examples

Mapping of random variables through nonlinear functions

Sample-based fusion and estimation (UKF)

Univariate Case (1D)

Synthesis

Instead of comparing densities $\tilde{f}(x), f(x)$, we compare cumulative distribution functions (CDFs) $\tilde{F}(x), F(x)$

CDF of $f(x)$:
$$ F(x)=P(\boldsymbol{x} \leq x)=\int_{-\infty}^{x} f(u) \mathrm{d} u $$

In 1D this definition is effectively unique: the alternative definition $\bar{F}(x)=P(\boldsymbol{x}>x)$ carries no additional information, since it follows directly from $F(x)$
$$ \begin{aligned} \bar{F}(x)&=P(\boldsymbol{x}>x) \\ &=\int_{x}^{\infty} f(u) d u \\ &=1-\int_{-\infty}^{x} f(u) d u \\ &=1-P(\boldsymbol{x} \leq x) \\ &=1-F(x) \end{aligned} $$

Example

Dirac mixture density
$$ f(x, \underline{\hat{x}})=\sum_{i=1}^{L} w_{i} \delta\left(x-\hat{x}_{i}\right) $$
Dirac mixture cumulative distribution
$$ F(x, \underline{\hat{x}})=\sum_{i=1}^{L} w_{i} \mathrm{H}\left(x-\hat{x}_{i}\right) \text { with } \mathrm{H}(x)=\int_{-\infty}^{x} \delta(t) \mathrm{d} t= \begin{cases}0 & x<0 \\ \frac{1}{2} & x=0 \\ 1 & x>0\end{cases} $$
with the vector of Dirac locations
$$ \underline{\hat{x}}=\left[\hat{x}_{1}, \hat{x}_{2}, \ldots, \hat{x}_{L}\right]^{\top} $$

CDF of $\tilde{f}(x)$ follows analogously:
$$ \tilde{F}(x)=\int_{-\infty}^{x} \tilde{f}(u) \mathrm{d} u $$

Example

Gaussian density:
$$ \tilde{f}(x)=\frac{1}{\sqrt{2 \pi}} \exp \left(-\frac{1}{2} x^{2}\right) $$
$\Rightarrow$ Gaussian cumulative distribution:
$$ \tilde{F}(x)=\frac{1}{2}\left(1+\operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right) $$

We compare $\tilde{F}(x)$ and $F(x)$ using the Cramér–von Mises distance:
$$ D(\underline{\hat{x}})=\int_{\mathbb{R}}\left(\tilde{F}(x)-F(x, \underline{\hat{x}})\right)^{2} \mathrm{~d} x $$
Minimization of Cramér–von Mises distance

Gradient of the distance measure $D(\underline{\hat{x}})$:
$$ \underline{G}(\underline{\hat{x}})=\nabla D(\underline{\hat{x}})=\frac{\partial D(\underline{\hat{x}})}{\partial \underline{\hat{x}}}=\left[\frac{\partial D(\underline{\hat{x}})}{\partial \hat{x}_{1}}, \frac{\partial D(\underline{\hat{x}})}{\partial \hat{x}_{2}}, \ldots, \frac{\partial D(\underline{\hat{x}})}{\partial \hat{x}_{L}}\right]^{\top} $$
with
$$ \frac{\partial D(\underline{\hat{x}})}{\partial \hat{x}_{i}}=2 w_{i} \int_{-\infty}^{\infty}[\tilde{F}(t)-F(t, \underline{\hat{x}})] \delta\left(t-\hat{x}_{i}\right) \mathrm{d} t $$
or
$$ \frac{\partial D(\underline{\hat{x}})}{\partial \hat{x}_{i}}=2 w_{i}\left[\tilde{F}\left(\hat{x}_{i}\right)-F\left(\hat{x}_{i}, \underline{\hat{x}}\right)\right] \text { with } F\left(\hat{x}_{i}, \underline{\hat{x}}\right)=\sum_{j=1}^{L} w_{j} \mathrm{H}\left(\hat{x}_{i}-\hat{x}_{j}\right) $$
The Hessian matrix is
$$ \mathbf{H}(\underline{\hat{x}})=\operatorname{diag}\left(\left[\frac{\partial^{2} D(\underline{\hat{x}})}{\partial \hat{x}_{1}^{2}}, \frac{\partial^{2} D(\underline{\hat{x}})}{\partial \hat{x}_{2}^{2}}, \ldots, \frac{\partial^{2} D(\underline{\hat{x}})}{\partial \hat{x}_{L}^{2}}\right]\right) $$
with
$$ \frac{\partial^{2} D(\underline{\hat{x}})}{\partial \hat{x}_{i}^{2}}=2 w_{i} \tilde{f}\left(\hat{x}_{i}\right) $$
Sorted Locations & Equal Weights

When location vector $\underline{\hat{x}}$ is sorted, i.e., $\hat{x}_{1}<\hat{x}_{2}<\ldots<\hat{x}_{L}$ , we obtain
$$ H\left(\hat{x}_{i}-\hat{x}_{j}\right)= \begin{cases} 0 & i < j \\ \frac{1}{2} & i=j \\ 1 & i > j \end{cases} $$
Cumulative distribution can be simplified
$$ F\left(\hat{x}_{i}, \underline{\hat{x}}\right)=\frac{w_{i}}{2}+\sum_{j=1}^{i-1} w_{j} $$
When samples are equally weighted (i.e. $w_i = \frac{1}{L}$), we get
$$ F(\hat{x}_{i}, \underline{\hat{x}}) = \frac{1}{2L} + \frac{i-1}{L} = \frac{2i - 1}{2L} \qquad i = 1, \dots, L $$
Analytic solutions (possible in some special cases)
$$ \tilde{F}\left(\hat{x}_{i}\right)-F\left(\hat{x}_{i}, \underline{\hat{x}}\right)=0 \Rightarrow \hat{x}_{i}=\tilde{F}^{-1}(\frac{2 i-1}{2 L}) \qquad i=1, \ldots, L $$

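E.g. for the standard Gaussian distribution, inverting $\tilde{F}(x)=\frac{1}{2}\left(1+\operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right)$ gives
$$ \hat{x}_{i}=\tilde{F}^{-1}\left(\frac{2 i-1}{2 L}\right)=\sqrt{2} \operatorname{erfinv}\left(\frac{2 i-1}{L}-1\right) $$

Example: Dirac mixture approximation (DMA) of the standard normal distribution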

With increasing number of Dirac functions, the CDF can be well approximated.
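
The analytic solution for sorted, equally weighted Diracs can be evaluated directly. This is a minimal sketch using `statistics.NormalDist` from the Python standard library for the inverse Gaussian CDF; the function name `gaussian_dirac_locations` is hypothetical.

```python
from statistics import NormalDist

def gaussian_dirac_locations(L, mean=0.0, std=1.0):
    """Locations x_i = F^{-1}((2i - 1) / (2L)) for L equally weighted Diracs."""
    dist = NormalDist(mean, std)
    return [dist.inv_cdf((2 * i - 1) / (2 * L)) for i in range(1, L + 1)]

locations = gaussian_dirac_locations(5)
```

For an odd number of components the middle Dirac sits at the mean, and the remaining locations are symmetric around it.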

General Optimization

In general: Minimum of $D(\underline{\hat{x}})$ is obtained iteratively using Newton’s method
$$ \Delta \underline{\hat{x}}=-\mathbf{H}^{-1}(\underline{\hat{x}}) \underline{G}(\underline{\hat{x}}) $$
with
$$ \underline{G}(\underline{\hat{x}})=\left[\frac{\partial D(\underline{\hat{x}})}{\partial \hat{x}_{1}}, \frac{\partial D(\underline{\hat{x}})}{\partial \hat{x}_{2}}, \ldots, \frac{\partial D(\underline{\hat{x}})}{\partial \hat{x}_{L}}\right]^{\top} $$
and
$$ \frac{\partial D(\underline{\hat{x}})}{\partial \hat{x}_{i}}=2 w_{i}\left[\tilde{F}\left(\hat{x}_{i}\right)-F\left(\hat{x}_{i}, \underline{\hat{x}}\right)\right] $$
The Hessian $\mathbf{H}(\underline{\hat{x}})$ is given by
$$ \mathbf{H}(\underline{\hat{x}})=2 \operatorname{diag}\left(\left[w_{1} \tilde{f}\left(\hat{x}_{1}\right), w_{2} \tilde{f}\left(\hat{x}_{2}\right), \ldots, w_{L} \tilde{f}\left(\hat{x}_{L}\right)\right]\right) $$
The resulting Newton step:
$$ \Delta \underline{\hat{x}}=-\left[\frac{\tilde{F}\left(\hat{x}_{1}\right)-F\left(\hat{x}_{1}, \underline{\hat{x}}\right)}{\tilde{f}\left(\hat{x}_{1}\right)}, \frac{\tilde{F}\left(\hat{x}_{2}\right)-F\left(\hat{x}_{2}, \underline{\hat{x}}\right)}{\tilde{f}\left(\hat{x}_{2}\right)}, \ldots, \frac{\tilde{F}\left(\hat{x}_{L}\right)-F\left(\hat{x}_{L}, \underline{\hat{x}}\right)}{\tilde{f}\left(\hat{x}_{L}\right)}\right]^{\top} $$
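
The Newton iteration can be sketched for a case where the analytic solution is known, so the result can be compared. The sketch below assumes a standard normal target density, equal weights $w_i = \frac{1}{L}$, and an initial guess spread over $[-1, 1]$; the function name `newton_dirac_locations` and the fixed iteration count are illustrative choices.

```python
from statistics import NormalDist

def newton_dirac_locations(L, iterations=20):
    """Minimize the Cramér-von Mises distance to N(0, 1) by Newton steps."""
    target = NormalDist()
    x = [-1.0 + 2.0 * i / (L - 1) for i in range(L)] if L > 1 else [0.0]
    for _ in range(iterations):
        x.sort()                                   # sorted locations: F(x_i) = (2i - 1) / (2L)
        x = [
            xi - (target.cdf(xi) - (2 * i + 1) / (2 * L)) / target.pdf(xi)
            for i, xi in enumerate(x)              # i is zero-based here
        ]
    return x

locations = newton_dirac_locations(5)
```

Because the weights cancel in the Newton step, each location is updated independently once the ordering is fixed.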
Extension to Multivariate Distributions

Extension to multivariate case is not trivial

Less nice properties of multivariate cumulative distributions 🤪

Distinguish several classes of multivariate methods

Methods that generalize concept of CDF

Methods that perform reduction to univariate case

Kernel-based methods

Continuous flow between density approximations

Challenge of Multivariate Cumulative Distributions

Definition for 2D:
$$ F(u, v)=\int_{-\infty}^{u} \int_{-\infty}^{v} f(x, y) \, d y \, d x $$
However, $F(u, v)$ is asymmetric and the definition is not unique.

Alternative definitions:
$$ \begin{aligned} &F(u, v)=\int_{-\infty}^{u} \int_{v}^{\infty} f(x, y) \, d y \, d x \\ &F(u, v)=\int_{u}^{\infty} \int_{v}^{\infty} f(x, y) \, d y \, d x \\ &F(u, v)=\int_{u}^{\infty} \int_{-\infty}^{v} f(x, y) \, d y \, d x \end{aligned} $$

Three of the four CDFs are independent, the fourth is dependent.

For general $N$-dimensional random vectors: $2^N$ different variants, of which $2^N - 1$ are independent

$\rightarrow$ exponentially complex!

Use in statistical tests difficult

Results differ depending on variant

Thus, we require generalization of concept of CDF. 💪

Localized Cumulative Distributions (LCDs)

Univariate case (1D)

💡 Key idea

Compare local probability masses of $\tilde{f}(x)$ and $f(x)$

Integrate over intervals at all positions $m$ and all widths $b$

Compare $\tilde{A}(m, b)$ and $A(m,b), \forall m, b$

Symmetric, unique, but redundant…

Multivariate case (2D)

Different kernels possible

Rectangular kernels

Gaussian kernels

Anisotropic vs. isotropic kernels

Separable vs. inseparable kernels

Cumulative Transformation of Densities

Given

Random vector $\underline{x} \in \mathbb{R}^N$

Probability density function $f(\underline{x}): \mathbb{R}^N \to \mathbb{R}_+$

Localized Cumulative Distribution (LCD):
$$ F(\underline{m}, b)=\int_{\mathbb{R}^{N}} f(\underline{x}) K(\underline{x}-\underline{m}, b) \mathrm{d} \underline{x} $$

$K(\cdot, \cdot)$: Kernel

$\underline{m}$: Kernel location

$b$: Kernel width

Specific kernel employed:
$$ K(\underline{x}-\underline{m}, b)=\prod_{k=1}^{N} \exp \left(-\frac{1}{2} \frac{\left(x_{k}-m_{k}\right)^{2}}{b^{2}}\right) $$

Separable (i.e. in form of product)

isotropic (i.e. same in each direction)

Gaussian

Properties of LCD:

Symmetric

Unique

Multivariate
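
For a Dirac mixture the LCD integral collapses to a sum, $F(\underline{m}, b)=\sum_{i} w_{i} \exp \left(-\frac{\|\underline{\hat{x}}_{i}-\underline{m}\|^{2}}{2 b^{2}}\right)$, because each Dirac samples the kernel at its location. A minimal sketch with NumPy, where the function name `lcd_dirac_mixture` and the array layout (one location per row) are assumptions:

```python
import numpy as np

def lcd_dirac_mixture(locations, weights, m, b):
    """LCD F(m, b) of a Dirac mixture for the separable isotropic Gaussian kernel."""
    locations = np.atleast_2d(np.asarray(locations, dtype=float))   # shape (L, N)
    weights = np.asarray(weights, dtype=float)
    sq_dist = np.sum((locations - np.asarray(m, dtype=float)) ** 2, axis=1)
    return float(np.sum(weights * np.exp(-0.5 * sq_dist / b ** 2)))

value = lcd_dirac_mixture([[-1.0, 0.0], [1.0, 0.0]], [0.5, 0.5], m=[0.0, 0.0], b=1.0)
```

Here both Diracs are at distance 1 from the kernel center, so the value is $e^{-1/2}$.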

Examples

LCD (Rectangular Kernel)

LCD (Gaussian Kernel)

Generalized Cramér–von Mises Distance (GCvD)

Given:

LCD of given continuous density $\tilde{F}(\underline{m}, b)$

LCD of Dirac mixture $F(\underline{m}, b)$

Definition:
$$ D=\int_{\mathbb{R}_{+}} w(b) \int_{\mathbb{R}^{N}}(\tilde{F}(\underline{m}, b)-F(\underline{m}, b))^{2} \mathrm{~d} \underline{m} \mathrm{~d} b $$
Minimization of GCvD:

For many Dirac components → high-dimensional optimization problem

Gradient available, Hessian more difficult

Use Quasi-Newton method: L-BFGS

Projected Cumulative Distributions (PCD)

Options for Reduction to Univariate Case

Reapproximation methods for univariate case readily available. How can we use univariate methods in multivariate case?

Solution

Approximation on principal axis of PDF

Limited to densities where principal axis can be defined

Examples: Gaussian PDF, Bingham PDF on sphere

Does not cover the entire density 🤪

Cartesian product of 1D approximation 👎

Curse of dimensionality (as very similar to grid)

Only for product densities (or rotations thereof)

Inefficient coverage

Representing PDFs by all one-dimensional projections (a.k.a Radon transform) 👍

Represent the two densities $\tilde{f}(\underline{x})$ and $f(\underline{x})$ by infinite set of one-dimensional projections

Projections onto all unit vectors $\underline{u} \in \mathbb{S}^{N-1}$

We obtain sets of projections $\tilde{f}(r \mid \underline{u})$ and $f(r \mid \underline{u})$ ($r$: the coordinate along the unit vector)

These are the Radon transforms of $\tilde{f}(\underline{x})$ and $f(\underline{x})$

We compare the sets of projections $\tilde{f}(r \mid \underline{u})$ and $f(r \mid \underline{u})$ for every $\underline{u} \in \mathbb{S}^{N-1}$

For comparison, we use the univariate cumulative distribution functions $\tilde{F}(r \mid \underline{u})$ and $F(r \mid \underline{u})$

These are unique, well defined, and easy to calculate 👏

Resulting distance measures
$$ D_{1}(\underline{u})=D(\tilde{f}(r \mid \underline{u}), f(r \mid \underline{u})) $$
depend on the projection vector $\underline{u}$

We integrate these one-dimensional distance measures $D_1(\underline{u})$ over all unit vectors $\underline{u} \in \mathbb{S}^{N-1}$

This gives multivariate distance measure $D(\tilde{f}(\underline{x}), f(\underline{x}))$

Typically a discretized subset of $\underline{u} \in \mathbb{S}^{N-1}$ is used

Distance measure minimized via univariate Newton updates

Radon Transform

Represent general $N$-dimensional probability density functions via the set of all one-dimensional projections

Linear projection of random vector $\underline{\boldsymbol{x}} \in \mathbb{R}^{N}$ to scalar random variable $\boldsymbol{r} \in \mathbb{R}$ onto line described by unit vector $\underline{u} \in \mathbb{S}^{N-1}$
$$ \boldsymbol{r} = \underline{u}^\top \underline{\boldsymbol{x}} $$

Given probability density function $f(\underline{x})$ of random vector $\underline{\boldsymbol{x}}$, density $f_r(r \mid \underline{u})$ of $\boldsymbol{r}$ is given by
$$ f_{r}(r \mid \underline{u})=\int_{\mathbb{R}^{N}} f(\underline{t}) \delta\left(r-\underline{u}^{\top} \underline{t}\right) \mathrm{d} \underline{t} $$

$f_r(r \mid \underline{u})$ is Radon transform of $f(\underline{x})$ for all $\underline{u} \in \mathbb{S}^{N-1}$

Visualization:

$u$ is the unit vector.

We project the density on $u$ and we get the projection (yellow area).

Then if we cut through the projection, it gives us the red line.

Dirac Mixture Densities
$$ f(\underline{x} \mid \hat{\mathbf{X}})=\sum_{i=1}^{L} w_{i} \delta\left(\underline{x}-\underline{\hat{x}}_{i}\right) $$
with
$$ \hat{\mathbf{X}}=\left[\underline{\hat{x}}_{1}, \underline{\hat{x}}_{2}, \ldots, \underline{\hat{x}}_{L}\right] $$
Radon transform is given by
$$ f_{r}(r \mid \underline{\hat{r}}, \underline{u})=\sum_{i=1}^{L} w_{i} \delta\left(\underline{u}^{\top} \underline{x} - \underline{u}^{\top} \underline{\hat{x}}_{i}\right)=\sum_{i=1}^{L} w_{i} \delta\left(r-\hat{r}_{i}(\underline{u})\right) $$

$\hat{r}_{i}(\underline{u})=\underline{u}^{\top} \underline{\hat{x}}_{i}, i=1, \ldots, L$ are the projected Dirac locations

Collect projected Dirac locations $\hat{r}_{i}(\underline{u})$ in vector
$$ \underline{\hat{r}}=\left[\hat{r}_{1}(\underline{u}), \hat{r}_{2}(\underline{u}), \ldots, \hat{r}_{L}(\underline{u})\right]^{\top} $$
Gaussian Densities

For a Gaussian density $f(\underline{x})$ with mean vector $\underline{\hat{x}}$ and covariance matrix $\mathbf{C}_x$, the density $f_r(r \mid \underline{u})$ resulting from the projection is also Gaussian

Its mean $\hat{r}(\underline{u})$ can simply be calculated by taking the expected value
$$ \hat{r}(\underline{u})=\mathrm{E}\{\boldsymbol{r}(\underline{u})\}=\mathrm{E}\left\{\underline{u}^{\top} \underline{\boldsymbol{x}}\right\}=\underline{u}^{\top} \underline{\hat{x}} $$

Its variance $\sigma_r^2(\underline{u})$, and hence its standard deviation $\sigma_r(\underline{u})$, is given by
$$ \sigma_{r}^{2}(\underline{u})=\mathrm{E}\left\{(\boldsymbol{r}(\underline{u})-\hat{r}(\underline{u}))^{2}\right\}=\mathrm{E}\left\{\left(\underline{u}^{\top} \underline{\boldsymbol{x}}-\underline{u}^{\top} \underline{\hat{x}}\right)^{2}\right\}=\mathrm{E}\left\{\underline{u}^{\top}(\underline{\boldsymbol{x}}-\underline{\hat{x}})(\underline{\boldsymbol{x}}-\underline{\hat{x}})^{\top} \underline{u}\right\}=\underline{u}^{\top} \mathbf{C}_{x} \underline{u} $$

Gaussian Mixture Densities

For $N$-dimensional Gaussian mixture densities $f(\underline{x})$ of the form
$$ f(\underline{x})=\sum_{i=1}^{M} w_{i} \frac{1}{\sqrt{(2 \pi)^{N}\left|\mathbf{C}_{x, i}\right|}} \exp \left(-\frac{1}{2}\left(\underline{x}-\underline{\hat{x}}_{i}\right)^{\top} \mathbf{C}_{x, i}^{-1}\left(\underline{x}-\underline{\hat{x}}_{i}\right)\right) $$
the density $f_r(r \mid \underline{u})$ is also a Gaussian mixture

Due to the linearity of the projection operator, it is given by
$$ f_{r}(r \mid \underline{u})=\sum_{i=1}^{M} w_{i} \frac{1}{\sqrt{2 \pi} \sigma_{r, i}(\underline{u})} \exp \left(-\frac{1}{2} \frac{\left(r-\hat{r}_{i}(\underline{u})\right)^{2}}{\sigma_{r, i}^{2}(\underline{u})}\right) $$
with
$$ \hat{r}_{i}(\underline{u})=\underline{u}^{\top} \underline{\hat{x}}_{i} $$
and
$$ \sigma_{r, i}(\underline{u})=\sqrt{\underline{u}^{\top} \mathbf{C}_{x, i} \underline{u}} $$
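
The projection of a Gaussian (or of each component of a Gaussian mixture) only needs the two formulas above. A minimal sketch with NumPy; the function name `project_gaussian` is hypothetical, and the direction is normalized inside the function so that any nonzero vector can be passed.

```python
import numpy as np

def project_gaussian(mean, cov, u):
    """Mean and standard deviation of r = u^T x for Gaussian x ~ N(mean, cov)."""
    u = np.asarray(u, dtype=float)
    u = u / np.linalg.norm(u)                      # make sure u is a unit vector
    r_hat = float(u @ np.asarray(mean, dtype=float))
    sigma_r = float(np.sqrt(u @ np.asarray(cov, dtype=float) @ u))
    return r_hat, sigma_r

r_hat, sigma_r = project_gaussian([1.0, 2.0], [[4.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
```

For a Gaussian mixture, the same function can be applied to every component while the weights $w_i$ stay unchanged.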

Multivariate Cramér-von Mises Distance

Multivariate distance measure between two continuous and/or discrete probability density functions

One-dimensional Projections via Radon Transform

Given density $\tilde{f}(\underline{x})$ and its approximation $f(\underline{x})$, represented by their Radon transforms $\tilde{f}(r \mid \underline{u})$ and $f(r \mid \underline{u})$ (i.e. by their 1D projections onto unit vectors $\underline{u} \in \mathbb{S}^{N-1}$)

One-dimensional Cumulative Distributions

Based on Radon transform $\tilde{f}(r \mid \underline{u})$, calculate one-dimensional cumulative distributions of the projected densities as
$$ \tilde{F}(r \mid \underline{u})=\int_{-\infty}^{r} \tilde{f}(t \mid \underline{u}) \mathrm{d} t $$
and similarly for $F(r \mid \underline{u})$

Example: For Dirac mixture approximation, cumulative distribution function of its Radon transform is given by
$$ F(r \mid \underline{\hat{r}}, \underline{u})=\sum_{i=1}^{L} w_{i} \mathrm{H}\left(r-\hat{r}_{i}(\underline{u})\right) $$

One-dimensional Distance

For comparing the one-dimensional projections, we compare their cumulative distributions $\tilde{F}(r \mid \underline{u})$ and $F(r \mid \underline{\hat{r}}, \underline{u})$ for all $\underline{u} \in \mathbb{S}^{N-1}$

As distance measure use integral squared distance
$$ D_{1}(\underline{\hat{r}}, \underline{u})=\int_{\mathbb{R}}[\tilde{F}(r \mid \underline{u})-F(r \mid \underline{\hat{r}}, \underline{u})]^{2} \mathrm{~d} r $$

Gives distance between the projected densities in the direction of the unit vector $\underline{u}$ for all $\underline{u}$

One-dimensional Newton Step

Newton step can now be written as

$$ \Delta \underline{\hat{r}}(\underline{\hat{r}}, \underline{u})=-\mathbf{H}^{-1}(\underline{\hat{r}}, \underline{u}) \underline{G}(\underline{\hat{r}}, \underline{u}) $$
with
$$ \begin{aligned} \underline{G}(\underline{\hat{r}}, \underline{u})&=\left[\frac{\partial D_{1}(\underline{\hat{r}}, \underline{u})}{\partial \hat{r}_{1}}, \frac{\partial D_{1}(\underline{\hat{r}}, \underline{u})}{\partial \hat{r}_{2}}, \ldots, \frac{\partial D_{1}(\underline{\hat{r}}, \underline{u})}{\partial \hat{r}_{L}}\right]^{\top} \\ \frac{\partial D_{1}(\underline{\hat{r}}, \underline{u})}{\partial \hat{r}_{i}}&=2 w_{i}\left[\tilde{F}\left(\hat{r}_{i} \mid \underline{u}\right)-F\left(\hat{r}_{i} \mid \underline{\hat{r}}, \underline{u}\right)\right] \end{aligned} $$

Hessian $\mathbf{H}(\underline{\hat{r}}, \underline{u})$ is given by
$$ \mathbf{H}(\underline{\hat{r}}, \underline{u})=2 \operatorname{diag}\left(\left[w_{1} \tilde{f}\left(\hat{r}_{1} \mid \underline{u}\right), w_{2} \tilde{f}\left(\hat{r}_{2} \mid \underline{u}\right), \ldots, w_{L} \tilde{f}\left(\hat{r}_{L} \mid \underline{u}\right)\right]\right) $$

$\Rightarrow$ Resulting Newton step
$$ \Delta \underline{\hat{r}}(\underline{\hat{r}}, \underline{u})=-\left[\frac{\tilde{F}\left(\hat{r}_{1} \mid \underline{u}\right)-F\left(\hat{r}_{1} \mid \underline{\hat{r}}, \underline{u}\right)}{\tilde{f}\left(\hat{r}_{1} \mid \underline{u}\right)}, \ldots, \frac{\tilde{F}\left(\hat{r}_{L} \mid \underline{u}\right)-F\left(\hat{r}_{L} \mid \underline{\hat{r}}, \underline{u}\right)}{\tilde{f}\left(\hat{r}_{L} \mid \underline{u}\right)}\right]^{\top} $$

Backprojection to $N$-dimensional space

For specific projection vector $\underline{u}$: Newton update $\Delta \underline{\hat{r}}(\underline{\hat{r}}, \underline{u})$

Backprojection into original $N$-dimensional space: Update can be used to modify original Dirac locations in direction along the vector $\underline{u}$

For every location vector $\underline{\hat{x}}_i$ we obtain
$$ \Delta \underline{\hat{x}}_{i}(\underline{u})=\Delta \hat{r}_{i}(\underline{\hat{r}}, \underline{u}) \cdot \underline{u} $$
where $\Delta \hat{r}_{i}$ is the $i$-th entry of the Newton update $\Delta \underline{\hat{r}}$

Assemble Multivariate Distance

Individual 1D distances $D_1(\underline{\hat{r}}, \underline{u})$ can be assembled to form a multivariate distance measure

Performed by integrating over all 1D distances depending on unit vector $\underline{u}$
$$ D_{N}(\hat{\mathbf{X}})=\int_{\mathbb{S}^{N-1}} D_{1}(\underline{\hat{r}}, \underline{u}) \mathrm{d} \underline{u} $$

Plugging in $D_1(\underline{\hat{r}}, \underline{u})$:
$$ D_{N}(\hat{\mathbf{X}})=\int_{\mathbb{S}^{N-1}} \int_{\mathbb{R}}[\tilde{F}(r \mid \underline{u})-F(r \mid \underline{\hat{r}}, \underline{u})]^{2} \mathrm{~d} r \mathrm{~d} \underline{u} \quad \text { with } r=\underline{u}^{\top} \cdot \underline{x} $$

Perform Full Newton Update

Full Newton update by integrating over all partial updates along projection vectors $\underline{u}$
$$ \Delta \underline{\hat{x}}_{i}=\int_{\mathbb{S}^{N-1}} \Delta \underline{\hat{x}}_{i}(\underline{u}) \mathrm{d} \underline{u} $$

Discretization of Space of Unit Vectors

In practice, the space $\mathbb{S}^{N-1}$ containing the unit vectors $\underline{u}$ has to be discretized

Two options are available for performing the discretization:

Deterministic discretization, e.g., by calculating a grid

Random discretization by drawing uniform samples from the hypersphere

For both cases: Consider $K$ samples $\underline{u}_k$

Integration reduces to summation
$$ \Delta \underline{\hat{x}}_{i} \approx \frac{1}{K} \sum_{k=1}^{K} \Delta \underline{\hat{x}}_{i}\left(\underline{u}_{k}\right) \quad \text { for } i=1,2, \ldots, L $$

Stopping criterion

Start from given initial locations of the Dirac components

Full Newton updates are performed until the maximum change over all location vectors
$$ \max _{i}\left|\Delta \underline{\hat{x}}_{i}\right| $$
falls below a given threshold

Complete Algorithm (Randomized Variant)
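
The steps of the preceding subsections (random unit vectors, projection, one-dimensional Newton step, backprojection, averaging, stopping criterion) can be combined as follows. This is only a sketch under stated assumptions, not the algorithm figure of the lecture: the target density is assumed Gaussian, all weights are $\frac{1}{L}$, and the function names, the number of directions `K`, the tolerance, and the iteration limit are illustrative. No convergence guarantee is claimed for this simplified variant.

```python
import numpy as np
from statistics import NormalDist

def pcd_newton_update(X, mean, cov, K, rng):
    """One averaged Newton update for L equally weighted Dirac locations X (L x N)."""
    L, N = X.shape
    std_normal = NormalDist()
    levels = (2 * np.arange(1, L + 1) - 1) / (2 * L)    # F at sorted, equally weighted Diracs
    delta = np.zeros_like(X)
    for _ in range(K):
        u = rng.standard_normal(N)
        u /= np.linalg.norm(u)                          # random direction on the unit sphere
        r = X @ u                                       # projected Dirac locations
        mu_r, sigma_r = u @ mean, np.sqrt(u @ cov @ u)  # projected Gaussian
        F = np.empty(L)
        F[np.argsort(r)] = levels                       # Dirac mixture CDF at each r_i
        z = (r - mu_r) / sigma_r
        F_tilde = np.array([std_normal.cdf(float(v)) for v in z])
        f_tilde = np.array([std_normal.pdf(float(v)) for v in z]) / sigma_r
        dr = -(F_tilde - F) / f_tilde                   # one-dimensional Newton step
        delta += np.outer(dr, u)                        # backprojection along u
    return delta / K

def pcd_approximate(X0, mean, cov, K=32, tol=1e-6, max_iter=200, seed=0):
    """Repeat full updates until the largest location change falls below tol."""
    rng = np.random.default_rng(seed)
    X = np.array(X0, dtype=float)
    mean, cov = np.asarray(mean, dtype=float), np.asarray(cov, dtype=float)
    for _ in range(max_iter):
        step = pcd_newton_update(X, mean, cov, K, rng)
        X += step
        if np.max(np.linalg.norm(step, axis=1)) < tol:
            break
    return X
```

In one dimension the only directions are $\pm 1$, and for a zero-mean target the update reduces to the univariate Newton step, which gives a simple way to sanity-check the code.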

Particle Filter

Sun, 14 Aug 2022 00:00:00 +0000

Naive Particle Filter

Prediction and Filter Step

Exercise sheet, Exercise 13.2

Prediction step

At the initial time (e.g. $k=0$): initial samples are given
$$ f_{k}^{e}\left(\underline{x}_{k}\right)=\sum_{i=1}^{L} w_{k}^{e, i} \cdot \delta\left(\underline{x}_{k}-\underline{\hat{x}}_{k}^{e, i}\right) \qquad w_{k}^{e, i}=\frac{1}{L}, i \in\left\{1, \ldots, L\right\} $$
Prediction using the probabilistic system model $f(\underline{x}_{k+1} \mid \underline{x}_k)$

Draw samples for time $k+1$
$$ \underline{\hat{x}}_{k+1}^{p, i} \sim f\left(\underline{x}_{k+1} \mid \underline{\hat{x}}_{k}^{e, i}\right) $$

Weights remain the same
$$ w_{k+1}^{p, i} = w_{k}^{e, i} $$

$\Rightarrow$
$$ f_{k+1}^{p}\left(\underline{x}_{k+1}\right)=\sum_{i=1}^{L} w_{k+1}^{p, i} \delta\left(\underline{x}_{k+1}-\underline{\hat{x}}_{k+1}^{p, i}\right) $$
If the system model is given in generative (written-out) form
$$ \underline{x}_{k+1} = \underline{a}_k(\underline{x}_k, \underline{w}_k) $$

Draw $\underline{w}_k^i \sim f_k^w(\cdot)$

Compute $\underline{\hat{x}}_{k+1}^{p, i}=\underline{a}_{k}\left(\underline{\hat{x}}_{k}^{e, i}, \underline{w}_{k}^{i}\right), i \in\left\{1, \ldots,L\right\}$

Filter step

Filtering using the probabilistic measurement model $f(\underline{y}_k \mid \underline{x}_k)$, if a measurement is available.

Measurement update
$$ \begin{aligned} f_{k}^{e}\left(\underline{x}_{k}\right) &\propto f\left(\underline{y}_{k} \mid \underline{x}_{k}\right) \cdot f_{k}^{p}\left(\underline{x}_{k}\right)\\ &=f\left(\underline{y}_{k} \mid \underline{x}_{k}\right) \cdot \sum_{i=1}^{L} w_{k}^{p, i} \cdot \delta\left(\underline{x}_{k}-\underline{\hat{x}}_{k}^{p, i}\right)\\ &=\sum_{i=1}^{L} \underbrace{w_{k}^{p, i} \cdot f\left(\underline{y}_{k} \mid \hat{\underline{x}}_{k}^{p, i}\right)}_{\propto w_{k}^{e, i}} \cdot \delta(\underline{x}_{k}-\underbrace{\underline{\hat{x}}_{k}^{p, i}}_{\underline{\hat{x}}_{k}^{e, i}}) \end{aligned} $$

Positions remain the same
$$ \underline{\hat{x}}_{k}^{e, i} = \underline{\hat{x}}_{k}^{p, i} $$

Weights are adapted
$$ w_{k}^{e, i} \propto w_{k}^{p, i} \cdot f\left(\underline{y}_{k} \mid \hat{\underline{x}}_{k}^{p, i}\right) $$

Normalization required
$$ w_{k}^{e, i}:=\frac{w_{k}^{e, i}}{\displaystyle \sum_{j=1}^{L} w_{k}^{e,j}} $$
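
The prediction and filter step translate almost line by line into code. The following is a minimal sketch with NumPy for a scalar state; the function names `predict` and `update`, the random-walk system model, and the Gaussian measurement likelihood are assumptions chosen for illustration.

```python
import numpy as np

def predict(particles, weights, a, draw_noise, rng):
    """Prediction step: propagate locations through a(x, w), keep the weights."""
    noise = draw_noise(rng, particles.shape[0])
    return a(particles, noise), weights

def update(particles, weights, likelihood, y):
    """Filter step: keep the locations, reweight with the likelihood, normalize."""
    new_weights = weights * likelihood(y, particles)
    return particles, new_weights / np.sum(new_weights)

# Assumed example models (not from the notes):
# x_{k+1} = x_k + w_k with w_k ~ N(0, 0.1^2), y_k = x_k + v_k with v_k ~ N(0, 0.5^2)
rng = np.random.default_rng(0)
particles = np.array([0.0, 1.0, 2.0])
weights = np.full(3, 1.0 / 3.0)

particles_p, weights_p = predict(
    particles, weights,
    a=lambda x, w: x + w,
    draw_noise=lambda rng, n: 0.1 * rng.standard_normal(n),
    rng=rng,
)
particles_e, weights_e = update(
    particles, weights,
    likelihood=lambda y, x: np.exp(-0.5 * ((y - x) / 0.5) ** 2),
    y=1.0,
)
```

The likelihood only needs to be known up to a constant factor, since the normalization removes it.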

Procedure

Weights are represented by circles.

Advantages and Disadvantages

👍 Advantages

Nonlinear system and measurement models are handled without difficulty

Adjustable accuracy: accuracy and computational cost can be balanced via the number of particles

Extremely simple implementation

👎 Disadvantages

Variance of the samples increases with each filter step

Particles die out $\rightarrow$ degeneration of the filter

Particles die out faster the more accurate the measurement is, because the likelihood is narrower ($\rightarrow$ paradox!)

Improvements

Resampling

Measure to reduce the variance of the samples

Approximation of the weighted samples by unweighted ones
$$ f_{k}^{e}\left(\underline{x}_{k}\right)=\sum_{i=1}^{L} w_{k}^{e, i} \cdot \delta\left(\underline{x}_{k}-\underline{\hat{x}}_{k}^{e, i}\right) \approx \sum_{i=1}^{L} \frac{1}{L} \delta\left(\underline{x}_{k}-\underline{\hat{\hat{x}}}_{k}^{e, i}\right) $$
$\underline{\hat{\hat{x}}}_{k}^{e, i}$ : in general not the same locations as $\underline{\hat{x}}_{k}^{e, i}$

Very simple procedure

Discard samples with small weights

Duplicate samples with large weights proportionally to $w_i$ (importance resampling)

Positions of the samples are not changed

Positions change only in the subsequent prediction step

Particle filter with resampling

Start with equally weighted samples

At $k=1$

Propagate through the prediction step. The locations change while the weights stay the same.

In the filter step, the weights change. The locations stay the same.

Samples with larger weights are replicated. Samples with very small weights are dropped.

Different resampling techniques

This gives a much clearer explanation. 👍

Given: $L$ particles with weights $w_i$

Wanted: $L$ particles with weights $\frac{1}{L}$ (equally weighted)

Note

Here, particles are only duplicated

The positions of the particles remain unchanged

Can be viewed as drawing from a categorical distribution

Roulette wheel

Resampling proportional to the weights $w_i$

Consider the cumulative distribution $F(i) = \sum_{j \leq i} w_j$

Draw a random number $u \sim \mathcal{U}(0, 1)$ $L$ times and each time select the smallest $i$ with $F(i) \geq u$

Corresponds to selection with a roulette wheel (e.g., here $L=5$)

Problem

Very small weights are not drawn in sufficient proportion.

Very large weights are favored.

Given: Set $S$ of weighted samples

Wanted: Random sample, where the probability of drawing $x_i$ is given by $w_i$

Typically done $n$ times with replacement to generate new sample set $S^\prime$

We have a roulette wheel on which the arc length of each segment is proportional to the weight.

We can think of the sampling as follows:

We pick a direction at random. If it hits the segment of $w_3$, we take sample no. 3.

We repeat this $n$ times.
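A minimal sketch of roulette-wheel resampling, assuming normalized weights and NumPy arrays; the function name `roulette_resample` is hypothetical. The wheel is represented by the cumulative sum of the weights, and each spin is one uniform random number.

```python
import numpy as np

def roulette_resample(particles, weights, rng):
    """Draw L particles with replacement, with probability proportional to weight."""
    L = len(weights)
    cdf = np.cumsum(weights)
    cdf[-1] = 1.0                      # guard against round-off (weights assumed normalized)
    u = rng.random(L)                  # L independent spins, each in [0, 1)
    # index of the first segment whose upper boundary lies above u
    idx = np.searchsorted(cdf, u, side="right")
    return particles[idx], np.full(L, 1.0 / L)

rng = np.random.default_rng(0)
particles = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
weights = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
resampled, new_weights = roulette_resample(particles, weights, rng)
```

Because every spin is independent, the number of copies of a particle varies from run to run, which is the noise discussed in the next section.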

Stochastic Universal Sampling

Exercise A13.1 (a)

So far

Strong noise $\rightarrow$ the selection varies strongly

Large weights are favored

Therefore: deterministic selection

Randomization by drawing $\epsilon \in [0, \frac{1}{L}]$ only once
$$ u_i = \frac{i}{L} - \epsilon \qquad i \in \{1, \dots, L\} $$
For $\epsilon = \frac{1}{2L}$:
$$ u_i = \frac{2i - 1}{2L} $$

Example: $L=5$

We model the roulette wheel like a wagon wheel:

We take a set of $n$ spokes that are spaced $\frac{1}{n}$ of a full rotation apart.

We put the wheel down at a random angle and read off which $w$ each spoke hits.

Compared to the roulette wheel, only a single random number is drawn, so the selection varies much less.
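The spoke picture can be sketched as follows, again assuming normalized weights; `systematic_resample` is a hypothetical name. A single offset $\epsilon$ is drawn and the $L$ spokes $u_i = \frac{i}{L} - \epsilon$ are looked up in the cumulative weights.

```python
import numpy as np

def systematic_resample(particles, weights, rng):
    """Stochastic universal sampling: one random offset, L equally spaced spokes."""
    L = len(weights)
    cdf = np.cumsum(weights)
    cdf[-1] = 1.0                            # guard against round-off
    eps = rng.uniform(0.0, 1.0 / L)          # the only random number
    u = np.arange(1, L + 1) / L - eps        # spokes u_i = i/L - eps, in (0, 1]
    # index of the first segment whose upper boundary reaches the spoke
    idx = np.searchsorted(cdf, u, side="left")
    return particles[idx], np.full(L, 1.0 / L)

rng = np.random.default_rng(0)
particles = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
weights = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
resampled, new_weights = systematic_resample(particles, weights, rng)
```

With this scheme a particle with weight $w_i$ is copied either $\lfloor L w_i \rfloor$ or $\lceil L w_i \rceil$ times, so the outcome is far less noisy than with $L$ independent spins.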

Importance Sampling

🎯 Goal: compute the integral
$$ E = \int_{\mathbb{R}^N} g(\underline{x}) f(\underline{x}) d\underline{x} $$

$E$: expected value

$g(\underline{x})$ : nonlinear function

$f(\underline{x})$ : probability density

If samples of $f(\underline{x})$ are available:
$$ E \approx \int_{\mathbb{R}^{N}} g(\underline{x}) \sum_{i=1}^{L} w_{i} \cdot \delta\left(\underline{x}-\underline{\hat{x}}_{i}\right) d \underline{x}=\sum_{i=1}^{L} w_{i} g(\underline{\hat{x}}_i) $$
But: sampling from $f(\underline{x})$ is often not possible 🤪

Remedy: a proposal distribution (a.k.a. instrumental distribution, importance distribution) $p(\underline{x})$ with
$$ \operatorname{supp}\{f(\cdot)\} \subset \operatorname{supp}\{p(\cdot)\} $$
($\operatorname{supp}$ denotes the support)

i.e., $p(\underline{x}) > 0$ wherever $f(\underline{x}) > 0$.

We have to choose $p(\underline{x})$ such that sampling from it is easy (e.g., a Gaussian density).

Substituting:
$$ E=\int_{\mathbb{R}^{N}} g(\underline{x}) \cdot \frac{p(\underline{x})}{p(\underline{x})} \cdot f(\underline{x}) d \underline{x}=\int_{\mathbb{R}^{N}} g(\underline{x}) \cdot \frac{f(\underline{x})}{p(\underline{x})} \cdot p(\underline{x}) d \underline{x} $$
Now we do not expand $f(\underline{x})$ into a Dirac mixture but $p(\underline{x})$, because we can sample from it.
$$ p(\underline{x}) \approx \sum_{i=1}^{L} w_{i} \cdot \delta\left(\underline{x}-\underline{\hat{x}}_{i}\right) $$ $$ \begin{aligned} \Rightarrow E &\approx \int_{\mathbb{R}^{N}} g(\underline{x}) \cdot \frac{f(\underline{x})}{p(\underline{x})} \cdot \sum_{i=1}^{L} w_{i} \delta\left(\underline{x}-\underline{\hat{x}}_{i}\right) d \underline{x} \\\\ &= \sum_{i=1}^{L} g(\underline{\hat{x}}_{i}) \cdot \underbrace{\frac{f(\underline{\hat{x}}_i)}{p(\underline{\hat{x}}_{i})} \cdot w_i}_{=: w_{i}^\prime} \\\\ &= \sum_{i=1}^{L} w_{i}^\prime \cdot g(\underline{\hat{x}}_{i}) \end{aligned} $$
Convergence to $E$ for $L \to \infty$

I.e., we split the expression such that we draw the samples $\underline{\hat{x}}_i$ from the proposal $p(\underline{x})$ and adjust their original weight $w_i$ by the “importance” $\frac{f(\underline{\hat{x}}_i)}{p(\underline{\hat{x}}_{i})}$.
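As a small numerical illustration of the reweighting, the sketch below estimates $E\{x^2\}$ under a standard normal density $f$ using samples from a wider Gaussian proposal $p$. The helper names (`gauss_pdf`, `importance_estimate`), the choice of $g$, $f$, $p$ and the sample count are assumptions made for this example only.

```python
import numpy as np

def gauss_pdf(x, m, sigma):
    return np.exp(-0.5 * ((x - m) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def importance_estimate(g, f_pdf, p_pdf, p_sampler, L, rng):
    """Estimate E_f{g(x)} from L samples of the proposal p."""
    x = p_sampler(rng, L)                 # samples from p, original weights w_i = 1/L
    w = f_pdf(x) / p_pdf(x) / L           # w_i' = f(x_i) / p(x_i) * w_i
    return np.sum(w * g(x))

rng = np.random.default_rng(0)
estimate = importance_estimate(
    g=lambda x: x ** 2,
    f_pdf=lambda x: gauss_pdf(x, 0.0, 1.0),             # target: N(0, 1)
    p_pdf=lambda x: gauss_pdf(x, 0.0, 2.0),             # proposal: N(0, 2^2)
    p_sampler=lambda r, n: r.normal(0.0, 2.0, size=n),
    L=200_000,
    rng=rng,
)
```

The exact value is $E\{x^2\} = 1$ for the standard normal, so `estimate` should come out close to 1. The proposal is deliberately wider than the target so that the support condition holds comfortably.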

Check for a clear explanation and visualization.

Sequential Importance Sampling

Exercise sheet, problem 13.3

Pre-positioning of samples

🎯 Goal: systematic and correct positioning of the samples at locations of high likelihood before the filter step

Use a proposal instead of the system model $f(\underline{x}_{k+1} \mid \underline{x}_k)$

💡Idea: importance sampling for $f(\underline{x}_k, \underline{x}_{k-1} \mid \underline{y}_{1:k})$ (the measurement is also taken into account)
$$ f_{k}^{e}\left(\underline{x}_{k}\right)=f\left(\underline{x}_{k} \mid \underline{y}_{1: k}\right)=\int_{\mathbb{R}^{N}} \cdots \int_{\mathbb{R}^{N}} f\left(\underline{x}_{1: k} \mid \underline{y}_{1 : k}\right) d \underline{x}_{1: k-1} $$
Proposal: $p\left(\underline{x}_{1: k} \mid \underline{y}_{1 : k}\right)$ also depends on $\underline{y}_k$.

With this:
$$ f_{k}^{e}\left(\underline{x}_{k}\right) = \int_{\mathbb{R}^{N}} \cdots \int_{\mathbb{R}^{N}} \underbrace{\frac{f\left(\underline{x}_{1: k} \mid \underline{y}_{1 : k}\right)}{p\left(\underline{x}_{1: k} \mid \underline{y}_{1 : k}\right)}}_{=: w_k^{e, i}} p\left(\underline{x}_{1: k} \mid \underline{y}_{1 : k}\right) d \underline{x}_{1: k-1} $$
Assumption: the proposal factorizes
$$ p\left(\underline{x}_{1: k} \mid \underline{y}_{1: k}\right)=p\left(\underline{x}_{k} \mid \underline{x}_{1: k - 1}, \underline{y}_{1: k}\right) \cdot p\left(\underline{x}_{1: k -1} \mid \underline{y}_{1: k - 1}\right) $$
For a given sample $\underline{\hat{x}}_{k-1}^{e, i}$ from the previous time step, draw
$$ \underline{\hat{x}}_{k}^{e, i} \sim p\left(\underline{x}_{k} \mid \hat{\underline{x}}_{k-1}^{e, i}, \underline{y}_{k}\right) $$
Now rewrite $\frac{f\left(\underline{x}_{1: k} \mid \underline{y}_{1 : k}\right)}{p\left(\underline{x}_{1: k} \mid \underline{y}_{1 : k}\right)}$

Numerator
$$ \begin{aligned} f\left(\underline{x}_{1: k} \mid \underline{y}_{1: k}\right) &\propto f\left(\underline{y}_{k} \mid \underline{x}_{1: k}, \underline{y}_{1: k - 1}\right) \cdot f\left(\underline{x}_{1: k} \mid \underline{y}_{1: k-1}\right)\\ &=f\left(\underline{y}_{k} \mid \underline{x}_{k}\right) \cdot f\left(\underline{x}_{k} \mid \underline{x}_{1:k-1}, \underline{y}_{1:k-1}\right) \cdot f\left(\underline{x}_{1:k-1} \mid \underline{y}_{1: k-1}\right)\\ &=f\left(\underline{y}_{k} \mid \underline{x}_{k}\right) \cdot f\left(\underline{x}_{k} \mid \underline{x}_{k-1}\right) \cdot f\left(\underline{x}_{1: k-1} \mid \underline{y}_{1: k-1}\right) \end{aligned} $$

Denominator
$$ p\left(\underline{x}_{1: k} \mid \underline{y}_{1: k}\right)=p\left(\underline{x}_{k} \mid \underline{x}_{1: k - 1}, \underline{y}_{1: k}\right) \cdot p\left(\underline{x}_{1: k -1} \mid \underline{y}_{1: k - 1}\right) $$

$\Rightarrow$ Weight for sample $i$:
$$ w_k^{e, i} = \frac{f\left(\underline{\hat{x}}_{1: k} \mid \underline{y}_{1 : k}\right)}{p\left(\underline{\hat{x}}_{1: k} \mid \underline{y}_{1 : k}\right)} \propto \frac{f\left(\underline{y}_{k} \mid \underline{x}_{k}^i\right) \cdot f\left(\underline{x}_{k}^i\mid \underline{x}_{k-1}^i\right)}{p\left(\underline{x}_{k}^i \mid \underline{x}_{1: k - 1}^i, \underline{y}_{1: k}\right)} \cdot \underbrace{\frac{f\left(\underline{x}_{1: k-1}^i \mid \underline{y}_{1: k-1}\right)}{p\left(\underline{x}_{1: k -1}^i \mid \underline{y}_{1: k - 1}\right)}}_{=w_{k-1}^{e, i}} $$
followed by normalization.

Special proposals

Standard proposal

Simply use the system dynamics:
$$ p\left(\underline{x}_{k} \mid \underline{x}_{k-1}, \underline{y}_{k}\right) \stackrel{!}{=} f\left(\underline{x}_{k} \mid \underline{x}_{k-1}\right) $$
This yields
$$ w_{k}^{e, i} \propto \frac{f\left(\underline{y}_{k} \mid \hat{\underline{x}}_{k}^{i}\right) \cdot f\left(\hat{\underline{x}}_{k}^{i} \mid \hat{\underline{x}}_{k-1}^{i}\right)}{p\left(\underline{\hat{x}}_{k}^{i} \mid \hat{\underline{x}}_{k-1}^{i}, \underline{y}_k\right)} \cdot w_{k-1}^{e, i}=f\left(\underline{y}_{k} \mid \hat{\underline{x}}_{k}^{i}\right) \cdot w_{k - 1}^{e, i} $$
Very simple, but NO improvement in performance 🤪

Optimal proposal

Use
$$ \begin{aligned} p\left(\underline{x}_{k} \mid \underline{x}_{k-1}, \underline{y}_{k}\right) &=f\left(\underline{x}_{k} \mid \underline{x}_{k-1}, \underline{y}_{k}\right) \\ & \propto f\left(\underline{y}_{k} \mid \underline{x}_{k}\right) \cdot f\left(\underline{x}_{k} \mid \underline{x}_{k-1}\right) \end{aligned} $$
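If the normalization constant of the proposal is ignored, this gives
$$ w_k^{e, i} = w_{k-1}^{e, i} $$
Strictly speaking, the normalization constant is $f(\underline{y}_k \mid \underline{x}_{k-1})$, so that $w_k^{e,i} \propto w_{k-1}^{e,i} \cdot f(\underline{y}_k \mid \hat{\underline{x}}_{k-1}^{i})$: the new weight does not depend on the drawn $\hat{\underline{x}}_k^{i}$.

This is called the optimal proposal

It minimizes the variance of the weights

Under the simplification above, the variance of the weights does not change

‼️ But typically we cannot sample from it $\rightarrow$ only usable in special cases.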

A simple practical filter: the SIR particle filter

Standard proposal

Resampling after every filter step

Since the weights are equal after resampling and remain unchanged in the prediction step,
$$ w_{k-1}^{e, i} = \frac{1}{L} $$
and therefore
$$ w_k^{e, i} \propto f(\underline{y}_k \mid \underline{\hat{x}}_k^i) $$

The simplest practical particle filter

Algorithm

Input

$\underline{\hat{x}}_{k-1}^{e, i}$

$w_{k-1}^{e, i} = \frac{1}{L}, i \in \{1, \dots, L\}$

For $i \in \{1, \dots, L\}$

Draw
$$ \underline{\hat{x}}_{k}^{e, i} \sim f(\underline{x}_k \mid \underline{\hat{x}}_{k-1}^{e, i}) $$

Weighting
$$ w_k^{e, i} \propto f(\underline{y}_k \mid \underline{\hat{x}}_{k}^{e, i}) $$

Normalize the weights $w_k^{e, i}$

Resampling
$$ \underline{\hat{x}}_{k}^{e, i}, \quad w_{k}^{e, i} = \frac{1}{L} \qquad i \in \{1, \dots, L\} $$
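The algorithm above can be sketched as one function per time step. The scalar random-walk system model, the Gaussian measurement model and all numerical values below are assumptions chosen only to make the example runnable; `sir_step`, `sample_transition` and `likelihood` are hypothetical names.

```python
import numpy as np

def sir_step(particles, y, sample_transition, likelihood, rng):
    """One SIR step: draw from the system model, weight, normalize, resample."""
    L = len(particles)
    predicted = sample_transition(particles, rng)    # standard proposal
    w = likelihood(y, predicted)                     # w_i ∝ f(y | x_i), since w_{k-1} = 1/L
    w = w / w.sum()                                  # normalization
    idx = rng.choice(L, size=L, replace=True, p=w)   # roulette-wheel resampling
    return predicted[idx]                            # all weights are 1/L again

def sample_transition(x, rng):
    # assumed system model: x_k = x_{k-1} + w_k, w_k ~ N(0, 0.1^2)
    return x + rng.normal(0.0, 0.1, size=x.shape)

def likelihood(y, x):
    # assumed measurement model: y_k = x_k + v_k, v_k ~ N(0, 0.5^2)
    return np.exp(-0.5 * ((y - x) / 0.5) ** 2)

rng = np.random.default_rng(1)
particles = rng.normal(0.0, 1.0, size=1000)          # equally weighted prior samples
for y in [1.0] * 20:                                 # twenty identical measurements
    particles = sir_step(particles, y, sample_transition, likelihood, rng)
```

After repeated measurements at $y = 1$, the particle cloud should concentrate around 1. Any resampling scheme from the previous sections could replace the `rng.choice` call.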

Calculation rules for Gaussians

Mon, 15 Aug 2022 00:00:00 +0000

Product of two Gaussian densities

Given
$$ \begin{aligned} f_1(x) &= \frac{1}{\sqrt{2\pi} \sigma_1} \exp\left\{-\frac{1}{2} \frac{(x - m_1)^2}{\sigma_1^2}\right\} \\ f_2(x) &= \frac{1}{\sqrt{2\pi} \sigma_2} \exp\left\{-\frac{1}{2} \frac{(x - m_2)^2}{\sigma_2^2}\right\} \end{aligned} $$
Wanted:
$$ \begin{aligned} f(x) &= \frac{1}{\sqrt{2\pi} \sigma} \exp\left\{-\frac{1}{2} \frac{(x - m)^2}{\sigma^2}\right\} \\\\ &\propto f_1(x) \cdot f_2(x) = \frac{1}{\sqrt{2\pi} \sigma_1}\frac{1}{\sqrt{2\pi} \sigma_2} \cdot e^{-\frac{1}{2} \frac{(x - m_1)^2}{\sigma_1^2}} e^{-\frac{1}{2} \frac{(x - m_2)^2}{\sigma_2^2}} \end{aligned} $$
Exponent:
$$ \begin{aligned} &\frac{\left(x-m_{1}\right)^{2}}{\sigma_{1}^{2}}+\frac{\left(x-m_{2}\right)^{2}}{\sigma_{2}^{2}} \overset{!}{=} \frac{(x-m)^{2}}{\sigma^{2}}+2C \\\\ &\frac{x^{2}-2 m_{1} x+m_{1}^{2}}{\sigma_{1}^{2}}+\frac{x^{2}-2 m_{2} x+m_{2}^{2}}{\sigma_{2}^{2}} \stackrel{!}{=} \frac{x^{2}-2 mx+m^{2}}{\sigma^{2}}+2 C \\\\ &x^{2}\left(\frac{1}{\sigma_{1}^{2}}+\frac{1}{\sigma_{2}^{2}}-\frac{1}{\sigma^{2}}\right)-2\left(\frac{m_{1}}{\sigma_{1}^{2}}+\frac{m_{2}}{\sigma_{2}^{2}}-\frac{m}{\sigma^{2}}\right) \cdot x +\frac{m_{1}^{2}}{\sigma_{1}^{2}}+\frac{m_{2}^{2}}{\sigma_{2}^{2}}-\frac{m^{2}}{\sigma^{2}}-2 C \stackrel{!}{=} 0 \end{aligned} $$
Result:
$$ \begin{aligned} \sigma^{2}&=\frac{1}{\frac{1}{\sigma_{1}^{2}}+\frac{1}{\sigma_{2}^{2}}}=\frac{\sigma_{1}^{2} \sigma_{2}^{2}}{\sigma_{1}^{2}+\sigma_{2}^{2}} \\\\ m &= \sigma^2 \left(\frac{m_1}{\sigma_1^2} + \frac{m_2}{\sigma_2^2} \right)\\\\ 2C &= \frac{m_1^2}{\sigma_1^2} + \frac{m_2^2}{\sigma_2^2} - \frac{m^2}{\sigma^2} = \frac{(m_1 - m_2)^2}{\sigma_1^2 + \sigma_2^2} \end{aligned} $$
(See also: Product of Two Gaussian PDFs)

Alternative representation:
$$ \begin{aligned} f(x) &\propto \frac{1}{\sqrt{2\pi} \sigma_1}\frac{1}{\sqrt{2\pi} \sigma_2} \cdot e^{-\frac{1}{2} \frac{(m_1 - m_2)^2}{\sigma_1^2 + \sigma_2^2}} e^{-\frac{1}{2} \frac{(x - m)^2}{\sigma^2}} \\\\ &= \underbrace{\frac{1}{\sqrt{2\pi} \sqrt{\sigma_1^2 + \sigma_2^2}} e^{-\frac{1}{2} \frac{(m_1 - m_2)^2}{\sigma_1^2 + \sigma_2^2}}}_{\text{weight (normalized)}} \cdot \underbrace{\frac{1}{\sqrt{2\pi} \sigma} e^{-\frac{1}{2} \frac{(x - m)^2}{\sigma^2}}}_{\text{resulting density (normalized)}} \end{aligned} $$
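The result can be checked numerically. The sketch below computes $m$, $\sigma$ and the weight factor and compares the factorization with the pointwise product of the two densities; the function names and the example numbers are chosen freely for illustration.

```python
import numpy as np

def gauss_pdf(x, m, sigma):
    return np.exp(-0.5 * ((x - m) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def gauss_product(m1, s1, m2, s2):
    """Return mean, standard deviation and weight of f1(x) * f2(x)."""
    var = (s1 ** 2 * s2 ** 2) / (s1 ** 2 + s2 ** 2)
    m = var * (m1 / s1 ** 2 + m2 / s2 ** 2)
    # weight = Gaussian in (m1 - m2) with variance s1^2 + s2^2
    weight = gauss_pdf(m1, m2, np.sqrt(s1 ** 2 + s2 ** 2))
    return m, np.sqrt(var), weight

m, sigma, weight = gauss_product(m1=0.0, s1=1.0, m2=2.0, s2=0.5)
x = np.linspace(-3.0, 5.0, 9)
lhs = gauss_pdf(x, 0.0, 1.0) * gauss_pdf(x, 2.0, 0.5)   # pointwise product
rhs = weight * gauss_pdf(x, m, sigma)                   # weight * normalized density
```

`lhs` and `rhs` agree up to round-off, which confirms that the product of two Gaussian densities is a scaled Gaussian density.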
Decomposition of a Gaussian density

Given: Gaussian density with $m, \sigma$

Wanted: decomposition, i.e., possible values for $m_1, m_2, \sigma_1, \sigma_2$ (with $\kappa = \frac{1}{\sigma}$)
$$ \begin{aligned} \frac{1}{\sigma^2} &= \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} \\\\ \Rightarrow \kappa^2 &= \kappa_1^2 + \kappa_2^2 \\\\ \Rightarrow \kappa_1^2 &= (1 - \gamma)\kappa^2, \kappa_2^2 = \gamma \cdot \kappa^2 \qquad \gamma \in [0, 1] \end{aligned} $$ $$ m=\frac{1}{\kappa^{2}}\left((1-\gamma) \kappa^{2} \cdot m_{1}+\gamma \kappa^{2} \cdot m_{2}\right)=(1-\gamma) \cdot m_{1}+\gamma \cdot m_{2} \tag{*} $$
Obviously holds for $m_1 = m_2 = m$, but any choice of $m_1, m_2$ satisfying $(*)$ is also possible

Progressive filtering

Mon, 15 Aug 2022 00:00:00 +0000

Systematic resampling

Given: prior Dirac mixture
$$ f_{p}(\underline{x})=\sum_{i=1}^{L} w_{i}^{p} \delta(\underline{x}-\underline{\hat{x}}_i^p) $$

Filter step (Bayes)
$$ \tilde{f}_e(\underline{x}) \propto f_{p}(\underline{x}) \cdot f_{L}(\underline{x})=\sum_{i=1}^{L} \underbrace{w_{i}^{p} \cdot f_{L}\left(\hat{\underline{x}}_{i}^{p}\right)}_{w_{i}^{e}} \cdot \delta(\underline{x} - \underbrace{\underline{\hat{x}}_{i}^{p}}_{\underline{x}_{i}^{e}}) $$

Problems

If the support of $f_L(\cdot)$ is smaller than the support of $f_p(\cdot)$, many particles die out!

Positions $\underline{\hat{x}}_i^e$ die out!

Solution

Progressive processing

Reapproximation by optimization

Progressive filtering

Progressive = the filter step is not applied in a single stroke; instead, several likelihoods are applied one after another to carry out the filtering.

Effective support:
$$ \alpha_{\varepsilon}(f(\cdot))=\{x: f(x) \geqslant \varepsilon\} \qquad (\alpha-\text{cut at } \varepsilon) $$
Given: likelihood $f_L(\underline{x})$ with $\alpha_{\varepsilon}(f_L(\cdot)) \ll \alpha_{\varepsilon}(f_p(\cdot))$

Decomposition of $f_L(\underline{x})$
$$ f_L(\underline{x}) = f_L^1(\underline{x}) \cdot f_L^2(\underline{x}) \cdots f_L^k(\underline{x}) $$
Product of densities: each factor $f_L^i(\underline{x})$ is “wider” than $f_L(\underline{x})$ $\rightarrow$ its effective support is larger ($\alpha_{\varepsilon}(f_L^i(\cdot)) > \alpha_{\varepsilon}(f_L(\cdot))$)

Note: the decomposition is NOT unique.
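One possible decomposition, shown only as an illustration: assume a scalar Gaussian likelihood in $x$ with mean $\hat{y}$ and standard deviation $\sigma_v$ (these symbols and the equal split are assumptions, not taken from the lecture). Splitting it into $k$ identical factors inflates the variance of each factor by $k$:

$$ f_L(x) \propto \exp\left\{-\frac{1}{2} \frac{(x-\hat{y})^2}{\sigma_v^2}\right\} = \prod_{i=1}^{k} \exp\left\{-\frac{1}{2} \frac{(x-\hat{y})^2}{k \, \sigma_v^2}\right\} $$

Each factor is an unnormalized Gaussian with standard deviation $\sqrt{k}\,\sigma_v$, so it is wider than $f_L$, and the inverse variances add up as in the decomposition rule for Gaussian densities: $k \cdot \frac{1}{k \sigma_v^2} = \frac{1}{\sigma_v^2}$.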

With this, the filter step can be decomposed:

In each step, the current (equally weighted) Dirac mixture is reweighted

Reapproximation after every partial filter step

Algorithm:

$f_e^0 (\underline{x}) = f_p(\underline{x})$

For $i \in \{1, \dots, k\}$
$$ \begin{aligned} \tilde{f}_e^i(\underline{x}) &= f_e^{i-1}(\underline{x}) \cdot f_L^i(\underline{x}) && \text{(weighted)} \\ f_e^{i}(\underline{x}) &= \operatorname{Reapproximate}(\tilde{f}_e^i(\underline{x})) && \text{(unweighted)} \end{aligned} $$

Result: $f_e(\underline{x}) = f_e^k(\underline{x})$

Reapproximation

Goal

Given: weighted Dirac mixture
$$ \tilde{f}(\underline{x}) = \sum_{i=1}^L \tilde{w}_i \cdot \delta(\underline{x} - \underline{\tilde{x}}_i) $$

Wanted: unweighted Dirac mixture (with new positions $\underline{\hat{x}}_i$)
$$ \tilde{f}(\underline{x}) \approx f(\underline{x}) = \sum_{i=1}^L \frac{1}{L} \cdot \delta(\underline{x} - \underline{\hat{x}}_i) $$

Quality measure: distance $D(\tilde{f}(\underline{x}) , f(\underline{x}))$

But a distance between Dirac mixtures is difficult to define in the density domain 🤪 $\rightarrow$ we consider the cumulative distributions $\tilde{F}(\underline{x}), F(\underline{x})$, where $H(\cdot)$ is the Heaviside step function
$$ \begin{aligned} \tilde{F}(\underline{x}) &= \sum_{i=1}^L \tilde{w}_i \cdot H(\underline{x} - \underline{\tilde{x}}_i) \\ F(\underline{x}) &= \sum_{i=1}^L \frac{1}{L} \cdot H(\underline{x} - \underline{\hat{x}}_i) \\ \end{aligned} $$

Challenges

Minimal example: approximation of two Dirac components by a single component
$$ \begin{aligned} \tilde{F}(x) &= w_1 \cdot H(x - \tilde{x}_1) + w_2 \cdot H(x - \tilde{x}_2) \qquad w_1, w_2 > 0, w_1 + w_2 = 1 \\ F(x) &= H(x - \hat{x}) \end{aligned} $$
Cramér–von Mises distance:
$$ D=\int_{-\infty}^{\infty}[\tilde{F}(x)-F(x)]^{2} d x=\left(\hat{x}-\tilde{x}_{1}\right) \cdot w_{1}^{2}+\left(\tilde{x}_{2}-\hat{x}\right) \cdot w_{2}^{2} \quad \text{for} \quad \tilde{x}_{1} \leq \hat{x} \leq \tilde{x}_{2} $$ $$ \frac{\partial D}{\partial \hat{x}} = w_1^2 - w_2^2 $$
I.e., for $\tilde{x}_{1} \leq \hat{x} \leq \tilde{x}_{2}$ the derivative does not depend on $\hat{x}$: $D$ is linear in $\hat{x}$, so for $w_1 \neq w_2$ the minimum lies at one of the endpoints, and for $w_1 = w_2$ every $\hat{x}$ in the interval minimizes $D$ $\rightarrow$ NOT unique!

Wasserstein distance
$$ D=\int_{0}^{1}\left[\tilde{F}^{-1}(y)-F^{-1}(y)\right]^{2} d y=w_{1}\left(\hat{x}-\tilde{x}_{1}\right)^{2}+w_{2}\left(\hat{x}-\tilde{x}_{2}\right)^{2} $$ $$ \begin{aligned} &\frac{\partial D}{\partial \hat{x}}=2 w_{1}\left(\hat{x}-\tilde{x}_{1}\right)+2 w_{2}\left(\hat{x}-\tilde{x}_{2}\right) \overset{!}{=} 0 \\ &\Rightarrow \hat{x}=\frac{w_{1} \cdot \tilde{x}_{1}+w_{2} \tilde{x}_{2}}{w_{1}+w_{2}} \quad \text{(weighted mean)} \end{aligned} $$
General procedure (here: three weighted components approximated by two equally weighted ones, for $w_1 < 0.5 < w_1 + w_2$)
$$ \begin{aligned} &\hat{x}_{1}=\frac{w_{1} \cdot \tilde{x}_{1}+\left(0.5-w_{1}\right) \cdot \tilde{x}_{2}}{0.5} \\ &\hat{x}_{2}=\frac{\left(w_{1}+w_{2}-0.5\right) \tilde{x}_{2}+\left(1-w_{1}-w_{2}\right) \tilde{x}_{3}}{0.5} \end{aligned} $$
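A sketch of this procedure for the scalar case, under my reading of the formulas: each new position is the average of the inverse cumulative distribution $\tilde{F}^{-1}$ over one cell of width $\frac{1}{L}$, which is the minimizer of the Wasserstein distance above for a piecewise constant $F^{-1}$. The function name `reapproximate_1d` and the example numbers are hypothetical.

```python
import numpy as np

def reapproximate_1d(x_tilde, w_tilde, L):
    """Approximate a weighted 1D Dirac mixture by L equally weighted components."""
    order = np.argsort(x_tilde)
    x = np.asarray(x_tilde, dtype=float)[order]
    w = np.asarray(w_tilde, dtype=float)[order]
    w = w / w.sum()
    upper = np.cumsum(w)               # F~ just after each component
    lower = upper - w                  # F~ just before each component
    x_hat = np.empty(L)
    for j in range(L):
        a, b = j / L, (j + 1) / L      # probability cell of the j-th new component
        # probability mass that each old component contributes to this cell
        overlap = np.clip(np.minimum(upper, b) - np.maximum(lower, a), 0.0, None)
        x_hat[j] = L * np.sum(overlap * x)
    return x_hat

# three weighted components -> two equally weighted components
x_hat = reapproximate_1d([0.0, 1.0, 3.0], [0.3, 0.4, 0.3], L=2)
```

For this input the two formulas above give $\hat{x}_1 = \frac{0.3 \cdot 0 + 0.2 \cdot 1}{0.5} = 0.4$ and $\hat{x}_2 = \frac{0.2 \cdot 1 + 0.3 \cdot 3}{0.5} = 2.2$, which is what the function returns. With `L=1` it reduces to the weighted mean.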
Overall procedure: progressive filtering with continual reapproximation

Summary

Thu, 18 Aug 2022 00:00:00 +0000

Mindmap

Wed, 14 Sep 2022 00:00:00 +0000

General questions

Thu, 18 Aug 2022 00:00:00 +0000

Summarize the lecture in your own words

The SI lecture teaches the fundamental and formal basics of state estimation, centered on prediction and filtering.

The four types of systems covered

Explain

Name

Relationships

Differences

Limitations

Complexity of implementing the corresponding estimators

4 types of systems

Discrete-valued systems

Continuous-valued linear systems

Continuous-valued and mildly nonlinear systems

General systems

When can a 3D state be inferred from 1D measurements? What do the uncertainty ellipses then look like over time?

Definition

Induced nonlinearity

Conditional independence

Two variables $A, B$ are conditionally independent given $C$ $\Leftrightarrow$
$$ P(A, B | C) = P(A | C) P(B | C) $$
The following formulations are equivalent: $$ P(A | B,C) = P(A | C) \qquad P(B | A,C) = P(B | C) $$

State

(Lecture notes, p. 19)

The state of a dynamic system is defined as the smallest set of variables, the so called state variables, that completely determine the behavior of the system for $t \geq t_0$ given their values at $t_0$ together with the input function for $t \geq t_0$.

When modeling a system, the choice of state variables is not unique.

State variables need not exist physically, and they need not be measurable.

State estimation

Reconstruction of the internal state from measurements and inputs

Complexity of a recursion

Density function, likelihood

Distribution function (cumulative distribution function) $F_{\boldsymbol{x}}(x)$ of the random variable $\boldsymbol{x}$
$$ F_{\boldsymbol{x}}: \mathbb{R} \rightarrow[0,1] \qquad F_{\boldsymbol{x}}(x):=\mathrm{P}(\boldsymbol{x} \leq x) $$
Properties of $F_{\boldsymbol{x}}(x)$

$\lim _{x \rightarrow-\infty} F_{\boldsymbol{x}}(x)=0$

$\lim _{x \rightarrow\infty} F_{\boldsymbol{x}}(x)=1$

monotonically increasing and right-continuous.

For a continuous random variable:
$$ F_{\boldsymbol{x}}(x)=\int_{-\infty}^{x} f_{\boldsymbol{x}}(u) \mathrm{d} u $$
$f_{\boldsymbol{x}}(x)$ is called the density of $\boldsymbol{x}$.

“Density” of a discrete random variable:
$$ f_{\boldsymbol{x}}(x)=\sum_{n=1}^{\infty} \mathrm{P}\left(\boldsymbol{x}=x_{n}\right) \delta\left(x-x_{n}\right)=\sum_{n=1}^{\infty} p_{n} \delta\left(x-x_{n}\right) $$
Random variable

Exercise 1, A4, A5

A random variable is a numerical description of the outcome of a random experiment. It is a function that maps an outcome $\omega$ from a sample space $\Omega$ to the space $\mathbb{R}$ of real numbers
$$ \boldsymbol{x}=\boldsymbol{x}(\omega): \Omega \rightarrow \mathbb{R} $$
Two types

Discrete: the set of outcomes is finite or at most countably infinite

Continuous: the set of outcomes and values is uncountable.

Moments

Exercise 2, A1.1

Expected value (mean, 1st moment) of the random variable $\boldsymbol{x}$:
$$ \mathrm{E}_{f_{\boldsymbol{x}}}\{\boldsymbol{x}\}=\hat{\boldsymbol{x}}=\mu_{\boldsymbol{x}}=\int_{-\infty}^{\infty} x f_{\boldsymbol{x}}(x) \mathrm{d} x $$
$k$-th moment of the random variable $\boldsymbol{x}$: $$ \mathrm{E}_{f_{\boldsymbol{x}}}\left\{\boldsymbol{x}^{k}\right\}=\int_{-\infty}^{\infty} x^{k} f_{\boldsymbol{x}}(x) \mathrm{d} x $$

$k$-th central moment of the random variable $\boldsymbol{x}$:
$$ \mathrm{E}_{f_{\boldsymbol{x}}}\left\{\left(\boldsymbol{x}-\mathrm{E}_{f_{\boldsymbol{x}}}\{\boldsymbol{x}\}\right)^{k}\right\}=\int_{-\infty}^{\infty}\left(x-\mu_{\boldsymbol{x}}\right)^{k} f_{\boldsymbol{x}}(x) \mathrm{d} x $$
Variance $\sigma_{\boldsymbol{x}}^2$ (2nd central moment) of the random variable $\boldsymbol{x}$:
$$ \mathrm{E}_{f_{\boldsymbol{x}}}\left\{\left(\boldsymbol{x}-\mathrm{E}_{f_{\boldsymbol{x}}}\{\boldsymbol{x}\}\right)^{2}\right\}=\int_{-\infty}^{\infty}\left(x-\mu_{\boldsymbol{x}}\right)^{2} f_{\boldsymbol{x}}(x) \mathrm{d} x $$

$\sigma_{\boldsymbol{x}}$: standard deviation of the random variable $\boldsymbol{x}$

Two-dimensional random variable

Exercise 2, A2.2

Let $\underline{X}$ be a two-dimensional random variable with density $f(\underline{X})=f_{\underline{X}}\left(x_{1}, x_{2}\right)$.

Marginal density
$$ \begin{array}{l} f_{X_{1}}\left(x_{1}\right)=\int_{-\infty}^{\infty} f_{\underline{X}}\left(x_{1}, x_{2}\right) \mathrm{d} x_{2} \\ f_{X_{2}}\left(x_{2}\right)=\int_{-\infty}^{\infty} f_{\underline{X}}\left(x_{1}, x_{2}\right) \mathrm{d} x_{1} \end{array} $$
Conditional density

Conditional density of $x_1$
$$ f_{X_{1}}\left(x_{1} \mid X_{2}=x_{2}\right)=\frac{f_{\underline{X}}\left(x_{1}, x_{2}\right)}{f_{X_{2}}\left(x_{2}\right)} $$
Conditional density of $x_2$
$$ f_{X_{2}}\left(x_{2} \mid X_{1}=x_{1}\right)=\frac{f_{\underline{X}}\left(x_{1}, x_{2}\right)}{f_{X_{1}}\left(x_{1}\right)} $$
Independence and uncorrelatedness of random variables

Exercise 2, A2.3

$X, Y$ are independent $\Leftrightarrow$
$$ f_{X, Y}(x, y)=f_{X}(x) \cdot f_{Y}(y) $$
It then also holds that
$$ f_{X}(x \mid Y=y)=f_{X}(x) $$
The covariance $\sigma_{X, Y}=\operatorname{Cov}_{\boldsymbol{f}_{X, Y}}\{X, Y\}$ of $X$ and $Y$:
$$ \sigma_{X, Y}=\operatorname{Cov}_{f_{X, Y}}\{X, Y\}=\mathrm{E}\{(X-\mathrm{E}\{X\}) \cdot(Y-\mathrm{E}\{Y\})\}=\mathrm{E}\left\{\left(X-\mu_{x}\right) \cdot\left(Y-\mu_{y}\right)\right\} $$
The correlation coefficient of $X$ and $Y$:
$$ \rho_{X, Y}=\frac{\sigma_{X, Y}}{\sigma_{X} \cdot \sigma_{Y}} \in [-1, 1] $$

$\left|\rho_{X, Y}\right|=1$: $X$ and $Y$ are maximally similar

$\left|\rho_{X, Y}\right|=0$: $X$ and $Y$ are completely dissimilar (i.e., $X$ and $Y$ are uncorrelated)

Independence and uncorrelatedness (independence always implies uncorrelatedness; the converse requires a jointly normal distribution):
$$ \text{Independence} \underset{\text{+ normal distribution}}{\rightleftharpoons} \text{Uncorrelatedness} $$
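A standard counterexample shows why the normality condition is needed for the reverse direction. The choice $X \sim \mathcal{N}(0, 1)$, $Y = X^2$ is mine and does not come from the lecture:

$$ \operatorname{Cov}\{X, Y\} = \mathrm{E}\{X \cdot X^2\} - \mathrm{E}\{X\} \cdot \mathrm{E}\{X^2\} = \mathrm{E}\{X^3\} - 0 \cdot 1 = 0 $$

So $X$ and $Y$ are uncorrelated, although $Y$ is a deterministic function of $X$ and therefore clearly dependent on it. The pair $(X, Y)$ is not jointly normal, so this does not contradict the rule above.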
Expected value

Exercise 1, A7

Exercise 2, A3.4

The expected value can be interpreted as the mean of all possible values $x_n$ that a (discrete) random variable $\boldsymbol{x}$ can take. The values are weighted by their probability of occurrence $p_n$.
$$ \mathrm{E}\{\boldsymbol{x}\}=\sum_{n=1}^{N} x_{n} p_{n} $$
Continuous case:
$$ \mathrm{E}_{f_\boldsymbol{x}}\{\boldsymbol{x}\} = \int_{-\infty}^{\infty}x f_\boldsymbol{x}(x) dx $$
Expected value of a function of a random variable:
$$ \mathrm{E}_{f_{\boldsymbol{x}}}\{g(\boldsymbol{x})\}=\int_{-\infty}^{\infty} g(x) f_{\boldsymbol{x}}(x) \mathrm{d} x $$
Calculation rules:

$\mathrm{E}_{f_{X}}\{a X+b\}=a \mathrm{E}_{f_{X}}\{X\}+b$

$a$ is a constant: $E(a)=a$

$E(X \pm Y)=E(X) \pm E(Y)$

$E(XY) = E(X) E(Y)$ , if $X, Y$ are independent

Variance
$$ E_{f_X}\{(X - \mu_X)^2\} = \operatorname{Var}(X) = \sigma_X^2 $$
Calculation rules:

$\operatorname{Var}_{f_X}\{aX+b\} = a^2 \operatorname{Var}_{f_X}\{X\}$

$\operatorname{Var}_{f_{X}}\{X\}=\mathrm{E}_{f_{X}}\left\{X^{2}\right\}-\left(\mathrm{E}_{f_{X}}\{X\}\right)^{2}$

$a$ is a constant

$\operatorname{Var}_{f_X}\{a\} = 0$

$\operatorname{Var}_{f_X}\{a \pm X\} = \operatorname{Var}_{f_X}\{X\}$

$\operatorname{Cov}\{X, Y\} = E\{XY\} - \mu_X \mu_Y $

Covariance matrix

Exercise 2, A2.3

Exercise 4, A5

$$ \begin{aligned} \operatorname{Cov}_{f_{\underline{x}}}\{\underline{X}\} &=\mathrm{E}_{f_{\underline{x}}}\left\{(\underline{X}-\underline{\mu})(\underline{X}-\underline{\mu})^{\top}\right\}\\ &=\left[\begin{array}{cccc} \sigma_{X_{1}}^{2} & \sigma_{X_1 X_2} & \cdots & \sigma_{X_1 X_N} \\ \sigma_{X_2 X_1} & \sigma_{X_{2}}^{2} & \cdots & \sigma_{X_2 X_N} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{X_N X_1} & \sigma_{X_N X_2} & \cdots & \sigma_{X_{N}}^{2} \end{array}\right] \\ &=\left[\begin{array}{cccc} \sigma_{X_{1}}^{2} & \rho_{X_{1}, X_{2}} \sigma_{X_{1}} \sigma_{X_{2}} & \cdots & \rho_{X_{1}, X_{N}} \sigma_{X_{1}} \sigma_{X_{N}} \\ \rho_{X_{2}, X_{1}} \sigma_{X_{2}} \sigma_{X_{1}} & \sigma_{X_{2}}^{2} & \cdots & \rho_{X_{2}, X_{N}} \sigma_{X_{2}} \sigma_{X_{N}} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{X_{N}, X_{1}} \sigma_{X_{N}} \sigma_{X_{1}} & \rho_{X_{N}, X_{2}} \sigma_{X_{N}} \sigma_{X_{2}} & \cdots & \sigma_{X_{N}}^{2} \end{array}\right] \end{aligned} $$
Positive definite, positive semidefinite

An arbitrary (possibly symmetric or Hermitian) $n \times n$ matrix $A$ is

positive definite if, for all $x \neq 0$,
$$ x^T A x > 0 $$

positive semidefinite if, for all $x$,
$$ x^T A x \geq 0 $$

White noise

Uncertainties taken at different time steps are also independent

System properties: dynamic, static, linear, time-invariant

Static: the current output $y_k$ depends only on the current input $u_k$.

Dynamic: the current output $y_k$ depends on

the current input $u_k$

the current state $x_k$ (and thus also on past inputs)

For continuous-valued linear systems:

Exercise 5, A1

Linear
$$ \mathcal{S}\left\{\sum_{i=1}^{N} c_{i} y_{\mathrm{e} i, n}\right\}=\sum_{i=1}^{N} c_{i} \mathcal{S}\left\{y_{\mathrm{e} i, n}\right\} $$
(i.e., highest exponent $\leq 1$)

Time-invariant

The system responds to a time-shifted input signal $y_{\mathrm{e}, n-n_{0}}$ with the correspondingly time-shifted output signal $y_{\mathrm{a}, n-n_{0}}$
$$ y_{\mathrm{a}, n}=\mathcal{S}\left\{y_{\mathrm{e}, n}\right\} \quad \Longrightarrow \quad y_{\mathrm{a}, n-n_{0}}=\mathcal{S}\left\{y_{\mathrm{e}, n-n_{0}}\right\}. $$
(i.e., the system behavior does not depend on the time index)

Causality

A discrete-time system $\mathcal{S}$ is called causal if its response depends ONLY on present or past values of the input signal and NOT on future values.

Dirac function

Definition:
$$ \begin{aligned} \delta(x)&=0, \quad x \neq 0 \\ \int_{a}^b \delta(x) dx &= 1 \quad a < 0 < b \end{aligned} $$
Calculation rules

Shift (for $a < x_0 < b$)
$$ \int_{a}^{b} f(x) \delta\left(x-x_{0}\right) \mathrm{d} x=f\left(x_{0}\right) $$

Symmetry
$$ \delta(x) = \delta(-x) $$

Scaling (for $a < 0 < b$, $k \neq 0$)
$$ \int_{a}^{b} f(x) \delta(|k| x) \mathrm{d} x=\frac{1}{|k|} f(0) $$

Composition
$$ \int_{-\infty}^{\infty} f(x) \delta(g(x)) \mathrm{d} x=\sum_{i=1}^{n} \frac{f\left(x_{i}\right)}{\left|g^{\prime}\left(x_{i}\right)\right|} $$
where $g(x_i) = 0$ and $g^\prime(x_i) \neq 0$.

Resolving a composition (super important!!!)
$$ \delta(g(x)) = \sum_i \frac{1}{\left|g^\prime(x_i)\right|} \delta(x - x_i) $$
where the sum runs over all roots $x_i$ with $g(x_i) = 0$ and $g^\prime(x_i) \neq 0$.
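A short worked example of resolving a composition, with $g(x) = x^2 - a^2$ and $a > 0$ chosen for illustration. The roots are $x_{1,2} = \pm a$ with $g^\prime(\pm a) = \pm 2a$, and the absolute value of the derivative enters the denominator, as in the integral rule above:

$$ \delta(x^2 - a^2) = \frac{1}{2a}\left[\delta(x - a) + \delta(x + a)\right] \qquad \Rightarrow \qquad \int_{-\infty}^{\infty} f(x)\, \delta(x^2 - a^2)\, \mathrm{d}x = \frac{f(a) + f(-a)}{2a} $$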

Dirac Mixture
$$ f(\underline{x})=\sum_{i=1}^{L} w_{i} \delta(\underline{x}-\underline{\hat{x}}_i) $$

Discrete-valued systems

Sat, 20 Aug 2022 00:00:00 +0000

Wonham Filter

State estimation for discrete-valued systems: Wonham filter

Prediction
$$ \underline{\xi}_{k}^{p}=\mathbf{A}^{\top} \underline{\xi}_{k-1}^{e} $$

Filtering
$$ \underline{\xi}_{k}^{e} \overset{y_k = m}{=} \frac{\mathbf{B}(:, m) \odot \underline{\xi}_{k}^{p}}{\mathbf{B}(:, m)^\top \cdot \underline{\xi}_{k}^{p}} $$
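Both steps are one line each in NumPy. The sketch assumes that `A[i, j]` holds $P(x_k = j \mid x_{k-1} = i)$ and `B[i, m]` holds $P(y_k = m \mid x_k = i)$, which is how I read the formulas, with zero-based indices. The matrices and the function names are made up for the example.

```python
import numpy as np

def wonham_predict(A, xi_e):
    """Prediction: xi_p = A^T xi_e."""
    return A.T @ xi_e

def wonham_filter(B, xi_p, m):
    """Filtering for the observed measurement value y_k = m."""
    unnormalized = B[:, m] * xi_p          # element-wise (Hadamard) product
    return unnormalized / unnormalized.sum()

A = np.array([[0.9, 0.1],                  # transition matrix, rows sum to 1
              [0.2, 0.8]])
B = np.array([[0.8, 0.2],                  # measurement matrix, rows sum to 1
              [0.3, 0.7]])
xi_e = np.array([0.5, 0.5])                # previous estimate

xi_p = wonham_predict(A, xi_e)
xi_e_new = wonham_filter(B, xi_p, m=0)
```

The denominator $\mathbf{B}(:, m)^\top \underline{\xi}_k^p$ is the sum of the element-wise product, so normalizing by `unnormalized.sum()` is the same operation.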

Exercise 4, A2

Derivation

Prediction $P(x_k \mid y_{0:m}, u_{0:k-1})$ for $k > m$

Marginalize over $x_{k-1}$

Apply Bayes (product rule)
$$ P(a, b \mid c) = P(a \mid b, c) \cdot P(b \mid c) \qquad (\ast) $$

Use the Markov property

Filtering: $P\left(x_{k} \mid y_{1: k}, u_{0: k-1}\right)$

$P\left(x_{k} \mid y_{1: k}, u_{0: k-1}\right) = P(x_{k} \mid y_k, y_{1: k-1}, u_{0: k-1})$

Apply Bayes
$$ P(b \mid a, c) \cdot P(a \mid c)=P(a \mid b, c) \cdot P(b \mid c) \quad (\triangle) $$

Write in the form $\frac{\text{likelihood} \cdot \text{prediction}}{\text{normalization constant}}$
$$ P\left(x_{k} \mid y_{1: k}, u_{0: k-1}\right) = \frac{\overbrace{P\left(y_{k} \mid x_{k}\right)}^{\text{likelihood}} \cdot \overbrace{P\left(x_{k} \mid y_{1: k-1}, u_{0: k-1}\right)}^{\text{one-step prediction}}}{\underbrace{P\left(y_{k} \mid y_{1: k-1}, u_{0: k-1}\right)}_{\text{normalization constant}}} $$

Likelihood: $P\left(y_{k} \mid x_{k}\right) = \mathbf{B}(x_k, y_k)$

The prediction is obtained from the prediction step

Normalization constant

Marginalize over $x_k$

Apply Bayes (product rule)
$$ P(a, b \mid c) = P(a \mid b, c) \cdot P(b \mid c) \qquad (\ast) $$

Complexity problem when discretizing a general system

Huge memory requirement for the probability vector and the transition matrix

Continuous-valued linear systems

Mon, 22 Aug 2022 00:00:00 +0000

Kalman Filter

Prediction
$$ \underline{\hat{x}}_k^p = \mathbf{A}_{k-1}\underline{\hat{x}}_{k-1}^e + \mathbf{B}_{k-1} \underline{\hat{u}}_{k-1} $$ $$ \mathbf{C}_k^p = \mathbf{A}_{k-1} \mathbf{C}_{k-1}^e \mathbf{A}_{k-1}^\top + \mathbf{B}_{k-1} \mathbf{C}_{k-1}^w \mathbf{B}_{k-1}^\top $$
Filtering
$$ \mathbf{K}_k = \mathbf{C}_k^p \mathbf{H}_k^\top (\mathbf{C}_k^v + \mathbf{H}_k \mathbf{C}_k^p \mathbf{H}_k ^\top)^{-1} \tag{Kalman Gain} $$ $$ \underline{\hat{x}}_k^e = (\mathbf{I} - \mathbf{K}_k \mathbf{H}_k) \underline{\hat{x}}_k^p + \mathbf{K}_k \underline{\hat{y}}_k = \underline{\hat{x}}_k^p + \mathbf{K}_k(\underline{\hat{y}}_k - \mathbf{H}_k \underline{\hat{x}}_k^p) $$ $$ \mathbf{C}_k^e = (\mathbf{I} - \mathbf{K}_k\mathbf{H}_k)\mathbf{C}_k^p = \mathbf{C}_k^p - \mathbf{C}_k^p \mathbf{H}_k^\top (\mathbf{C}_k^v + \mathbf{H}_k \mathbf{C}_k^p \mathbf{H}_k ^\top)^{-1}\mathbf{H}_k \mathbf{C}_k^p $$
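The prediction and filter equations translate directly into NumPy. This is a minimal sketch with hypothetical function names and a scalar toy system. It uses an explicit matrix inverse to mirror the formulas, whereas a production implementation would typically solve a linear system instead.

```python
import numpy as np

def kf_predict(x_e, C_e, A, B, u, C_w):
    """Prediction step: propagate mean and covariance through the system model."""
    x_p = A @ x_e + B @ u
    C_p = A @ C_e @ A.T + B @ C_w @ B.T
    return x_p, C_p

def kf_update(x_p, C_p, y, H, C_v):
    """Filter step: fuse the prediction with the measurement y."""
    K = C_p @ H.T @ np.linalg.inv(C_v + H @ C_p @ H.T)   # Kalman gain
    x_e = x_p + K @ (y - H @ x_p)
    C_e = (np.eye(len(x_p)) - K @ H) @ C_p
    return x_e, C_e

# scalar toy example (all values are made up)
A = np.array([[1.0]]); B = np.array([[1.0]]); H = np.array([[1.0]])
C_w = np.array([[0.1]]); C_v = np.array([[1.0]])

x_e, C_e = kf_update(np.array([0.0]), np.array([[1.0]]), np.array([2.0]), H, C_v)
x_p, C_p = kf_predict(x_e, C_e, A, B, np.array([0.0]), C_w)
```

With equal prior and measurement variance the gain is 0.5, so the estimate lands halfway between prediction and measurement and the variance is halved. The subsequent prediction adds the system noise variance.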
Deriving the (vector-valued) Kalman filter

Prediction

System model
$$ \underline{x}_{k+1}=\mathbf{A}_{k} \cdot \underline{x}_{k}+\mathbf{B}_{k} \cdot \underbrace{\left(\underline{\tilde{u}}_{k}+\underline{w}_{k}\right)}_{\underline{u_k}} $$
Steps

Compute the expected value for $k+1$
$$ E\left\{\underline{x}_{k+1}\right\}=\mathbf{A}_{k} \cdot \underline{\hat{x}}_{k|1: m}+\mathbf{B}_{k} \tilde{\underline{u}}_{k} \qquad (+) $$

Compute the covariance matrix $\mathbf{C}_{k+1|1:m}^x$ under the assumption that state and system noise are uncorrelated
$$ \begin{aligned} \underline{x}_{k+1} &=\mathbf{A}_{k} \underline{x}_{k}+\mathbf{B}_{k} \underline{u}_{k} \\ &=\left[\begin{array}{ll} \mathbf{A}_{k} & \mathbf{B}_{k} \end{array}\right]\left[\begin{array}{c} \underline{x}_{k} \\ \underline{u}_{k} \end{array}\right] \end{aligned} $$

Berechne $\operatorname{Cov}\left\{\left[\begin{array}{c} \underline{x}_{k} \\ \underline{\tilde{u}}_{k} \end{array}\right]\right\}$
$$ \begin{aligned} \underline{x}_{k+1}-\hat{\underline{x}}_{k+1} &=\left[\begin{array}{ll} \mathbf{A}_{k} & \mathbf{B}_{k} \end{array}\right]\left[\begin{array}{c} \underline{x}_{k}-\hat{\underline{x}}_{k} \\ \underline{u}_{k}-\underline{\hat{u}}_{k} \end{array}\right] \\ &=\left[\begin{array}{ll} \mathbf{A}_{k} & \mathbf{B}_{k} \end{array}\right]\left[\begin{array}{c} \underline{x}_{k}-\underline{\hat{x}}_{k} \\ \underline{w}_{k} \end{array}\right] \end{aligned} $$ $$ \begin{aligned} \operatorname{Cov}\left\{\left[\begin{array}{c} \underline{x}_{k} \\ \underline{\tilde{u}}_{k} \end{array}\right]\right\} &=E\left\{\left[\begin{array}{c} \underline{x}_{k}-\underline{\hat{x}}_{k} \\ \underline{w}_{k} \end{array}\right]\left[\left(\underline{x}_{k}-\underline{\hat{x}}_{k}\right)^{\top} \underline{w}_{k}^{\top}\right]\right\} \\ &=\left[\begin{array}{cc} C_{k \mid 1: m}^{x} & 0 \\ 0 & C_{k}^{w} \end{array}\right] \end{aligned} $$

Substitute $\operatorname{Cov}\left\{\left[\begin{array}{c} \underline{x}_{k} \\ \underline{u}_{k} \end{array}\right]\right\}$ into the computation of $\mathbf{C}_{k+1|1:m}^x$
$$ \begin{aligned} \mathbf{C}_{k+1 \mid 1 : m}^{x} &=E\left\{\left(\underline{x}_{k+1}-\hat{\underline{x}}_{k+1}\right)\left(\underline{x}_{k+1} - \hat{\underline{x}}_{k+1}\right)^\top\right\} \\ &=\left[\begin{array}{ll} \mathbf{A}_{k} & \mathbf{B}_{k} \end{array}\right] \cdot E\left\{\left[\begin{array}{c} \underline{x}_{k}-\hat{\underline{x}}_{k} \\ \underline{w}_{k} \end{array}\right]\left[\begin{array}{c} \underline{x}_{k}-\hat{\underline{x}}_{k} \\ \underline{w}_{k} \end{array}\right]^\top\right\} \cdot\left[\begin{array}{l} \mathbf{A}_{k}^{\top} \\ \mathbf{B}_{k}^{\top} \end{array}\right] \\ &=\left[\begin{array}{ll} \mathbf{A}_{k} & \mathbf{B}_{k} \end{array}\right] \cdot\left[\begin{array}{cc} \mathbf{C}_{k \mid 1:m}^{x} & 0 \\ 0 & \mathbf{C}_{k}^{w} \end{array}\right] \cdot\left[\begin{array}{l} \mathbf{A}_{k}^{\top} \\ \mathbf{B}_{k}^{\top} \end{array}\right] \\ &=\mathbf{A}_{k} \mathbf{C}_{k \mid 1: m}^{x} \mathbf{A}_{k}^{\top}+\mathbf{B}_{k} \mathbf{C}_{k}^{w} \mathbf{B}_{k}^{\top} \qquad(++) \end{aligned} $$
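The two results $(+)$ and $(++)$ can be sketched in a few lines of NumPy. This is only an illustration: the function name `kf_predict` and the constant-velocity model used as the example are assumptions and not part of the lecture.

```python
import numpy as np

def kf_predict(x_est, C_est, A, B, u, C_w):
    """Kalman prediction: mean via (+), covariance via (++)."""
    x_pred = A @ x_est + B @ u                 # (+)
    C_pred = A @ C_est @ A.T + B @ C_w @ B.T   # (++)
    return x_pred, C_pred

# Hypothetical constant-velocity model with time step 1 (illustration only)
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
B = np.array([[0.5],
              [1.0]])
x_pred, C_pred = kf_predict(
    x_est=np.array([0.0, 1.0]),   # position 0, velocity 1
    C_est=np.eye(2),
    A=A,
    B=B,
    u=np.array([0.0]),            # known input (no acceleration)
    C_w=np.array([[0.1]]),        # covariance of the input noise w
)
```

The noise covariance enters as $\mathbf{B}_k \mathbf{C}_k^w \mathbf{B}_k^\top$ because the noise acts through the input matrix in the system model used here.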

Filtering

Measurement model
$$ \underline{y}_{k}=\mathbf{H}_{k} \cdot \underline{x}_{k}+\underline{v}_{k} $$
Steps

Write $\underline{x}_k^e$ as a linear combination of $\underline{x}_k^p$ and $\underline{y}_k$
$$ \underline{x}_{k}^e=\mathbf{K}_{k}^{(1)} \underline{x}_{k}^p+\mathbf{K}_{k}^{(2)} \underline{y}_{k} $$

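The BLUE (best linear unbiased estimator) requirement means that the estimate must be unbiased, $E\{\underline{x}_{k}^e\}=E\{\underline{x}_{k}\}$, with
$$ E\{\underline{x}_{k}^e\}=E\{\mathbf{K}_{k}^{(1)} \underline{x}_{k}^p+\mathbf{K}_{k}^{(2)} \underline{y}_{k}\} $$
For an unbiased prediction and zero-mean $\underline{v}_k$ this holds if $\mathbf{K}_{k}^{(1)}+\mathbf{K}_{k}^{(2)}\mathbf{H}_{k}=\mathbf{I}$ $\Rightarrow$
$$ \begin{aligned} \mathbf{K}_{k}^{(1)} &= \mathbf{I} - \mathbf{K}_{k}\mathbf{H}_{k} \\ \mathbf{K}_{k}^{(2)} &= \mathbf{K}_{k} \end{aligned} $$
and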
$$ \underline{x}_{k}^e=(\mathbf{I} - \mathbf{K}_{k}\mathbf{H}_{k}) \underline{x}_{k}^p+\mathbf{K}_{k} \underline{y}_{k} $$

Compute the covariance matrix $\mathbf{C}_k^e$
$$ \mathbf{C}_{k}^{e}\left(\mathbf{K}_{k}\right)=\operatorname{Cov}\{\underline{x}_k^e - \underline{x}_k\} = \left(\mathbf{I}-\mathbf{K}_{k} \mathbf{H}_{k}\right) \mathbf{C}_{k}^{p}\left(\mathbf{I}-\mathbf{K}_{k} \mathbf{H}_{k}\right)^{\top}+\mathbf{K}_{k} \mathbf{C}_{k}^{v} \mathbf{K}_{k}^{\top} $$
We look for the $\mathbf{K}_{k}$ for which the resulting estimator has MINIMAL covariance.

Reduce this to a scalar quality measure (projection onto an arbitrary unit vector $\underline{e}$)
$$ P(\mathbf{K}_{k}) = \underline{e}^\top \left( \left(\mathbf{I}-\mathbf{K}_{k} \mathbf{H}_{k}\right) \mathbf{C}_{k}^{p}\left(\mathbf{I}-\mathbf{K}_{k} \mathbf{H}_{k}\right)^{\top}+\mathbf{K}_{k} C_{k}^{v} \mathbf{K}_{k}^{\top}\right) \underline{e} $$

$\frac{\partial}{\partial \mathbf{K}_{k}} P(\mathbf{K}_{k})\overset{!}{=} 0 \Rightarrow$ $$ \mathbf{K}_k = \mathbf{C}_k^p \mathbf{H}_k^\top (\mathbf{C}_k^v + \mathbf{H}_k \mathbf{C}_k^p \mathbf{H}_k^\top)^{-1} $$

Substitute $\mathbf{K}_k$ into $\underline{x}_{k}^e$ and $\mathbf{C}_{k}^{e}$

The result is the same as for the product of two Gaussians ("Gaussian times Gaussian")
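The filter step derived above can be sketched as follows. The function name `kf_update` and the scalar example values are assumptions chosen for illustration; the covariance uses the form $\left(\mathbf{I}-\mathbf{K}_{k} \mathbf{H}_{k}\right) \mathbf{C}_{k}^{p}\left(\mathbf{I}-\mathbf{K}_{k} \mathbf{H}_{k}\right)^{\top}+\mathbf{K}_{k} \mathbf{C}_{k}^{v} \mathbf{K}_{k}^{\top}$ from the derivation.

```python
import numpy as np

def kf_update(x_pred, C_pred, y, H, C_v):
    """Kalman filter step with the gain that minimizes the posterior covariance."""
    S = C_v + H @ C_pred @ H.T                 # covariance of the predicted measurement
    K = C_pred @ H.T @ np.linalg.inv(S)        # Kalman gain
    x_est = x_pred + K @ (y - H @ x_pred)      # same as (I - K H) x_pred + K y
    I_KH = np.eye(len(x_pred)) - K @ H
    C_est = I_KH @ C_pred @ I_KH.T + K @ C_v @ K.T
    return x_est, C_est, K

# Scalar illustration: prior N(0, 1), measurement y = x + v with Var(v) = 1, measured y = 2
x_est, C_est, K = kf_update(
    x_pred=np.array([0.0]),
    C_pred=np.array([[1.0]]),
    y=np.array([2.0]),
    H=np.array([[1.0]]),
    C_v=np.array([[1.0]]),
)
```

With equal prior and measurement variance the gain is 0.5, so the estimate lands halfway between prediction and measurement and the variance is halved.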

Three quality measures for the "size" of a covariance matrix

Possible quality measures for comparing covariance matrices in general:
$$ f: \mathbb{R}^{n \times n} \to \mathbb{R}^1 $$
A function that maps a covariance matrix to a scalar, because only scalars can be compared directly.

Three quality measures

Projection onto a unit vector

Trace

Determinant
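A minimal sketch of the three measures for a small example matrix. The helper name `projection_measure` and the matrix `C` are assumptions for illustration.

```python
import numpy as np

def projection_measure(C, e):
    """Variance of the projection onto direction e: e^T C e for a unit vector e."""
    e = e / np.linalg.norm(e)      # make sure e has unit length
    return float(e @ C @ e)

# Hypothetical covariance matrix: variance 4 along the first axis, 1 along the second
C = np.array([[4.0, 0.0],
              [0.0, 1.0]])

p = projection_measure(C, np.array([1.0, 0.0]))  # variance along the first axis
t = float(np.trace(C))                           # sum of the eigenvalues
d = float(np.linalg.det(C))                      # product of the eigenvalues
```

The projection measures the uncertainty in one chosen direction only, whereas trace and determinant summarize all directions (sum and product of the eigenvalues, respectively).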

Weakly nonlinear continuous-valued systems

Wed, 24 Aug 2022 00:00:00 +0000

Linear vs. nonlinear systems

| | Linear | Nonlinear |
| --- | --- | --- |
| System model | $\underline{x}_{k+1} = \mathbf{A}_k \underline{x}_k + \mathbf{B}_k (\underline{u}_k + \underline{w}_k)$ | $\underline{x}_{k+1} = \underline{a}_k(\underline{x}_k, \underline{u}_k, \underline{w}_k)$ |
| Measurement model | $\underline{y}_{k} = \mathbf{H}_k \underline{x}_k + \underline{v}_k$ | $\underline{y}_k = \underline{h}_k (\underline{x}_k, \underline{v}_k)$ |

Extended Kalman Filter (EKF)

💡 Idea: linearize with a first-order Taylor expansion around the best available estimate, so that the (linear) Kalman filter can be applied.

System model
$$ \underline{a}_{k}\left(\underline{x}_{k}, \underline{u}_{k}\right) \approx \underbrace{\underline{a}_{k}\left(\underline{\overline{x}}_k, \underline{\overline{u}}_k\right)}_{\text{nominal part}}+\underbrace{\mathbf{A}_{k}\left(\underline{x}_k-\underline{\overline{x}}_k\right)+\mathbf{B}_{k}\left(\underline{u}_{k}-\underline{\overline{u}}_k\right)}_{\text{differential part}} $$

Measurement model
$$ \underline{h}_{k}\left(\underline{x}_{k}, \underline{v}_{k}\right) \approx \underbrace{\underline{h}_{k}\left(\underline{\bar{x}}_{k}, \underline{\bar{v}}_{k}\right)}_{\text{nominal part}}+ \underbrace{\mathbf{H}_{k} \cdot \left(\underline{x}_{k}-\underline{\bar{x}}_{k}\right)+\mathbf{L}_{k} \cdot\left(\underline{v}_{k}-\underline{\bar{v}}_{k}\right)}_{\text{differential part}} $$

Exercise 7, Task 2

Prediction

Compute the expected value by evaluating the nonlinear function
$$ \underline{\hat{x}}_{k+1}^{p}=\underline{a}_{k}\left(\underline{\hat{x}}_{k}^{e}, \hat{\underline{u}}_{k}\right) $$

Compute the covariance matrix via the linearization
$$ \mathbf{C}_{k+1}^{p} \approx \mathbf{A}_{k} \mathbf{C}_{k}^{e} \mathbf{A}_{k}^{\top}+\mathbf{C}_{k}^{w^{\prime}}=\mathbf{A}_{k} \mathbf{C}_{k}^{e} \mathbf{A}_{k}^{\top}+\mathbf{B}_{k} \mathbf{C}_{k}^{w} \mathbf{B}_{k}^{\top} $$
with the Jacobians evaluated at the current estimate
$$ \mathbf{A}_k = \left.\frac{\partial \underline{a}_{k}\left(\underline{x}_{k}, \underline{u}_{k}\right)}{\partial \underline{x}_{k}^{\top}}\right|_{\underline{x}_{k}=\underline{\hat{x}}_{k}^{e}, \underline{u}_{k}=\hat{\underline{u}}_{k}} \qquad \mathbf{B}_k = \left.\frac{\partial \underline{a}_{k}\left(\underline{x}_{k}, \underline{u}_{k}\right)}{\partial \underline{u}_{k}^{\top}}\right|_{\underline{x}_{k}=\underline{\hat{x}}_{k}^{e}, \underline{u}_{k}=\hat{\underline{u}}_{k}} $$

Filtering

Linearize around $\underline{\hat{x}}_k^p$ and $\underline{\hat{v}}_k$
$$ \mathbf{H}_{k}=\left.\frac{\partial \underline{h}_{k}\left(\underline{x}_{k}, \underline{v}_{k}\right)}{\partial \underline{x}_{k}^{\top}}\right|_{\underline{x}_{k}=\underline{\hat{x}}_{k}^{p}, \underline{v}_{k}=\underline{\hat{v}}_{k}} \qquad \mathbf{L}_{k}=\left.\frac{\partial \underline{h}_{k}\left(\underline{x}_{k}, \underline{v}_{k}\right)}{\partial \underline{v}_{k}^{\top}}\right|_{\underline{x}_{k}=\underline{\hat{x}}_{k}^{p}, \underline{v}_{k}=\underline{\hat{v}}_{k}} $$

KF filter step with the linearization
$$ \begin{aligned} \mathbf{K}_{k}&=\mathbf{C}_{k}^{p} \mathbf{H}_{k}^{\top}\left(\mathbf{L}_{k} \mathbf{C}_{k}^{v} \mathbf{L}_{k}^{\top}+\mathbf{H}_{k} \mathbf{C}_{k}^{p} \mathbf{H}_{k}^{\top}\right)^{-1} \\ \hat{\underline{x}}_{k}^{e}&=\hat{\underline{x}}_{k}^{p}+\mathbf{K}_{k}\left[\hat{\underline{y}}_{k}-\underline{h}_{k}\left(\hat{\underline{x}}_{k}^{p}, \hat{\underline{v}}_{k}\right)\right] \overset{\underline{v} \text{ zero-mean}}{=} \hat{\underline{x}}_{k}^{p}+\mathbf{K}_{k}\left[\hat{\underline{y}}_{k}-\underline{h}_{k}\left(\hat{\underline{x}}_{k}^{p}, 0\right)\right]\\ \mathbf{C}_{k}^{e}&=\mathbf{C}_{k}^{p}-\mathbf{K}_{k} \mathbf{H}_{k} \mathbf{C}_{k}^{p} = (\mathbf{I} - \mathbf{K}_{k} \mathbf{H}_{k})\mathbf{C}_{k}^{p} \end{aligned} $$
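The EKF filter step can be sketched as below, assuming zero-mean measurement noise. The function name `ekf_update`, the way the Jacobians are passed in as callables, and the quadratic measurement model $y = x^2 + v$ are all assumptions made for this illustration.

```python
import numpy as np

def ekf_update(x_pred, C_pred, y, h, H_jac, L_jac, C_v):
    """EKF filter step: linearize h around (x_pred, v = 0), then apply the KF formulas."""
    v_mean = np.zeros(C_v.shape[0])            # zero-mean measurement noise
    H = H_jac(x_pred, v_mean)                  # dh/dx at the prediction
    L = L_jac(x_pred, v_mean)                  # dh/dv at the prediction
    S = L @ C_v @ L.T + H @ C_pred @ H.T
    K = C_pred @ H.T @ np.linalg.inv(S)
    x_est = x_pred + K @ (y - h(x_pred, v_mean))   # innovation uses the nonlinear h
    C_est = (np.eye(len(x_pred)) - K @ H) @ C_pred
    return x_est, C_est

# Hypothetical scalar measurement model y = x^2 + v
def h(x, v):
    return x ** 2 + v

def H_jac(x, v):
    return np.array([[2.0 * x[0]]])

def L_jac(x, v):
    return np.array([[1.0]])

x_est, C_est = ekf_update(
    x_pred=np.array([1.0]),
    C_pred=np.array([[1.0]]),
    y=np.array([2.0]),
    h=h, H_jac=H_jac, L_jac=L_jac,
    C_v=np.array([[1.0]]),
)
```

Only the gain and the covariance use the linearization; the predicted measurement itself is computed with the original nonlinear function.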

(Linear) KF vs. EKF

| | (Linear) KF | EKF |
| --- | --- | --- |
| Prediction (mean) | $\underline{\hat{x}}_k^p = \mathbf{A}_{k-1}\underline{\hat{x}}_{k-1}^e + \mathbf{B}_{k-1} \underline{\hat{u}}_{k-1}$ | $\underline{\hat{x}}_{k+1}^{p}=\underline{a}_{k}\left(\underline{\hat{x}}_{k}^{e}, \hat{\underline{u}}_{k}\right)$ |
| Prediction (covariance) | $\mathbf{C}_k^p = \mathbf{A}_{k-1} \mathbf{C}_{k-1}^e \mathbf{A}_{k-1}^\top + \mathbf{B}_{k-1} \mathbf{C}_{k-1}^w \mathbf{B}_{k-1}^\top$ | $\mathbf{C}_{k+1}^{p} \approx \mathbf{A}_{k} \mathbf{C}_{k}^{e} \mathbf{A}_{k}^{\top}+\mathbf{C}_{k}^{w^{\prime}}=\mathbf{A}_{k} \mathbf{C}_{k}^{e} \mathbf{A}_{k}^{\top}+\mathbf{B}_{k} \mathbf{C}_{k}^{w} \mathbf{B}_{k}^{\top}$ |
| Filtering (gain) | $\mathbf{K}_k = \mathbf{C}_k^p \mathbf{H}_k^\top (\mathbf{C}_k^v + \mathbf{H}_k \mathbf{C}_k^p \mathbf{H}_k ^\top)^{-1}$ | $\mathbf{K}_{k}=\mathbf{C}_{k}^{p} \mathbf{H}_{k}^{\top}\left(\mathbf{L}_{k} \mathbf{C}_{k}^{v} \mathbf{L}_{k}^{\top}+\mathbf{H}_{k} \mathbf{C}_{k}^{p} \mathbf{H}_{k}^{\top}\right)^{-1}$ |
| Filtering (mean) | $\underline{\hat{x}}_k^e = \underline{\hat{x}}_k^p + \mathbf{K}_k(\underline{\hat{y}}_k - \mathbf{H}_k \underline{\hat{x}}_k^p)$ | $\hat{\underline{x}}_{k}^{e}=\hat{\underline{x}}_{k}^{p}+\mathbf{K}_{k}\left[\hat{\underline{y}}_{k}-\underline{h}_{k}\left(\hat{\underline{x}}_{k}^{p}, \hat{\underline{v}}_{k}\right)\right]$ |
| Filtering (covariance) | $\mathbf{C}_k^e = (\mathbf{I} - \mathbf{K}_k\mathbf{H}_k)\mathbf{C}_k^p$ | $\mathbf{C}_{k}^{e}=\mathbf{C}_{k}^{p}-\mathbf{K}_{k} \mathbf{H}_{k} \mathbf{C}_{k}^{p} = (\mathbf{I} - \mathbf{K}_{k} \mathbf{H}_{k})\mathbf{C}_{k}^{p}$ |
| Auxiliary: $\mathbf{A}_k$ | (given) | $\mathbf{A}_k = \left.\frac{\partial \underline{a}_{k}\left(\underline{x}_{k}, \underline{u}_{k}\right)}{\partial \underline{x}_{k}^{\top}}\right\vert_{\underline{x}_{k}=\underline{\hat{x}}_{k}^{e}, \underline{u}_{k}=\hat{\underline{u}}_{k}}$ |
| Auxiliary: $\mathbf{B}_k$ | (given) | $\mathbf{B}_k = \left.\frac{\partial \underline{a}_{k}\left(\underline{x}_{k}, \underline{u}_{k}\right)}{\partial \underline{u}_{k}^{\top}}\right\vert_{\underline{x}_{k}=\underline{\hat{x}}_{k}^{e}, \underline{u}_{k}=\hat{\underline{u}}_{k}}$ |
| Auxiliary: $\mathbf{H}_k$ | (given) | $\mathbf{H}_{k}=\left.\frac{\partial \underline{h}_{k}\left(\underline{x}_{k}, \underline{v}_{k}\right)}{\partial \underline{x}_{k}^{\top}}\right\vert_{\underline{x}_{k}=\underline{\hat{x}}_{k}^{p}, \underline{v}_{k}=\underline{\hat{v}}_{k}}$ |
| Auxiliary: $\mathbf{L}_k$ | (not needed) | $\mathbf{L}_{k}=\left.\frac{\partial \underline{h}_{k}\left(\underline{x}_{k}, \underline{v}_{k}\right)}{\partial \underline{v}_{k}^{\top}}\right\vert_{\underline{x}_{k}=\underline{\hat{x}}_{k}^{p}, \underline{v}_{k}=\underline{\hat{v}}_{k}}$ |

Problems

The computed posterior distribution is only good for "weak" nonlinearities

The linearization is performed around a single point only

The linearized system is in general time-variant even if the original system is time-invariant, because the linearization depends on the estimate.

Kalman filter in probabilistic form

Filtering

(Assumption: $\underline{x}_k$ and $\underline{y}_k$ are jointly Gaussian)

Define $\underline{z}:=\left[\begin{array}{l} \underline{x} \\ \underline{y} \end{array}\right]$

Compute the mean and covariance of $\underline{z}$.
$$ \underline{\mu}_z=\left[\begin{array}{l} \underline{\mu}_x \\ \underline{\mu}_y \end{array}\right]=\frac{1}{L}\sum_{i=1}^L\left[\begin{array}{l} \underline{x}_i \\ \underline{y}_i \end{array}\right], \quad \mathbf{C}_{z} = \frac{1}{L}\sum_{i=1}^L(\underline{z}_i - \underline{\mu}_z)(\underline{z}_i - \underline{\mu}_z)^\top = \left[\begin{array}{ll} \mathbf{C}_{x x} & \mathbf{C}_{x y} \\ \mathbf{C}_{y x} & \mathbf{C}_{y y} \end{array}\right] $$

Filtering in probabilistic form with the measurement $\hat{\underline{y}}$
$$ \begin{aligned} \underline{\hat{x}}_k^e &= \underline{\hat{x}}_k^p + \mathbf{C}_{xy} \mathbf{C}_{yy}^{-1} (\underline{\hat{y}} - \underline{\mu}_y) \\ \mathbf{C}_k^e &= \mathbf{C}_k^p - \mathbf{C}_{xy} \mathbf{C}_{yy}^{-1} \mathbf{C}_{yx} \end{aligned} $$
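The conditioning formulas above can be sketched directly. The function name `gaussian_condition` is an assumption; the example uses the linear model $y = x + v$ with unit variances (also an assumption) so that the result can be compared with the ordinary Kalman filter step.

```python
import numpy as np

def gaussian_condition(mu_x, mu_y, C_xx, C_xy, C_yy, y_meas):
    """Condition a joint Gaussian over z = [x; y] on the measured value of y."""
    gain = C_xy @ np.linalg.inv(C_yy)          # plays the role of the Kalman gain
    x_est = mu_x + gain @ (y_meas - mu_y)
    C_est = C_xx - gain @ C_xy.T               # C_yx = C_xy^T
    return x_est, C_est

# Linear illustration: x ~ N(0, 1), y = x + v, Var(v) = 1
# => mu_y = 0, C_xy = 1, C_yy = 1 + 1 = 2
x_est, C_est = gaussian_condition(
    mu_x=np.array([0.0]),
    mu_y=np.array([0.0]),
    C_xx=np.array([[1.0]]),
    C_xy=np.array([[1.0]]),
    C_yy=np.array([[2.0]]),
    y_meas=np.array([2.0]),
)
```

For a linear measurement model, $\mathbf{C}_{xy} = \mathbf{C}_k^p \mathbf{H}_k^\top$ and $\mathbf{C}_{yy} = \mathbf{H}_k \mathbf{C}_k^p \mathbf{H}_k^\top + \mathbf{C}_k^v$, so this reduces to the standard Kalman gain.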

Unscented Kalman Filter (UKF)

Exercise 7, Task 3

Unscented principles

The nonlinear transformation of a single point is easy

It is easy to find a point cloud whose sample mean and sample covariance match the moments of the given density.

It is easy to compute the mean and covariance of a point cloud

Example: additive noise
$$ \begin{aligned} \underline{x}_{k+1} &= \underline{a}_{k}(\underline{x}_{k}) + \underline{w}_{k} \\ \underline{y}_{k} &= \underline{h}_{k}(\underline{x}_{k}) + \underline{v}_{k} \end{aligned} $$
Prediction

Propagate the samples/particles/points
$$ \underline{x}_{k}^{p, i} = \underline{a}_{k-1}(\underline{x}_{k-1}^{e, i}) $$

Compute the mean and covariance from the samples
$$ \begin{aligned} \underline{\hat{x}}_{k}^p &= \frac{1}{L} \sum_{i=1}^L \underline{x}_{k}^{p, i} \\ \mathbf{C}_k^p &= \frac{1}{L} \sum_{i=1}^L (\underline{x}_{k}^{p, i} - \underline{\hat{x}}_{k}^p) (\underline{x}_{k}^{p, i} - \underline{\hat{x}}_{k}^p)^\top + \mathbf{C}_k^w \end{aligned} $$

Filtering

Sampling:

For the prior estimate: $2N$ or $2N+1$ samples on the principal axes for dimension $N$

Example: in the scalar case ($N=1$), 2 samples:
$$ x_1 = \mu_p + \sigma_p \quad x_2 = \mu_p - \sigma_p $$

Analogously for samples of the measurement noise

Example: in the scalar case ($N=1$), 2 samples:
$$ v_1 = \mu_v + \sigma_v \quad v_2 = \mu_v - \sigma_v $$

Propagate the points
$$ \underline{y}_{k}^{p, i} = \underline{h}_{k}(\underline{x}_{k}^{p, i}) $$
or
$$ \underline{y}_{k}^{i, j} = \underline{h}_{k}(\underline{x}_{k}^{p, i}, \underline{v}_k^j) $$

Build the joint space $\underline{z}=\left[\begin{array}{l} \underline{x} \\ \underline{y} \end{array}\right]$ (assumption: $\underline{x}_k$ and $\underline{y}_k$ are jointly Gaussian). Compute the mean and covariance of $\underline{z}$.
$$ \underline{\mu}_z=\left[\begin{array}{l} \underline{\mu}_x \\ \underline{\mu}_y \end{array}\right]=\frac{1}{L}\sum_{i=1}^L\left[\begin{array}{l} \underline{x}_i \\ \underline{y}_i \end{array}\right], \quad \mathbf{C}_{z} = \frac{1}{L}\sum_{i=1}^L(\underline{z}_i - \underline{\mu}_z)(\underline{z}_i - \underline{\mu}_z)^\top = \left[\begin{array}{ll} \mathbf{C}_{x x} & \mathbf{C}_{x y} \\ \mathbf{C}_{y x} & \mathbf{C}_{y y} \end{array}\right] $$

Filtering in probabilistic form with the measurement $\hat{\underline{y}}$
$$ \begin{aligned} \underline{\hat{x}}_k^e &= \underline{\hat{x}}_k^p + \mathbf{C}_{xy} \mathbf{C}_{yy}^{-1} (\underline{\hat{y}} - \underline{\mu}_y) \\ \mathbf{C}_k^e &= \mathbf{C}_k^p - \mathbf{C}_{xy} \mathbf{C}_{yy}^{-1} \mathbf{C}_{yx} \end{aligned} $$

Sampling

Samples lie only on the principal axes: $2N$ or $2N+1$ in total ($N$: number of dimensions)
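For the scalar case with additive measurement noise, the UKF filter step can be sketched with the two samples $\mu_p \pm \sigma_p$. The function name `ukf_update_scalar` and the measurement function $h(x) = x^2$ are assumptions for illustration; the noise variance is added to $C_{yy}$ because the noise is additive and assumed independent of the state.

```python
import math

def ukf_update_scalar(mu_p, var_p, y_meas, h, var_v):
    """Scalar UKF filter step with 2 samples and additive zero-mean measurement noise."""
    sigma_p = math.sqrt(var_p)
    xs = [mu_p + sigma_p, mu_p - sigma_p]       # samples on the (single) principal axis
    ys = [h(x) for x in xs]                     # propagate each point through h
    mu_y = sum(ys) / len(ys)
    c_xy = sum((x - mu_p) * (y - mu_y) for x, y in zip(xs, ys)) / len(xs)
    c_yy = sum((y - mu_y) ** 2 for y in ys) / len(ys) + var_v
    gain = c_xy / c_yy
    mu_e = mu_p + gain * (y_meas - mu_y)        # filtering in probabilistic form
    var_e = var_p - gain * c_xy
    return mu_e, var_e

# Hypothetical example: h(x) = x^2, prior N(1, 1), Var(v) = 1, measurement 2
mu_e, var_e = ukf_update_scalar(1.0, 1.0, 2.0, lambda x: x ** 2, 1.0)
```

No derivative of $h$ appears anywhere; the samples alone determine $\mu_y$, $C_{xy}$ and $C_{yy}$.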

Advantages of the UKF over the EKF

The UKF may reduce the linearization error of the EKF

The Jacobians do not have to be computed 👏

Analytic moments

Exercise 7, Task 4

Build the joint space $\underline{z}$
$$ z := \left[\begin{array}{l} x \\ y \end{array}\right] $$

Compute the mean of $\underline{z}$ (using higher-order moments of the Gaussian density)
$$ E\{\underline{z}\}=\left[\begin{array}{c} \hat{x}_{p} \\ E\{h(x)\} \end{array}\right] $$

Compute the difference between $h(x)$ and $E\{h(x)\}$
$$ \bar{h}(x)=h(x)-E\{h(x)\} $$

Compute $\operatorname{Cov}\{\underline{z}\}$
$$ \operatorname{Cov}\{\underline{z}\}=\left[\begin{array}{ll} \mathbf{C}_{x x} & \mathbf{C}_{x y} \\ \mathbf{C}_{y x} & \mathbf{C}_{y y} \end{array}\right]=\left[\begin{array}{cc} \sigma_{p}^{2} & E\left\{\left(x-\hat{x}_{p}\right) \bar{h}(x)\right\} \\ E\left\{\left(x-\hat{x}_{p}\right) \bar{h}(x)\right\} & E\left\{\overline{h}^{2}(x)\right\}+\sigma_{v}^{2} \end{array}\right] $$

Filtering in probabilistic form.
$$ \begin{aligned} \underline{\hat{x}}_k^e &= \underline{\hat{x}}_k^p + \mathbf{C}_{xy} \mathbf{C}_{yy}^{-1} (\underline{\hat{y}} - \underline{\mu}_y) \\ \mathbf{C}_k^e &= \mathbf{C}_k^p - \mathbf{C}_{xy} \mathbf{C}_{yy}^{-1} \mathbf{C}_{yx} \end{aligned} $$
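As a worked example of these steps, take the measurement function $h(x) = x^2$ (an assumption chosen here for illustration, not taken from the exercise) with $x \sim \mathcal{N}(\hat{x}_p, \sigma_p^2)$ and additive zero-mean noise with variance $\sigma_v^2$. Writing $d = x - \hat{x}_p$ and using the Gaussian moments $E\{d^2\} = \sigma_p^2$, $E\{d^3\} = 0$, $E\{d^4\} = 3\sigma_p^4$:

$$ \begin{aligned} E\{h(x)\} &= \hat{x}_p^2 + \sigma_p^2 \\ \bar{h}(x) &= 2 \hat{x}_p d + d^2 - \sigma_p^2 \\ \mathbf{C}_{xy} &= E\{d \cdot \bar{h}(x)\} = 2 \hat{x}_p \sigma_p^2 \\ \mathbf{C}_{yy} &= E\{\bar{h}^2(x)\} + \sigma_v^2 = 4 \hat{x}_p^2 \sigma_p^2 + 2 \sigma_p^4 + \sigma_v^2 \end{aligned} $$

Inserting these into the filter equations gives

$$ \hat{x}_e = \hat{x}_p + \frac{2 \hat{x}_p \sigma_p^2}{4 \hat{x}_p^2 \sigma_p^2 + 2 \sigma_p^4 + \sigma_v^2}\left(\hat{y} - \hat{x}_p^2 - \sigma_p^2\right), \qquad \sigma_e^2 = \sigma_p^2 - \frac{4 \hat{x}_p^2 \sigma_p^4}{4 \hat{x}_p^2 \sigma_p^2 + 2 \sigma_p^4 + \sigma_v^2} $$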

Ensemble Kalman Filter (EnKF)

Exercise 6, Task 4 (f)

💡 Represent the uncertain estimate by the spread of a point cloud.

As the "uncertain state", use $L$ $N$-dimensional vectors as samples
$$ \mathcal{X}_{k}=[\underbrace{\underline{x}_{k, 1}}_{\mathbb{R}^N}, \underline{x}_{k, 2}, \ldots, \underline{x}_{k, L}] \in \mathbb{R}^{N \times L}, \quad \mathcal{W}_{k}=\left[\underline{w}_{k, 1}, \underline{w}_{k, 2}, \ldots, \underline{w}_{k, L}\right] \in \mathbb{R}^{N \times L} $$
where the samples are written compactly as the columns of a matrix.

Prediction

Nonlinear
$$ \mathcal{X}_{k}^p = \underline{a}_{k-1}(\mathcal{X}_{k-1}^e, \underline{u}_{k-1}, \mathcal{W}_{k-1}) $$

Linear
$$ \mathcal{X}_{k}^p = \mathbf{A}_{k-1}\mathcal{X}_{k-1}^e + \mathbf{B}_{k-1}(\underline{u}_{k-1} + \mathcal{W}_{k-1}) $$

Filtering

Perform the filter step using ONLY samples

This avoids the update formulas for the covariance matrix (the uncertainties are represented purely by samples)

Steps

Compute the "predicted" measurement samples

linear
$$ \mathcal{Y}_k = \mathbf{H}_k \mathcal{X}_{k}^p + \mathcal{V}_{k} $$

nonlinear
$$ \mathcal{Y}_k = \underline{h}_k (\mathcal{X}_{k}^p, \mathcal{V}_{k}) $$

Compute the Kalman gain
$$ \begin{aligned} \mathbf{C}_{x y} &=\frac{1}{L} \sum_{i=1}^{L} \underline{x}_{k, i}^{\mathrm{p}} \cdot \underline{y}_{k, i}^{\top} \\ &=\frac{1}{L} \mathcal{X}_{k}^{\mathrm{p}} \cdot \mathcal{Y}_{k}^{\top} \\\\ \mathbf{C}_{y y} &=\frac{1}{L} \sum_{i=1}^{L} \underline{y}_{k, i} \cdot \underline{y}_{k, i}^{\top} \\ &=\frac{1}{L} \mathcal{Y}_{k} \cdot \mathcal{Y}_{k}^{\top} \\\\ \mathbf{K} &=\mathbf{C}_{x y} \cdot \mathbf{C}_{y y}^{-1} \end{aligned} $$

Filter step with the actual measurement $\underline{\hat{y}}_k$
$$ \mathcal{X}_{k}^e = \mathcal{X}_{k}^p + \mathbf{K} (\underline{\hat{y}}_k \cdot \underline{\mathbb{1}}^\top - \mathcal{Y}_k) $$
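The EnKF filter step can be sketched for a linear scalar model as follows. The function name `enkf_update`, the ensemble size and the model are assumptions for illustration. One deliberate deviation from the formulas as written above: the sketch subtracts the ensemble means before forming $\mathbf{C}_{xy}$ and $\mathbf{C}_{yy}$, which coincides with the formulas above only for zero-mean ensembles.

```python
import numpy as np

def enkf_update(X_pred, Y_pred, y_meas):
    """EnKF filter step using only samples (columns of X_pred and Y_pred)."""
    L = X_pred.shape[1]
    Xc = X_pred - X_pred.mean(axis=1, keepdims=True)   # centered state samples (assumption)
    Yc = Y_pred - Y_pred.mean(axis=1, keepdims=True)   # centered measurement samples
    C_xy = Xc @ Yc.T / L
    C_yy = Yc @ Yc.T / L
    K = C_xy @ np.linalg.inv(C_yy)
    # y_meas[:, None] broadcasts the measurement to every column, i.e. y * 1^T
    return X_pred + K @ (y_meas[:, None] - Y_pred)

rng = np.random.default_rng(0)
L = 1000
X_pred = rng.normal(0.0, 1.0, size=(1, L))   # prior samples, N(0, 1)
V = rng.normal(0.0, 1.0, size=(1, L))        # measurement noise samples, N(0, 1)
H = np.array([[1.0]])
Y_pred = H @ X_pred + V                      # "predicted" measurement samples
X_est = enkf_update(X_pred, Y_pred, np.array([2.0]))
```

For this model the exact posterior is $\mathcal{N}(1, 0.5)$, so the mean and variance of `X_est` should be close to these values up to sampling error.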

General systems

Thu, 25 Aug 2022 00:00:00 +0000

Generative and probabilistic models

For the derivations it is essential to apply the following property of the Dirac delta function:
$$ \delta (g(x)) = \sum_{i=1}^N \frac{1}{|g^\prime(x_i)|}\delta (x - x_i) $$
where the $x_i$, $i = 1, \dots, N$, are the simple zeros of $g$, i.e.

$g(x_i) = 0$

$g^\prime(x_i) \neq 0$

With additive noise

Generative model:
$$ z = a(x) + v \quad v \sim f_v(v) $$
Probabilistic model:
$$ f(z \mid x) = f_v(z - a(x)) $$
With multiplicative noise

Generative model:
$$ z = x \cdot v \quad v \sim f_v(v) $$
Probabilistic model:
$$ f(z \mid x) = \frac{1}{|x|}f_v\left(\frac{z}{x}\right) $$
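The multiplicative result follows from the Dirac property stated at the start of this section. A short derivation, assuming $x \neq 0$: marginalizing over the noise gives

$$ f(z \mid x) = \int_{\mathbb{R}} \delta(z - x \cdot v) \, f_v(v) \, \mathrm{d}v $$

Viewed as a function of $v$, $g(v) = z - x \cdot v$ has the single zero $v_1 = \frac{z}{x}$ with $|g^\prime(v_1)| = |x|$, so

$$ \delta(z - x \cdot v) = \frac{1}{|x|} \delta\left(v - \frac{z}{x}\right) \quad \Rightarrow \quad f(z \mid x) = \frac{1}{|x|} \int_{\mathbb{R}} \delta\left(v - \frac{z}{x}\right) f_v(v) \, \mathrm{d}v = \frac{1}{|x|} f_v\left(\frac{z}{x}\right) $$

The additive case works the same way with $g(v) = z - a(x) - v$, whose zero is $v_1 = z - a(x)$ with $|g^\prime(v_1)| = 1$.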
Why can this be solved exactly only for certain models?

“For the general generative model, where the noise enters the system in an arbitrary fashion.” (Script P149)

Abstraction

Prediction (forward inference)

Exercise 9, Tasks 2 and 3

Given

$f_a(a)$

$g(a)$, with $b = g(a)$

Wanted: $f_b(b)$

Chapman-Kolmogorov equation

Exercise 10, Task 1
$$ f_{k+1}^{p}\left(\underline{x}_{k+1}\right)=\int_{\mathbb{R}^{N}} \underbrace{f\left(\underline{x}_{k+1} \mid \underline{x}_{k}\right)}_{\text{transition density}} f_{k}^{e}\left(\underline{x}_{k}\right) \mathrm{d} \underline{x}_{k} $$
The derivation is simple: joint density + marginalization
$$ f\left(\underline{x}_{k+1}\right)= \int_{\mathbb{R}^{N}} f\left(\underline{x}_{k+1}, \underline{x}_{k}\right) d \underline{x}_{k}= \int_{\mathbb{R}^{N}} f\left(\underline{x}_{k+1} \mid \underline{x}_{k}\right) \cdot f\left(\underline{x}_{k}\right) d \underline{x}_{k} $$
‼️ Problem: parameter integral

The integrand depends on $\underline{x}_{k+1}$ (which in general cannot be pulled out of the integral)

The integral has to be solved for every $\underline{x}_{k+1}$

This is only feasible if an analytic solution exists

Prediction steps

Rewrite $f(b \mid a) = \delta(b - g(a))$ using
$$ \delta (g(x)) = \sum_{i=1}^N \frac{1}{|g^\prime(x_i)|}\delta (x - x_i) $$
where

$g(x_i) = 0$ (i.e. the $x_i$, $i = 1, 2, \dots, N$, are the zeros)

$g^\prime(x_i) \neq 0$

Compute $f_b(b)$ with the Chapman-Kolmogorov equation
$$ f(b) = \int f(b \mid a) f(a) da $$
and substitute the rewritten $f(b \mid a)$ from step 1. This yields the desired density $f_b(b)$ as a function of $f_a(a)$.
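As an illustration of these two steps, consider $g(a) = a^2$ (an example chosen here, not taken from the exercise). For $b > 0$, the function $b - a^2$ has the zeros $a = \pm\sqrt{b}$ with $|g^\prime| = 2\sqrt{b}$, so $f_b(b) = \frac{f_a(\sqrt{b}) + f_a(-\sqrt{b})}{2\sqrt{b}}$. The function names below are hypothetical.

```python
import math

def normal_pdf(a, mean=0.0, std=1.0):
    """Density of a Gaussian, used here as the given f_a."""
    return math.exp(-0.5 * ((a - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def density_of_square(b, f_a=normal_pdf):
    """f_b(b) for b = a^2, obtained from the Dirac property and Chapman-Kolmogorov."""
    if b <= 0.0:
        return 0.0                       # b - a^2 has no simple zero for b <= 0
    root = math.sqrt(b)
    # zeros a = +root and a = -root, each weighted by 1 / |d/da (b - a^2)| = 1 / (2 * root)
    return (f_a(root) + f_a(-root)) / (2.0 * root)

value_at_one = density_of_square(1.0)
```

For a standard normal $f_a$ this is the chi-square density with one degree of freedom; at $b = 1$ it equals $f_a(1)$ because the two symmetric zeros contribute equally and $2\sqrt{b} = 2$.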

Simplified prediction

For
$$ \underline{z} = \underline{a}(\underline{x}, \underline{w}) $$
the transition density $f(\underline{z} | \underline{x})$ can be approximated by a mixture
$$ f(\underline{z} | \underline{x}) = \sum_{i \in \mathbb{Z}} f_i^z(\underline{z}) \cdot f_i^x(\underline{x}) $$
where $f_i^z(\underline{z})$ and $f_i^x(\underline{x})$ can be arbitrary densities (e.g. Gaussian densities).

Written in terms of $\underline{x}_{k+1}$ and $\underline{x}_{k}$:
$$ f\left(\underline{x}_{k+1} \mid \underline{x}_k\right)=\sum_{i=1}^L w_k^{(i)} f_{k+1}^{(i)}\left(\underline{x}_{k+1}\right) f_k^{(i)}\left(\underline{x}_k\right) $$
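Inserting this decoupled mixture into the Chapman-Kolmogorov equation shows why it removes the parameter integral. In the short derivation below, the symbol $c_k^{(i)}$ is a name introduced here for illustration:

$$ \begin{aligned} f_{k+1}^{p}\left(\underline{x}_{k+1}\right) &= \int_{\mathbb{R}^{N}} \sum_{i=1}^L w_k^{(i)} f_{k+1}^{(i)}\left(\underline{x}_{k+1}\right) f_k^{(i)}\left(\underline{x}_k\right) f_{k}^{e}\left(\underline{x}_{k}\right) \mathrm{d} \underline{x}_{k} \\ &= \sum_{i=1}^L w_k^{(i)} \, c_k^{(i)} \, f_{k+1}^{(i)}\left(\underline{x}_{k+1}\right), \qquad c_k^{(i)} := \int_{\mathbb{R}^{N}} f_k^{(i)}\left(\underline{x}_k\right) f_{k}^{e}\left(\underline{x}_{k}\right) \mathrm{d} \underline{x}_{k} \end{aligned} $$

Each $c_k^{(i)}$ is a constant that no longer depends on $\underline{x}_{k+1}$, so the integral is solved once per component and only rescales the component weight.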
Filtering

Backward inference

For backward inference it is essential to apply Bayes' rule.
$$ f(a \mid b) = \frac{f(a, b)}{f(b)} = \frac{f(b \mid a) f(a)}{f(b)} = \underbrace{\frac{1}{f(b)}}_{\text{normalization constant}} \cdot \underbrace{f(b \mid a)}_{\text{likelihood}} \cdot \underbrace{f(a)}_{\text{prior knowledge}} $$
Concrete measurement

Exercise 9, Tasks 2 and 3

Rewrite $f_b(b \mid a) = \delta(b - g(a))$ using
$$ \delta (g(x)) = \sum_{i=1}^N \frac{1}{|g^\prime(x_i)|}\delta (x - x_i) $$
where

$g(x_i) = 0$ (i.e. the $x_i$, $i = 1, 2, \dots, N$, are the zeros)

$g^\prime(x_i) \neq 0$

Compute $f_b(b)$
$$ f_b(b) = \int f_{a, b}(a, b) da = \int f_{b}(b \mid a) f_a(a) da $$
by substituting the rewritten $f_b(b \mid a)$ from step 1

Compute $f_a(a \mid \hat{b})$ using Bayes' rule
$$ f_a(a \mid \hat{b}) = \frac{f_b(\hat{b} \mid a) f_a(a)}{f_b(\hat{b})} = \frac{\overbrace{\delta(\hat{b} - g(a))}^{\text{step 1}} f_a(a)}{\underbrace{f_b(\hat{b})}_{\text{step 2}}} $$

Uncertain measurement

Exercise 9, Task 4

Steps:

Extend the system by an additional stochastic mapping and a fixed output $\hat{z}$

Determine $f(\hat{z} \mid y)$
$$ \begin{aligned} f(\hat{z} \mid y) &= \frac{f(y \mid \hat{z})f(\hat{z})}{f(y)} \\\\ &= \frac{f(y \mid \hat{z})f(\hat{z})}{\int f(y, x) dx} \\\\ &= \frac{f(y \mid \hat{z})f(\hat{z})}{\int f(y|x)f(x) dx} \\\\ &= \frac{f(y \mid \hat{z})f(\hat{z})}{\int \delta(y - g(x)) f(x) dx} \\\\ \end{aligned} $$
and substitute the identity
$$ \delta (g(x)) = \sum_{i=1}^N \frac{1}{|g^\prime(x_i)|}\delta (x - x_i) $$
for $\delta(y - g(x))$, where

$g(x_i) = 0$ (i.e. the $x_i$, $i = 1, 2, \dots, N$, are the zeros)

$g^\prime(x_i) \neq 0$

Compute the backward inference $f(x \mid \hat{z})$
$$ \begin{aligned} f(x \mid \hat{z}) &=\frac{1}{f\left(\hat{z}\right)} \cdot f(x, \hat{z}) \quad \mid \text{marginalize over } y\\ &=\frac{1}{f(\hat{z})} \int f(x, y, \hat{z}) d y \\ &=\frac{1}{f(\hat{z})} \int f(\hat{z} \mid y, x) \cdot f(y , x) d y \quad \mid \hat{z} \text{ and } x \text{ are conditionally independent given } y\\ &=\frac{1}{f(\hat{z})} \int f(\hat{z} \mid y) \cdot f(y \mid x) \cdot f(x) d y \\ &=\frac{1}{f(\hat{z})} \int \underbrace{f(\hat{z} \mid y)}_{\text{computed in the previous step}} \cdot \underbrace{f(y \mid x)}_{\text{system model}} \cdot f(x) d y \end{aligned} $$

Difficulties of the filter step

The type of density describing the estimate changes

The density becomes more complex with every step

Simplified filtering

Simplify the likelihood $f(\underline{y} \mid \underline{x})$ by a mixture (analogous to the simplified prediction)
$$ f(\underline{y} \mid \underline{x}) = \sum_{i \in \mathbb{Z}} f_i^y(\underline{y}) f_i^x(\underline{x}) $$

Sampling

Sun, 28 Aug 2022 00:00:00 +0000

Reapproximation of densities

Approximate original continuous density with discrete Dirac Mixture
$$ f(\underline{x})=\sum_{i=1}^{L} w_{i} \cdot \delta\left(\underline{x}-\underline{\hat{x}}_{i}\right) $$

Weights $w_{i}>0, \displaystyle \sum_{i=1}^{L} w_{i}=1$

$\underline{\hat{x}}_i$: locations / samples

In univariate case (1D), compare cumulative distribution functions (CDFs) $\tilde{F}(x), F(x)$ using Cramér–von Mises distance:
$$ D(\underline{\hat{x}})=\int_{\mathbb{R}}\left(\tilde{F}(x)-F(x, \underline{\hat{x}})\right)^{2} \mathrm{~d} x $$
$F(x, \underline{\hat{x}})$ : Dirac mixture cumulative distribution
$$ F(x, \underline{\hat{x}})=\sum_{i=1}^{L} w_{i} \mathrm{H}\left(x-\hat{x}_{i}\right) \text { with } \mathrm{H}(x)=\int_{-\infty}^{x} \delta(t) \mathrm{d} t= \begin{cases}0 & x<0 \\ \frac{1}{2} & x=0 \\ 1 & x>0\end{cases} $$
with the vector of Dirac positions
$$ \underline{\hat{x}}=\left[\hat{x}_{1}, \hat{x}_{2}, \ldots, \hat{x}_{L}\right]^{\top} $$
We minimize the Cramér–von Mises distance $D(\underline{\hat{x}})$ with Newton’s method.
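For equal weights $w_i = \frac{1}{L}$ and a strictly increasing $\tilde{F}$, setting $\partial D / \partial \hat{x}_i = 0$ gives the stationarity condition $\tilde{F}(\hat{x}_i) = \frac{2i-1}{2L}$, so in this special case the positions can be read off the inverse CDF without Newton iterations. The sketch below uses this shortcut (not the Newton iteration mentioned above) for a standard normal density; the function name is hypothetical.

```python
from statistics import NormalDist

def dirac_positions_equal_weights(L, dist=NormalDist()):
    """Positions of L equally weighted Dirac components with F(x_i) = (2i - 1) / (2L)."""
    return [dist.inv_cdf((2 * i - 1) / (2 * L)) for i in range(1, L + 1)]

# Three components for a standard normal: CDF levels 1/6, 1/2 and 5/6
positions = dirac_positions_equal_weights(3)
```

The positions are symmetric around the mean, with the middle component at the median. For unequal weights or a multivariate density no such closed form is available, which is where the iterative minimization comes in.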

Generalization of concept of CDF

Localized Cumulative Distribution (LCD)
$$ F(\underline{m}, b)=\int_{\mathbb{R}^{N}} f(\underline{x}) K(\underline{x}-\underline{m}, b) \mathrm{d} \underline{x} $$

$K(\cdot, \cdot)$: Kernel
$$ K(\underline{x}-\underline{m}, b)=\prod_{k=1}^{N} \exp \left(-\frac{1}{2} \frac{\left(x_{k}-m_{k}\right)^{2}}{b^{2}}\right) $$

$\underline{m}$: Kernel location

$b$: Kernel width

Properties of LCD:

Symmetric

Unique

Multivariate

Generalized Cramér–von Mises Distance (GCvD)
$$ D=\int_{\mathbb{R}_{+}} w(b) \int_{\mathbb{R}^{N}}(\tilde{F}(\underline{m}, b)-F(\underline{m}, b))^{2} \mathrm{~d} \underline{m} \mathrm{~d} b $$

$\tilde{F}(\underline{m}, b)$: LCD of continuous density

$F(\underline{m}, b)$: LCD of Dirac mixture

Minimization of GCvD: Quasi-Newton method (L-BFGS)

Projected Cumulative Distribution (PCD)

Use reapproximation methods for univariate case in multivariate case.

Radon Transform

Represent general $N$-dimensional probability density functions via the set of all one-dimensional projections

Linear projection of random vector $\underline{\boldsymbol{x}} \in \mathbb{R}^{N}$ to scalar random variable $\boldsymbol{r} \in \mathbb{R}$ onto the line described by unit vector $\underline{u} \in \mathbb{S}^{N-1}$
$$ \boldsymbol{r} = \underline{u}^\top \underline{\boldsymbol{x}} $$

Given the probability density function $f(\underline{x})$ of random vector $\underline{\boldsymbol{x}}$, the density $f_r(r \mid \underline{u})$ is the Radon transform of $f(\underline{x})$ for all $\underline{u} \in \mathbb{S}^{N-1}$
$$ f_{r}(r \mid \underline{u})=\int_{\mathbb{R}^{N}} f(\underline{t}) \delta\left(r-\underline{u}^{\top} \underline{t}\right) \mathrm{d} \underline{t} $$

Representing PDFs by all one-dimensional projections

Represent the two densities $\tilde{f}(\underline{x})$ and $f(\underline{x})$ by their Radon transforms $\tilde{f}(r \mid \underline{u})$ and $f(r \mid \underline{u})$

Compare the sets of projections $\tilde{f}(r \mid \underline{u})$ and $f(r \mid \underline{u})$ for every $\underline{u} \in \mathbb{S}^{N-1}$. Resulting distance is
$$ D_{1}(\underline{u})=D(\tilde{f}(r \mid \underline{u}), f(r \mid \underline{u})) $$

Integrate these one-dimensional distance measures $D_1(\underline{u})$ over all unit vectors $\underline{u} \in \mathbb{S}^{N-1}$ to get the multivariate distance measure $D(\tilde{f}(\underline{x}), f(\underline{x}))$. Minimize via univariate Newton updates.

Naive particle filter

Exercise 13, Task 2

Prediction

💡 Update the sample positions. The weights stay the same.

Represent $f_{k}^{e}\left(\underline{x}_{k}\right)$ by a Dirac mixture
$$ f_{k}^{e}\left(\underline{x}_{k}\right)=\sum_{i=1}^{L} w_{k}^{e, i} \cdot \delta\left(\underline{x}_{k}-\underline{\hat{x}}_{k}^{e, i}\right) \qquad w_{k}^{e, i}=\frac{1}{L}, i \in\left\{1, \ldots, L\right\} $$

Draw samples for time step $k+1$
$$ \underline{\hat{x}}_{k+1}^{p, i} \sim f\left(\underline{x}_{k+1} \mid \underline{\hat{x}}_{k}^{e, i}\right) $$
The weights stay the same
$$ w_{k+1}^{p, i} = w_{k}^{e, i} $$

Represent $f_{k+1}^{p}\left(\underline{x}_{k+1}\right)$ by a Dirac mixture
$$ f_{k+1}^{p}\left(\underline{x}_{k+1}\right)=\sum_{i=1}^{L} w_{k+1}^{p, i} \delta\left(\underline{x}_{k+1}-\underline{\hat{x}}_{k+1}^{p, i}\right) $$

Filtering

💡 Update the weights. The sample positions stay the same.
$$ \begin{aligned} f_{k}^{e}\left(\underline{x}_{k}\right) &\propto f\left(\underline{y}_{k} \mid \underline{x}_{k}\right) \cdot f_{k}^{p}\left(\underline{x}_{k}\right)\\ &=f\left(\underline{y}_{k} \mid \underline{x}_{k}\right) \cdot \sum_{i=1}^{L} w_{k}^{p, i} \cdot \delta\left(\underline{x}_{k}-\underline{\hat{x}}_{k}^{p, i}\right)\\ &=\sum_{i=1}^{L} \underbrace{w_{k}^{p, i} \cdot f\left(\underline{y}_{k} \mid \hat{\underline{x}}_{k}^{p, i}\right)}_{\propto w_{k}^{e, i}} \cdot \delta(\underline{x}_{k}-\underbrace{\underline{\hat{x}}_{k}^{p, i}}_{\underline{\hat{x}}_{k}^{e, i}}) \end{aligned} $$

The positions stay the same
$$ \underline{\hat{x}}_{k}^{e, i} = \underline{\hat{x}}_{k}^{p, i} $$

Adapt the weights
$$ w_{k}^{e, i} \propto w_{k}^{p, i} \cdot f\left(\underline{y}_{k} \mid \hat{\underline{x}}_{k}^{p, i}\right) $$
and normalize
$$ w_{k}^{e, i}:=\frac{w_{k}^{e, i}}{\displaystyle \sum_{i} w_{k}^{e,i}} $$
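One prediction and one filter step of the naive particle filter can be sketched as follows. The function name `particle_filter_step`, the random-walk system model and the Gaussian likelihood are assumptions for illustration.

```python
import numpy as np

def particle_filter_step(particles, weights, y_meas, sample_transition, likelihood, rng):
    """Prediction moves the positions; filtering reweights them with the likelihood."""
    predicted = sample_transition(particles, rng)             # draw x_{k+1}^{p,i}, weights unchanged
    new_weights = weights * likelihood(y_meas, predicted)     # w^e proportional to w^p * f(y | x^{p,i})
    new_weights = new_weights / new_weights.sum()             # normalize
    return predicted, new_weights

# Hypothetical scalar model: x_{k+1} = x_k + w, y_k = x_k + v
def sample_transition(x, rng):
    return x + rng.normal(0.0, 0.1, size=x.shape)

def likelihood(y, x):
    return np.exp(-0.5 * (y - x) ** 2)    # Gaussian with Var(v) = 1, up to a constant factor

rng = np.random.default_rng(0)
L = 500
particles = rng.normal(0.0, 1.0, size=L)
weights = np.full(L, 1.0 / L)
particles, weights = particle_filter_step(
    particles, weights, 0.5, sample_transition, likelihood, rng
)
```

The constant factor of the likelihood can be dropped because the weights are normalized afterwards.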

Problem

The variance of the sample weights increases with every filter step

Particles die out $\rightarrow$ degeneration of the filter

Particles die out faster the more accurate the measurement is, because the likelihood becomes narrower (a paradox!)

Resampling

Approximate the weighted samples by unweighted ones
$$ f_{k}^{e}\left(\underline{x}_{k}\right)=\sum_{i=1}^{L} w_{k}^{e, i} \cdot \delta\left(\underline{x}_{k}-\underline{\hat{x}}_{k}^{e, i}\right) \approx \sum_{i=1}^{L} \frac{1}{L} \delta\left(\underline{x}_{k}-\underline{\hat{x}}_{k}^{e, i}\right) $$

Given: $L$ particles with weights $w_i$

Wanted: $L$ particles with weights $\frac{1}{L}$ (equally weighted)
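One simple way to obtain equally weighted particles is multinomial resampling: draw $L$ indices with probabilities given by the weights. This is a sketch of that particular scheme, which is an assumption here since the notes do not say which resampling method is meant; the function name `resample` is hypothetical.

```python
import numpy as np

def resample(particles, weights, rng):
    """Multinomial resampling: draw L particles with probability equal to their weight."""
    L = len(particles)
    idx = rng.choice(L, size=L, replace=True, p=weights)   # heavy particles are duplicated
    return particles[idx], np.full(L, 1.0 / L)             # all new weights are 1/L

rng = np.random.default_rng(0)
old_particles = np.array([0.0, 1.0, 2.0])
old_weights = np.array([0.0, 0.0, 1.0])     # degenerate case: one particle carries all weight
new_particles, new_weights = resample(old_particles, old_weights, rng)
```

Particles with negligible weight are dropped and particles with large weight are duplicated, so the new positions are a subset of the old ones.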

Sequential Importance Sampling

Obtain $f_{k}^{e}\left(\underline{x}_{k}\right)=f\left(\underline{x}_{k} \mid \underline{y}_{1: k}\right)$ by marginalizing over $\underline{x}_{1: k-1}$
$$ f_{k}^{e}\left(\underline{x}_{k}\right)=f\left(\underline{x}_{k} \mid \underline{y}_{1: k}\right)=\int_{\mathbb{R}^{N}} \cdots \int_{\mathbb{R}^{N}} f\left(\underline{x}_{1: k} \mid \underline{y}_{1 : k}\right) d \underline{x}_{1: k-1} $$

Importance sampling for $f(\underline{x}_{1:k} \mid \underline{y}_{1:k})$ with the proposal density $p$
$$ f_{k}^{e}\left(\underline{x}_{k}\right) = \int_{\mathbb{R}^{N}} \cdots \int_{\mathbb{R}^{N}} \underbrace{\frac{f\left(\underline{x}_{1: k} \mid \underline{y}_{1 : k}\right)}{p\left(\underline{x}_{1: k} \mid \underline{y}_{1 : k}\right)}}_{=: w_k^{e, i}} p\left(\underline{x}_{1: k} \mid \underline{y}_{1 : k}\right) d \underline{x}_{1: k-1} $$

Rewrite $\frac{f\left(\underline{x}_{1: k} \mid \underline{y}_{1 : k}\right)}{p\left(\underline{x}_{1: k} \mid \underline{y}_{1 : k}\right)}$

Numerator
$$ \begin{aligned} f\left(\underline{x}_{1: k} \mid \underline{y}_{1: k}\right) &\propto f\left(\underline{y}_{k} \mid \underline{x}_{1: k}, \underline{y}_{1: k - 1}\right) \cdot f\left(\underline{x}_{1: k} \mid \underline{y}_{1: k-1}\right)\\ &=f\left(\underline{y}_{k} \mid \underline{x}_{k}\right) \cdot f\left(\underline{x}_{k} \mid \underline{x}_{1:k-1}, \underline{y}_{1:k-1}\right) \cdot f\left(\underline{x}_{1:k-1} \mid \underline{y}_{1: k-1}\right)\\ &=f\left(\underline{y}_{k} \mid \underline{x}_{k}\right) \cdot f\left(\underline{x}_{k} \mid \underline{x}_{k-1}\right) \cdot f\left(\underline{x}_{1: k-1} \mid \underline{y}_{1: k - 1}\right) \end{aligned} $$

Denominator
$$ p\left(\underline{x}_{1: k} \mid \underline{y}_{1: k}\right)=p\left(\underline{x}_{k} \mid \underline{x}_{1: k - 1}, \underline{y}_{1: k}\right) \cdot p\left(\underline{x}_{1: k -1} \mid \underline{y}_{1: k - 1}\right) $$

Substitute and write $w_k^{e, i}$ in recursive form
$$ w_k^{e, i} = \frac{f\left(\underline{\hat{x}}_{1: k}^i \mid \underline{y}_{1 : k}\right)}{p\left(\underline{\hat{x}}_{1: k}^i \mid \underline{y}_{1 : k}\right)} \propto \frac{f\left(\underline{y}_{k} \mid \underline{\hat{x}}_{k}^i\right) \cdot f\left(\underline{\hat{x}}_{k}^i\mid \underline{\hat{x}}_{k-1}^i\right)}{p\left(\underline{\hat{x}}_{k}^i \mid \underline{\hat{x}}_{1: k - 1}^i, \underline{y}_{1: k}\right)} \cdot \underbrace{\frac{f\left(\underline{\hat{x}}_{1: k-1}^i \mid \underline{y}_{1: k - 1}\right)}{p\left(\underline{\hat{x}}_{1: k -1}^i \mid \underline{y}_{1: k - 1}\right)}}_{=w_{k-1}^{e, i}} $$
and normalize.

Special proposals

Standard proposal
$$ p\left(\underline{x}_{k} \mid \underline{x}_{k-1}, \underline{y}_{k}\right) \stackrel{!}{=} f\left(\underline{x}_{k} \mid \underline{x}_{k-1}\right) $$
Then
$$ w_{k}^{e, i} \propto \frac{f\left(\underline{y}_{k} \mid \hat{\underline{x}}_{k}^{i}\right) \cdot f\left(\hat{\underline{x}}_{k}^{i} \mid \hat{\underline{x}}_{k-1}^{i}\right)}{p\left(\underline{\hat{x}}_{k}^{i} \mid \hat{\underline{x}}_{k-1}^{i}, \underline{y}_k\right)} \cdot w_{k-1}^{e, i}=f\left(\underline{y}_{k} \mid \hat{\underline{x}}_{k}^{i}\right) \cdot w_{k - 1}^{e, i} $$
Very simple, but no improvement in performance.

Optimal proposal
$$ \begin{aligned} p\left(\underline{x}_{k} \mid \underline{x}_{k-1}, \underline{y}_{k}\right) &=f\left(\underline{x}_{k} \mid \underline{x}_{k-1}, \underline{y}_{k}\right) \\ & \propto f\left(\underline{y}_{k} \mid \underline{x}_{k}\right) \cdot f\left(\underline{x}_{k} \mid \underline{x}_{k-1}\right) \end{aligned} $$
Then
$$ w_k^{e, i} = w_{k-1}^{e, i} $$
Minimizes the variance of the weights, but is only usable in special cases.

Frequently asked exam questions

Tue, 13 Sep 2022 00:00:00 +0000

General questions

What did we do/learn/cover in the lecture?

What is state estimation?

What is a state?

Which kinds of systems are simple? Why?

Discrete-valued systems and continuous-valued linear systems.

Reason: constant computation and memory requirements

Discrete-valued systems

Derive the Wonham filter

Continuous-valued linear systems

Derive the linear Kalman filter

Properties of the KF

Continuous-valued weakly nonlinear systems

Wie kann man erkennen, ob ein System stark oder schwach nichtlinear?

Vergleich mit Taylor Entwicklung 1. Ordnung

Induzierte Nichtlinearität

Was kann man machen, wenn das System schwach nichtlinear ist?

Wie funktioniert die Zustandsschätzung bei schwach nichtlinearen Systemen?

Wie funktioniert das EKF? EKF herleiten

UKF erklären und herleiten

Wie funktioniert die Filterung mit Samples?

Wie können wir Samples von der Priore erzeugen?

Unterschied zwischen UKF und EnKF?

NLKF (KF in probabilistischer Form)

Allgemine Systeme

###Chapman-Komolgorov Gleichung herleiten

Problem von allgemeinen Systeme?

Prädiktion: Parameterintergral bei Chapman-Komolgorov Gleichung

Integrand hängt von $\underline{x}_{k+1}$ ab (lässt sich i.Allg nicht herausziehen)

Nur möglich für analytische Lösung

Sonst erfordert (numerische) Lösung des Integrals für alle $\underline{x}_{k+1}$

Filterung

Type der Dichte zur Beschreibung der Schätzung ändert sich

Dichte wrid mit jedem Schritt komplexer

Wie kann man gegen Parameterintergral bei Prädiktion tun?

Transitionsdichte $f\left(\underline{x}_{k+1} \mid \underline{x}_k\right)$ durch entkoppelte Mixture approximieren
$$ f\left(\underline{x}_{k+1} \mid \underline{x}_k\right) \approx \sum_{i=1}^L w_k^{(i)} f_{k+1}^{(i)}\left(\underline{x}_{k+1}\right) f_k^{(i)}\left(\underline{x}_k\right) $$
Advantage: the part of the integrand of the CK equation that depends on $\underline{x}_{k+1}$ can be pulled out of the integral. The remaining integral is a constant and is used as a factor for the new weight.
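To make this advantage explicit, the mixture can be substituted into the Chapman-Kolmogorov equation. This is only a sketch: $f_k^e$ denotes the current estimated density and $f_{k+1}^p$ the predicted density, and these two symbols as well as the constants $c_k^{(i)}$ are notational assumptions, not taken from the notes.
$$ f_{k+1}^{p}\left(\underline{x}_{k+1}\right)=\int f\left(\underline{x}_{k+1} \mid \underline{x}_k\right) f_k^{e}\left(\underline{x}_k\right) \mathrm{d} \underline{x}_k \approx \sum_{i=1}^L w_k^{(i)} f_{k+1}^{(i)}\left(\underline{x}_{k+1}\right) \underbrace{\int f_k^{(i)}\left(\underline{x}_k\right) f_k^{e}\left(\underline{x}_k\right) \mathrm{d} \underline{x}_k}_{=: c_k^{(i)}} $$
The predicted density is then again a mixture with components $f_{k+1}^{(i)}$ and new weights $w_k^{(i)} \cdot c_k^{(i)}$, and each constant $c_k^{(i)}$ has to be computed only once instead of once per value of $\underline{x}_{k+1}$.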

Convert a generative system (with additive or multiplicative noise) into its probabilistic form and derive it

Sampling

How does the particle filter work?

Introduction

Sun, 13 Sep 2020 00:00:00 +0000

What is NLP?

Wikipedia: Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction.

What is Dialog Modeling?

Designing/building a spoken dialog system with its goals, user handling etc.

Synonymous to dialog management (DM)

Examples

Goal-oriented dialog

Social dialog / Chat bot

How to do NLP?

Aim: Understand linguistic structure of communication

Idea: There are rules to decide if a sentence is correct or not

A proper sentence needs to have:

1 Subject

1 Verb

several objects (depending on the verb’s valence)

TL;DR

Task:

Linguistic dimension: Syntax, semantics, pragmatics

Level: Word, word groups, sentence, beyond sentences

Approaches

Technique:

Rule-based,

Statistical,

Neural

Learning scenario:

Supervised,

semi-supervised,

unsupervised,

reinforcement learning

Model:

Classification,

sequence classification,

sequence labeling,

sequence to sequence,

structure prediction

Technique

Hand-written rules to parse the sentences (Rule-based)

‼️Problems

There is no fixed set of rules

Language changes over time

Any language is constantly influenced by other languages

Classification of words into POS tags not always clear

Corpus-based Approaches to NLP (Statistical)

Corpus = large collection of annotated texts (or speech files)

👍 advantages:

Automatically learn rules from data

Statistical Models → no hard decision

Use machine learning approaches

Feasible now that larger computational resources are available

Corpus will concentrate on most common approaches

Input:

Data (Text corpora)

Machine learning algorithm

Output: Statistical model

Problems of simple statistical models: feature engineering

What features are important to determine the POS tag

Word ending

Surrounding words

Capitalization

Deep learning Approaches to NLP (Neural)

Use neural networks to automatically infer features

Better generalization

Successfully applied to many NLP tasks

Learning scenarios

Supervised learning

Unsupervised learning

Semi supervised learning

Reinforcement learning

Model types

| Model type | Input | Output | Example task |
| --- | --- | --- | --- |
| Classification | Fixed input size (e.g. word and surrounding $k$ words) | Label | Word sense disambiguation |
| Sequence classification | Sequence with variable length | Label | Sentiment analysis |
| Sequence labelling | Sequence with variable length | Label sequence with same length | Named entity recognition |
| Sequence to sequence model | Sequence with variable length | Sequence with variable length | Summarization |
| Structure prediction | Sequence with variable length | Complex structure | Parsing |

Resources

Texts

Brown Corpus

Penn Treebank

Europarl

Google books corpus

Dictionaries/Ontologies

WordNet,

GermaNet,

EuroWordNet

Approaches to Dialog Modeling

Many problems of NLP also apply to Dialog Modeling

Use conversational corpora for learning interaction patterns

Meeting Corpus (multiparty conversation)

Switchboard Corpus (telephone speech)

Problems ‼️

Very domain dependent

Need human interaction in training

Why is NLP hard?

Ambiguities! Ambiguities! Ambiguities!

Ambiguities

Examples:

Rare events

Calculate probabilities for events/words

Most words occur only very rarely

Most words occur one time

What to do with words that do not occur in the training data? 🧐

Zipf’s Law
$$ f \propto \frac{1}{r} $$

order list of words by occurrence

rank: position in the list

The frequency of any word is inversely proportional to its rank in the frequency table.

Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on.

For example, in the Brown Corpus of American English text, the word “the” is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf’s Law, the second-place word “of” accounts for slightly over 3.5% of words (36,411 occurrences), followed by “and” (28,852).
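The relation $f \propto 1/r$ can be checked against the Brown Corpus counts quoted above. The snippet is a small sketch; anchoring the prediction at the rank-1 count is an assumption made for illustration.

```python
# Counts quoted above for the Brown Corpus (ranks 1 to 3).
observed = {"the": 69971, "of": 36411, "and": 28852}

def zipf_prediction(top_count, rank):
    """Frequency predicted by f ∝ 1/r, anchored at the rank-1 count."""
    return top_count / rank

top_count = observed["the"]
for rank, (word, count) in enumerate(observed.items(), start=1):
    predicted = zipf_prediction(top_count, rank)
    print(f"{rank} {word:>3}: observed {count}, predicted {predicted:.0f}, "
          f"ratio {count / predicted:.2f}")
```

The second rank matches the prediction closely, while the third-ranked word is noticeably more frequent than a pure $1/r$ law predicts, so the law should be read as an approximation.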

Word Sense Disambiguation

Mon, 14 Sep 2020 00:00:00 +0000

Introduction

Definition

Word Sense Disambiguation

Determine which sense/meaning of a word is used in a particular context

Classification problem

Sense inventory

considered senses of the words

Word Sense Discrimination

Divide usages of a word into different meanings

Unsupervised Algorithms

Task

Determine which sense of a word is activated in a context

Find mapping $A$ for word $w_i$:
$$ A(i) \subseteq \operatorname{Sense}\_{D}\left(w\_{i}\right) $$

Mostly $|A(i)|=1$

Model as classification problem:

Assign sense based on context and external knowledge sources

Every word has different number of classes

$n$ distinct classification tasks ($n$ = vocabulary size)

Task-conditions

Word senses

Finite set of senses for every word

Automatic clustering of word senses

Sense inventories

coarse-grained

fine-grained

Text characteristics

domain-oriented

unrestricted

Target words

one target word per sentence

all words

Resources

Annotated data

Input data X and output/label data Y

Hard to acquire, but important

Supervised training

Unlabeled data

Input data X

Large amounts

Unsupervised training

Structured resources

Thesauri

Machine-readable dictionaries (MRDs)

Computational lexicons (WordNet)

Ontologies

Unstructured resources

Corpora

Collocations resources

🔴 Problems

Sense definition is task dependent

Different algorithms for different applications

No discrete sense division possible

Knowledge acquisition bottleneck

Intermediate task

Application

Machine Translation (MT)

Information Retrieval (IR)

Question Answering (QA)

Semantic interpretation

Approaches

Dictionary- and Knowledge-Based

Lesk method / Gloss overlap

💡 Idea: Words used together in a text are related to each other

Method: Find the word sense whose dictionary definition has the most overlap

Input: Dictionary with definitions of the different word senses

Overlap calculation

Two words $w_1$ and $w_2$

For each pair of senses $S_1$ in $\operatorname{Senses}(w_1)$ and $S_2$ in $\operatorname{Senses}(w_2)$:
$$ \operatorname{score}\left(S_1, S_{2}\right)=\left|\operatorname{gloss}(S_1) \cap \operatorname{gloss}\left(S_{2}\right)\right| $$

$\operatorname{gloss}(S_1)=\text{bag of words of definition of } S_1$

Problem: Many words in the context -> calculation very slow 🤪, because the number of sense combinations to score is
$$ \prod_{i=1}^{n} \left|\operatorname{Senses}\left(w_{i}\right)\right| $$

Variant (simplified): Calculate overlap between context (set of words in surrounding sentence or paragraph) and gloss:
$$ \operatorname{score}(S)=|\operatorname{context}(w) \cap \operatorname{gloss}(S)| $$

Example:

Problems:

depends heavily on the exact wording of the definitions

definitions are often very short
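The simplified variant, $\operatorname{score}(S)=|\operatorname{context}(w) \cap \operatorname{gloss}(S)|$, can be sketched in a few lines. The two glosses, the stop word list and the function name `simplified_lesk` are hypothetical and only serve as an illustration.

```python
def simplified_lesk(context_words, glosses, stopwords=frozenset()):
    """Return the sense whose gloss shares the most words with the context.

    glosses: dict mapping a sense name to its definition text.
    """
    context = {w.lower() for w in context_words} - stopwords
    best_sense, best_score = None, -1
    for sense, gloss in glosses.items():
        gloss_words = set(gloss.lower().split()) - stopwords
        score = len(context & gloss_words)  # |context(w) ∩ gloss(S)|
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense, best_score

# Hypothetical mini-dictionary for the ambiguous word "bank".
bank_glosses = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river": "sloping land beside a body of water such as a river",
}
stop = frozenset({"a", "an", "and", "as", "of", "that", "the", "to"})
sentence = "she went to the bank to deposit money".split()
sense, overlap = simplified_lesk(sentence, bank_glosses, stop)
```

In this toy example the overlap consists of the single word "money": "deposit" does not match "deposits" without lemmatization, which illustrates how strongly the method depends on the exact, often short, definitions.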

Supervised

💡 Train classifier using annotated examples (i.e., annotated text corpora)

Input features: Use context to disambiguate words

Problems:

high dimensionality of the feature space

data sparseness problem

Techniques:

Naive Bayes classifier

Instance-based Learning

SVM

Ensemble Methods

Neural Networks (e.g. Bi-LSTM)

Feature extraction

Feature vector:

Vector describing input data

Fixed number of dimensions

Challenges:

Variable sentence length

Unknown number of words

Two kinds of features in the vectors:

Collocational: Features about words at specific positions near target word

Think of it as an (ordered) list

Often limited to just word identity and POS

Example:

Bag-of-words: Features about words that occur anywhere in the window (regardless of position)

Think of it as “an unordered set of words”

Typically limited to frequency counts

How does it work?

Count the words that occur within the window.

First choose a vocabulary

Then count how often each of those terms occurs in a given window

sometimes just a binary “indicator” 1 or 0

Example:
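Both kinds of features can be sketched as follows for the target word "bass". The sentence, the POS tags, the window sizes and the four-word vocabulary are assumptions chosen for illustration.

```python
from collections import Counter

def collocational_features(tokens, tags, target_index, window=2):
    """Ordered (word, POS) pairs at fixed offsets around the target word."""
    features = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        pos = target_index + offset
        if 0 <= pos < len(tokens):
            features.append((tokens[pos], tags[pos]))
        else:
            features.append(("<pad>", "<pad>"))  # keep the vector length fixed
    return features

def bag_of_words_features(tokens, target_index, vocabulary, window=3):
    """Counts of vocabulary words inside the window, ignoring position."""
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    neighbours = tokens[lo:target_index] + tokens[target_index + 1:hi]
    counts = Counter(neighbours)
    return [counts[word] for word in vocabulary]

tokens = "an electric guitar and bass player stand off to one side".split()
tags = ["DT", "JJ", "NN", "CC", "NN", "NN", "VB", "RB", "TO", "CD", "NN"]
target = tokens.index("bass")
colloc = collocational_features(tokens, tags, target)
bow = bag_of_words_features(tokens, target, ["fishing", "guitar", "player", "river"])
```

Here `colloc` is `[("guitar", "NN"), ("and", "CC"), ("player", "NN"), ("stand", "VB")]` and `bow` is `[0, 1, 1, 0]`. Both have a fixed length regardless of the sentence length, which addresses the variable-length challenge mentioned above.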

Text processing

Tokenization

Part-of-speech tagging

Lemmatization

Chunking: divide text into syntactically correlated parts

Parsing

Feature definition

Local features

surrounding words, POS tags, position with respect to target word

Topical/Global features

general topic of a text

mostly bag-of-words representation of (sentence, paragraph, …)

Syntactic features

syntactic clues

can be outside the local context

Semantic features

previously determined senses of words in the context

Naive Bayes classifier

Input:

a word $w$ in a text window $d$ (which we’ll call a “document”)

a fixed set of classes $C = \{c_1, c_2, \dots, c_j\}$

A training set of $m$ hand-labeled text windows, again called “documents”: $(d_1, c_1), \dots, (d_m, c_m)$

Output: a learned classifier
$$ \gamma: d \to c $$

$P(c)$: prior probability of that sense

Counting in a labeled training set

$P(w|c)$: conditional probability of a word given a particular sense

$P(w|c) = \frac{\operatorname{count}(w, c)}{\operatorname{count}(c)}$

(We get both of these from a tagged corpus)

Example:

Example of naive Bayes classifier
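The counting scheme above can be sketched as follows. The four training windows are a toy data set invented for illustration, and add-one smoothing (`alpha=1.0`) is an assumption added here so that unseen word/sense pairs do not yield zero probabilities; the notes themselves only give the unsmoothed relative frequencies.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents):
    """documents: list of (list_of_words, sense) pairs."""
    sense_counts = Counter(sense for _, sense in documents)
    word_counts = defaultdict(Counter)
    for words, sense in documents:
        word_counts[sense].update(words)
    vocabulary = {w for words, _ in documents for w in words}
    return sense_counts, word_counts, vocabulary

def classify(words, model, alpha=1.0):
    """Return argmax_c of log P(c) + sum_w log P(w | c), with add-alpha smoothing."""
    sense_counts, word_counts, vocabulary = model
    total_docs = sum(sense_counts.values())
    best_sense, best_logp = None, -math.inf
    for sense, n_docs in sense_counts.items():
        logp = math.log(n_docs / total_docs)            # prior P(c)
        total_words = sum(word_counts[sense].values())  # count(c)
        for w in words:
            if w not in vocabulary:
                continue  # ignore words never seen in training
            p = (word_counts[sense][w] + alpha) / (total_words + alpha * len(vocabulary))
            logp += math.log(p)
        if logp > best_logp:
            best_sense, best_logp = sense, logp
    return best_sense

training = [
    ("fish smoked fish".split(), "bass/fish"),
    ("fish line".split(), "bass/fish"),
    ("fish haul smoked".split(), "bass/fish"),
    ("guitar jazz line".split(), "bass/music"),
]
model = train_naive_bayes(training)
prediction = classify("line guitar jazz jazz".split(), model)
```

Although the prior favors the fish sense (3 of 4 windows), the context words "guitar" and "jazz" outweigh it, so the prediction is `"bass/music"`.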

Instance-based Learning

Build classification model based on examples

k-Nearest Neighbor (k-NN) algorithm

💡Idea:

represent examples in vector space

define distance metric in vector space

find the $k$ nearest neighbors

take the most common sense among the $k$ nearest neighbors

Distance: e.g., Hamming distance
$$ \Delta\left(x, x_{i}\right)=\sum_{j} w_{j} \delta\left(x_{j}, x_{i_{j}}\right) $$

$\delta\left(x_{j}, x_{i_j}\right)=0$ if $x_{j}=x_{i_j},$ else 1

$w_j$: weight (e.g., gain ratio measure)

In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other.

Example:
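A minimal sketch of k-NN with the weighted Hamming distance defined above. The feature layout (previous word, next word, POS of the previous word), the stored examples and the weights are hypothetical.

```python
from collections import Counter

def weighted_hamming(x, x_i, weights):
    """Delta(x, x_i) = sum_j w_j * delta(x_j, x_ij), with delta = 0 if equal, else 1."""
    return sum(w for a, b, w in zip(x, x_i, weights) if a != b)

def knn_sense(query, examples, weights, k=3):
    """examples: list of (feature_tuple, sense); returns the majority sense."""
    nearest = sorted(examples,
                     key=lambda ex: weighted_hamming(query, ex[0], weights))[:k]
    votes = Counter(sense for _, sense in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical features: (previous word, next word, POS of previous word).
examples = [
    (("river", "was", "NN"), "bank/river"),
    (("the", "of", "DT"), "bank/river"),
    (("central", "raised", "JJ"), "bank/finance"),
    (("the", "raised", "DT"), "bank/finance"),
    (("a", "account", "DT"), "bank/finance"),
]
feature_weights = [1.0, 2.0, 0.5]  # assumed; e.g. from a gain ratio measure
query = ("central", "account", "JJ")
predicted = knn_sense(query, examples, feature_weights)
```

The two closest examples (distances 1.5 and 2.0) both carry the finance sense, so the majority among the three nearest neighbors is `"bank/finance"`.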

Ensemble Methods

Combine different classifiers

Classifiers have strengths in different situations

Improve by asking several experts

Algorithm:

Score input with several first-order classifiers

Combine results

Result:

Only best hypothesis (majority vote)

take decision of most classifiers

if tie, randomly choose between them
$$ \hat{S}=\underset{S\_i \in \operatorname{Sense}\_D(w)}{\operatorname{argmax}}\left|\\{j: \operatorname{vote}(C\_{j})=S\_{i}\\}\right| $$

Score for all hypotheses (Probability Mixture)

Normalize scores of every classifier to get probability
$$ P\_{C\_{j}}(S\_i)=\frac{\operatorname{score}\left(C\_{j}, S\_i\right)}{\sum\_{l} \operatorname{score}\left(C\_{j}, S\_l\right)} $$

Take class with highest sum of probabilities

$$ \hat{S}=\underset{S\_i \in \operatorname{Sense}\_D(w)}{\operatorname{argmax}}\sum\_{j=1}^{m}P\_{C\_j}(S\_i) $$

Ranking of all hypotheses (Rank-based Combination)
$$ \hat{S}=\underset{S\_i \in \operatorname{Sense}\_D(w)}{\operatorname{argmax}}\sum\_{j=1}^{m} -\operatorname{Rank}\_{C\_j}(S\_i) $$
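The three combination schemes can be compared on a toy example. The scores of the three classifiers are invented, and ties in the majority vote are broken by first occurrence here rather than randomly, which is a simplifying assumption.

```python
from collections import Counter

def majority_vote(votes):
    """votes: best sense of each classifier (ties: first occurrence wins)."""
    return Counter(votes).most_common(1)[0][0]

def probability_mixture(scores):
    """scores: list of dicts {sense: score}, one dict per classifier."""
    totals = Counter()
    for classifier_scores in scores:
        norm = sum(classifier_scores.values())
        for sense, score in classifier_scores.items():
            totals[sense] += score / norm  # normalized score P_{C_j}(S_i)
    return max(totals, key=totals.get)

def rank_based(scores):
    """Sum the negative ranks (rank 1 = best) and take the maximum."""
    totals = Counter()
    for classifier_scores in scores:
        ranked = sorted(classifier_scores, key=classifier_scores.get, reverse=True)
        for rank, sense in enumerate(ranked, start=1):
            totals[sense] -= rank
    return max(totals, key=totals.get)

scores = [
    {"s1": 0.9, "s2": 0.1, "s3": 0.0},
    {"s1": 0.35, "s2": 0.4, "s3": 0.25},
    {"s1": 0.2, "s2": 0.5, "s3": 0.3},
]
votes = [max(s, key=s.get) for s in scores]
by_vote = majority_vote(votes)
by_mixture = probability_mixture(scores)
by_rank = rank_based(scores)
```

In this example the majority vote and the rank-based combination select `s2`, while the probability mixture selects `s1`, because one very confident classifier dominates the sum of probabilities.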

Semi-supervised

‼️ Knowledge acquisition bottleneck: hard to get large amounts of annotated data

💡 Idea of Semi-supervised approaches:

Some initial model trained on small amounts of annotated data

Improve model using raw data

Bootstrapping

Seed data:

manually annotated

surefire decision rules

Train classifier on annotated data A

Select subset U’ of unlabeled data

Annotate U’ with classifier

Filter most reliable examples

Add examples to A

Repeat from training

Self-training

always use same classifier

Co-training

train classifier 1 (e.g. using local feature)

Annotate $P’$ with classifier 1

train classifier 2 (e.g. topical information) on $P’$ and A

Annotate $P’_2$ with classifier 2

train classifier 1 …

Unsupervised

💡 Idea

If a word is used in similar context, the meaning should be similar

If the word is used in completely different context, different meaning

Approach: Cluster contexts of words

Context clustering

Word space model:

Vector space with dimension of the words

vector for word $w$:

$j$-th component: number of co-occurrences of $w$ and $w\_j$

Similarity:
$$ \operatorname{sim}(v, w)=\frac{v \cdot w}{|v| \cdot |w|}=\frac{\displaystyle\sum\_{i=1}^{m} v\_{i} w\_{i}}{\sqrt{\displaystyle\sum\_{i=1}^{m} v\_{i}^{2} \displaystyle\sum\_{i=1}^{m} w\_{i}^{2}}} $$

Example:

Dimension: (food, bank)

restaurant=(210, 80)

money = (100, 250)
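For the two example vectors, the similarity can be computed directly. The helper name `cosine_similarity` is an assumption; the numbers are the ones given above.

```python
import math

def cosine_similarity(v, w):
    """sim(v, w) = (v . w) / (|v| * |w|)."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)

restaurant = (210, 80)   # co-occurrence counts with (food, bank)
money = (100, 250)
similarity = cosine_similarity(restaurant, money)   # roughly 0.68
```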

‼️ Problem:

sparse representation

Latent semantic analysis (LSA)

Context representation

Second-order vectors: average of all word vectors in the context

Example:

Cluster contexts

Agglomerative clustering

Start with one context per cluster

Merge most similar clusters

Continue until threshold is reached

Co-occurrence Graphs

HyperLex: Co-occurrence graph for one target ambiguous word $w$

Nodes: All Words occurring in a paragraph with $w$

Edge: words occur in same paragraph

Weight:
$$ \begin{array}{c} w_{i j}=1-\max \left(P\left(w_{i} | w_{j}\right), P\left(w_{j} | w_{i}\right)\right) \\\\ P\left(w_{i} | w_{j}\right)=\frac{\operatorname{freq}_{i j}}{\operatorname{freq}_{j}} \end{array} $$

Low weight -> High probability of co-occurring

Discard edges with very high weight
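The edge weighting and pruning can be sketched as follows. The frequency counts and the cut-off of 0.9 for a "very high weight" are hypothetical values chosen for illustration.

```python
def edge_weight(freq_i, freq_j, freq_ij):
    """w_ij = 1 - max(P(w_i | w_j), P(w_j | w_i))."""
    p_i_given_j = freq_ij / freq_j
    p_j_given_i = freq_ij / freq_i
    return 1.0 - max(p_i_given_j, p_j_given_i)

# Hypothetical paragraph counts for words co-occurring with a target word.
freq = {"water": 40, "dam": 10, "money": 50}
cofreq = {("water", "dam"): 8, ("water", "money"): 2}

threshold = 0.9  # assumed cut-off for "very high weight"
edges = {}
for (a, b), f_ab in cofreq.items():
    w = edge_weight(freq[a], freq[b], f_ab)
    if w <= threshold:   # discard edges between words that rarely co-occur
        edges[(a, b)] = w
```

Here "dam" appears with "water" in 8 of its 10 paragraphs, which yields a low weight of about 0.2 and keeps the edge, while the edge between "water" and "money" (weight 0.95) is discarded.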

How does HyperLex work?

Select Hubs (Nodes with highest degree)

Connect the target word to the hubs with weight 0

Calculate the minimum spanning tree

Given the target word in context $W = (w_1, w_2, \dots, w_n)$

Calculate a score vector for every word $w_j$ with components $s_k$ (if hub $h_k$ is an ancestor of $w_j$ in the tree, otherwise $s_k = 0$)
$$ s_{k}=\frac{1}{1+d\left(h_{k}, w_{j}\right)} $$

Sum all vectors and assign to hub with highest sum

Evaluation

Hand-annotated data

Precision

Recall

Task:

Lexical sample: only some words need to be disambiguated

All-words: all words need to be disambiguated

Baseline:

Random baseline: Randomly choose one class

First Sense Baseline: Always take most common sense

Sentiment Analysis

Tue, 15 Sep 2020 00:00:00 +0000

Introduction

Definition

Sentiment analysis / opinion mining

Determine opinion, sentiment and subjectivity in text

What is the author's opinion about something?

What are the pros and cons?

Important task in natural language processing

Application

Automatically maintain review and opinion-aggregation websites

Web search targeted towards reviews

generate results with variety of opinions

Improve customer relationship management

Automatically analyze customer feedback

Predict public attitudes towards brand/politics

Ad placement

Advertise products near positive text

Summarization

Question-answering

Challenges

Deep understanding

Co-reference resolution

Negation handling

Different hints in the text

Tasks of SA

Polarity classification

Binary classification: is a text (sentence, document) positive or negative?

Agreement detection

Do two texts agree in their opinion?

Rating

How does the user rate a product (1 to 5 stars)

Subjectivity detection

Is a text or sentence subjective or objective?

Feature/aspect-based sentiment analysis

Opinions expressed on different features/aspects

Viewpoints and perspectives

Polarity classification

Task:

Input: Text (Sentence, Document, Several Documents) (variable length)

Output: positive or negative opinion

Sequence classification

Techniques:

Keyword spotting

Lexical affinity

Statistical methods

Concept-based approaches

Keyword spotting

Classify based on occurrence of unambiguous affect words

E.g.: happy, sad, afraid, bored

‼️ Problems

affect-negated words

E.g.: “today was a happy day” vs “today wasn’t a happy day at all”

surface features

Often no obvious affect words are present
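A minimal sketch of keyword spotting and of the negation problem described above. The two word lists are a toy lexicon assumed for illustration.

```python
POSITIVE = {"happy", "glad", "delighted"}   # assumed toy lexicon
NEGATIVE = {"sad", "afraid", "bored"}

def keyword_polarity(text):
    """Classify by counting unambiguous affect words; negation is ignored."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "unknown"

first = keyword_polarity("today was a happy day")
second = keyword_polarity("today wasn't a happy day at all")
third = keyword_polarity("the plot never goes anywhere")
```

Both `first` and `second` come out as `"positive"`, because the negation is invisible to the keyword counter, and `third` is `"unknown"` because the sentence contains no affect word from the lexicon.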

Lexical affinity

Increase the number of considered words

Assign “probable” affinity to particular emotions

Example: “accident” (75% probability of indicating a negative affect, as in “car accident”)

Train probabilities from linguistic corpora

‼️ Problems

negated sentences (“I avoided an accident”)

Words with different meaning (“I met my girlfriend by accident”)

Bias towards training data → domain-dependent

Statistical methods

💡 Use machine-learning algorithm to train classifier

Input:

Represent input text as a feature vector

Feature selection important for classification performance

Classifier:

Naive Bayes

Support Vector Machines

Maximum-entropy-based classification

Features

Word representation

Position information

POS information

Syntax: Tree-based features

Negation features

Negation should invert the features of the sentence

Approaches:

Attach NOT to all words near a negation

However

Not all negations reverse the meaning

“No wonder this is considered one of the best”

Negation is often expressed without a keyword

“it avoids all clichés and predictability found in Hollywood movies”
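The "attach NOT" approach can be sketched as follows. The scope rule (mark every token from the negation word up to the next punctuation mark) and the list of negation triggers are assumptions; the notes only say that words near a negation are marked.

```python
import re

NEGATIONS = {"not", "no", "never"}   # assumed trigger list
CLAUSE_END = {".", ",", ";", "!", "?"}

def mark_negation(text):
    """Prefix NOT_ to every token between a negation word and the next punctuation."""
    tokens = re.findall(r"[\w']+|[.,;!?]", text.lower())
    marked, negating = [], False
    for token in tokens:
        if token in CLAUSE_END:
            negating = False          # the negation scope ends at punctuation
            marked.append(token)
        elif negating:
            marked.append("NOT_" + token)
        else:
            marked.append(token)
            if token in NEGATIONS or token.endswith("n't"):
                negating = True
    return marked

tokens = mark_negation("I didn't like this movie, but the music was great.")
```

The classifier then sees `NOT_like` as a feature distinct from `like`. As the two counterexamples above show, this heuristic fails for "No wonder ..." and misses negation that is expressed without a trigger word such as "avoids".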

Topic-oriented features

Opinion of a sentence depends on the topic of the article

Approach: Replace subject of the article by general term

Domain adaptation

Meaning depends on the domain

Different approaches to transfer knowledge from one domain to another

Search domain-independent features

Structural correspondence learning algorithm

Unsupervised approaches

Unsupervised lexicon induction

Find adjectives using linguistic heuristics

words that co-occur with “but”

elegant but over-priced

words that co-occur with “and”

clever and informative

Build graph

Cluster or build binary-partition

Assign polarity using some seed words

Relation identification

Sentence relationship

Objective and subjective sentence in a review

No random order

A subjective sentence is most probably followed by another subjective sentence

First separate the sentences into objective and subjective ones

Use labels of the surrounding sentences

Then use only the subjective sentences to classify the polarity of the review

The order of the sentences is important

End is more important than beginning

Use trajectory of local sentiments

Dialog structure

Class structure

One-vs.-all multi-class categorization

Model as Metric labeling problem

‼️ Problems of statistical methods

Need enough text to perform classification

Good performance on page and paragraph level

Problems on sentence or clause level

Concept-based approaches

Perform semantic text analysis

Resources:

Web ontologies

Semantic networks

try to recognize meaning/features

heavily rely on depth and breadth of knowledge base

Opinion summarization

Opinion-oriented extraction

Example: “What is the best about the new iPhone?”

Approach

Extract product features

nouns / frequent nouns

heuristic pruning

extract opinions associated with these features

sometimes also extract the opinion holder

What is opinion summarization?

Generate summary of large number of opinions

Aggregate results of sentiment prediction

Structured summaries:

breakdown by aspects/topics

text or visualization

Conceptual Framework

aspect-based summarization

non-aspect-based summarization

Aspect-based

💡 Divide input text into aspect/features/subtopics

E.g.: Review on iPod

battery life

design

price

Show structured details

How does aspect-based opinion summarization work?

Framework

aspect/feature identification

find important topics

sentiment prediction

determine the sentiment orientation

is the aspect judged positive/negative?

summary generation

present results

Aspect/Feature Identification

Find subtopics (In some cases already known)

Techniques

NLP-based approaches using POS-tagging /parse trees

Shallow parsing

use additional knowledge

Sentiment prediction

Predict sentiment for the different aspects

Learning approach:

Learn aspect level ratings using the global rating

Naive Bayes classifier

‼️ Problem: labeling examples is expensive

Most approaches use lexicon/rule-based methods

e.g. lists of positive and negative words (extended using WordNet)

Summary Generation

Generate and present the opinion summaries

Statistical Summary

show statistics about opinions on different aspects

directly use sentiment prediction output

easy to understand

Text selection

show small pieces of text as the summary

show strongest opinion words for every aspect

Aggregate Ratings

Show statistics and text selection

Summary with timeline

Show opinion trends over a timeline

Example:

Integrated Approaches

No clear separation of the different steps

Topic Sentiment Mixture Model

unsupervised approach

sentiment prediction and aspect identification in one step

Model: Probabilistic latent semantic analysis (PLSA)

Multi-task learning

CNN-based approach

C predefined aspect mappers

Sentiment classifiers

shared word embedding layer

LSTM with attention

Input: word embedding and aspect embedding

Relevant parts of the sentence identified through attention mechanism

Non-aspect-based opinion summarization

Basic Sentiment Summarization

Classify each input text separately

Count number of positive and negative opinions

Text Summarization

Opinion Integration:

Expert opinions: complete, but rarely updated

Ordinary opinions: unstructured, but updated more often

Combine both by first extracting information from expert opinions

Add information from the ordinary opinions

Contrastive Opinion Summarization

Show positive and negative aspects

Abstractive Text summarization

Part-of-Speech Tagging

Tue, 15 Sep 2020 00:00:00 +0000

Part-of-Speech Tagging

What is Part-of-Speech Tagging?

Part-of-Speech tagging:

Grammatical tagging

Word-category disambiguation

Task: Marking up a word in a text as corresponding to a particular part of speech

based on definition and context

Word level task: Assign one class to every word

Variations:

English schools: 9 POS

noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection.

POS-tagger: 50 – 150 classes

Plural, singular

POS + Morph tags:

More than 600

Gender, case, …

Data sources

Brown corpus

Penn Tree Bank

Tiger Treebank

🔴 Problems

Ambiguities

E.g.: “A can of beans” vs. “We can do it”

Many content words in English can have more than 1 POS tag

E.g.: play, flour

Data sparseness: What to do with rare words?

Disambiguate using context information 💪

Example applications

Information extraction

QA

Shallow parsing

Machine Translation

How to do POS Tagging?

Rule-based

Rule-based taggers use a dictionary or lexicon to obtain the possible tags for each word. If a word has more than one possible tag, hand-written rules are used to identify the correct one. Disambiguation can also be performed by analyzing the linguistic features of a word together with its preceding and following words. For example, a rule may state that if the preceding word is an article, then the word in question must be a noun.

Design rules to assign POS tags to words

How can one decide on the right POS tag used in a context?

Two sources of information:

Tags of other words in the context of the word we are interested in

knowing the word itself gives a lot of information about the correct tag

Syntagmatic approach

most obvious source of information

With a rule-based approach, only 77% of the words are tagged correctly 🤪

Example

Should play get an NN or VBP tag?

Take the more common POS tag sequence for phrase a new play:

AT JJ NN vs. AT JJ VBP

Lexical information

assign the most common tag to a word

90% correct !!! (favorable conditions)

This is so useful because the distribution of a word’s usages across different POS is typically extremely uneven → a word usually occurs as one POS

All modern taggers use a combination of syntagmatic and lexical information.

Statistical approaches should work well on POS tagging, assuming a word has different POS tags according to certain a priori probabilities

Brill-Tagger

Developed by Eric Brill in 1995

Algorithm

Initialize:

Every word gets most frequent POS

Unknown: Noun

Until no longer possible

Apply rules

Rules

Linguistically motivated

Machine learning algorithms

Wiki:

The Brill tagger is an inductive method for part-of-speech tagging. It can be summarized as an “error-driven transformation-based tagger”.

It is:

a form of supervised learning, which aims to minimize error; and,

a transformation-based process, in the sense that a tag is assigned to each word and changed using a set of predefined rules.

In the transformation process,

if the word is known, it first assigns the most frequent tag,

if the word is unknown, it naively assigns the tag “noun” to it.

Applying over and over these rules, changing the incorrect tags, a quite high accuracy is achieved.
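The two phases (initial tagging, then repeated rule application) can be sketched as follows. The lexicon and the single transformation rule are hand-picked assumptions for illustration; the actual Brill tagger learns its rules from an annotated corpus.

```python
def initial_tags(words, lexicon):
    """Known words get their most frequent tag, unknown words get NN."""
    return [lexicon.get(w.lower(), "NN") for w in words]

def apply_rule(tags, from_tag, to_tag, prev_tag):
    """Transformation: change from_tag to to_tag if the previous tag is prev_tag."""
    new_tags = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            new_tags[i] = to_tag
    return new_tags

# Hypothetical lexicon of most frequent tags and a hand-picked rule list.
lexicon = {"we": "PRP", "can": "MD", "a": "AT", "of": "IN",
           "beans": "NNS", "open": "VB"}
rules = [("MD", "NN", "AT")]   # modal -> noun if preceded by an article

words = "We can open a can of beans".split()
tags = initial_tags(words, lexicon)
changed = True
while changed:                 # repeat until no rule changes anything
    changed = False
    for rule in rules:
        new_tags = apply_rule(tags, *rule)
        if new_tags != tags:
            tags, changed = new_tags, True
```

Both occurrences of "can" start with the most frequent tag `MD`; the rule then corrects the second one, which follows the article "a", to `NN`.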

Statistical

Probabilistic tagging: Model POS tags as Sequence labeling

Wiki:

In machine learning, sequence labeling is a type of pattern recognition task that involves the algorithmic assignment of a categorical label to each member of a sequence of observed values.

A common example of a sequence labeling task is part of speech tagging, which seeks to assign a part of speech to each word in an input sentence or document. Sequence labeling can be treated as a set of independent classification tasks, one per member of the sequence. However, accuracy is generally improved by making the optimal label for a given element dependent on the choices of nearby elements, using special algorithms to choose the globally best set of labels for the entire sequence at once.

Sequence labeling

Input: sequence $x\_1, \dots, x\_n$

Output: Sequence $y\_1, \dots, y\_n$

Example

Model as Machine Learning Problem

💡 Classify each token independently but use as input features, information about the surrounding tokens (sliding window).

Training data

Labeled sequences $\left\\{\left(x^{1}, y^{1}\right),\left(x^{2}, y^{2}\right), \ldots,\left(x^{M}, y^{M}\right)\right\\}$

Learn model: $X \to Y$

Problem: Exponential number of solutions!!!

Number of solutions: $\text{\\#Classes}^{\text{\\#Words}}$

-> Can NOT directly model $P(y|x)$ or $P(x, y)$ 🤪

The model that includes frequency or probability (statistics) can be called stochastic. Any number of different approaches to the problem of part-of-speech tagging can be referred to as stochastic tagger.

Decision Trees

Automatically learn which question to ask

Probabilistic tagging

Define probability for tag sequence recursively

Using two models

$P(t\_n | t\_{n-1}, t\_{n-2})$: model using decision tree

$P(w\_n | t\_n)$

Lexicon

Suffix lexicon for unknown words

Which POS tag is attached to unknown words?

Depending on the ending some POS tags are more probable

Conditional Random Fields (CRFs)

Wiki:

Conditional random fields (CRFs) are a class of statistical modeling method often applied in pattern recognition and machine learning and used for structured prediction. Whereas a classifier predicts a label for a single sample without considering “neighboring” samples, a CRF can take context into account.

Hidden Markov Model (HMM):

Hidden states: POS

Output: Words

Task: Estimate state sequence from output

Generative model

Assign a joint probability $P(x, y)$ to paired observation and label sequences

Problem when modeling $P(x)$

Introduce highly dependent features

Example: Word, Capitalization, Suffix, Prefix

Possible solutions:

Model dependencies

How does the capitalization depend on the suffix?

Independence assumption

Hurts performance

Discriminative Model

Directly model $P(y|x)$

No model for $P(x)$ is involved

Not needed for classification since x is observed

Linear Chain Conditional Random Fields

$x$: random variable (Representing the input)

$y$: random variable (POS tags)

$\theta$: Parameter

$f(y, y', x)$: feature function

Model:
$$ p(\mathbf{y} | \mathbf{x})=\frac{1}{Z(\mathbf{x})} \prod\_{t=1}^{T} \exp \left\\{\sum\_{k=1}^{K} \theta\_{k} f\_{k}\left(y\_{t}, y\_{t-1}, \mathbf{x}\_{t}\right)\right\\} $$
$$ Z(\mathbf{x})=\sum\_{\mathbf{y}} \prod\_{t=1}^{T} \exp \left\\{\sum\_{k=1}^{K} \theta\_{k} f\_{k}\left(y\_{t}, y\_{t-1}, \mathbf{x}\_{t}\right)\right\\} $$
Feature functions

First-order dependencies

$\mathbf{1}(y'=\text{DET}, y=\text{NN})$

$\mathbf{1}(y'=\text{DET}, y=\text{VB})$

Lexical: $\mathbf{1}(y=\text{DET}, x=\text{"the"})$

Lexical with context: $\mathbf{1}(y=\text{NN}, x=\text{"can"}, \operatorname{pre}(x)=\text{"the"})$

Additional features: $\mathbf{1}(y=\text{NN}, \operatorname{cap}(x)=true)$

Inference

Task: Get the most probable POS sequence

Problem: Exponential number of label sequences 🤪

Linear-chain layout

Dynamic programming can be used

$\rightarrow$ Efficient computing
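The dynamic program for the linear-chain layout is the Viterbi algorithm. The following is a minimal sketch in which `score(prev_label, label, t)` stands for $\sum_k \theta_k f_k(y_t, y_{t-1}, \mathbf{x}_t)$; the weights in `transition` and `emission` are invented for illustration and are not trained parameters.

```python
def viterbi(n_tokens, labels, score):
    """Most probable label sequence for a linear-chain model.

    score(prev_label, label, t) returns sum_k theta_k * f_k(y_t, y_{t-1}, x_t);
    prev_label is None at t = 0.
    """
    best = [{y: score(None, y, 0) for y in labels}]
    back = []
    for t in range(1, n_tokens):
        best.append({})
        back.append({})
        for y in labels:
            # Best predecessor for label y at position t.
            prev = max(labels, key=lambda p: best[t - 1][p] + score(p, y, t))
            best[t][y] = best[t - 1][prev] + score(prev, y, t)
            back[t - 1][y] = prev
    last = max(labels, key=lambda y: best[-1][y])
    path = [last]
    for pointers in reversed(back):   # follow the back-pointers
        path.append(pointers[path[-1]])
    return list(reversed(path))

# Hypothetical weights theta for a few indicator features.
words = ["the", "can", "rusts"]
transition = {("DET", "NN"): 2.0, ("DET", "VB"): -1.0, ("NN", "VB"): 1.5}
emission = {("the", "DET"): 3.0, ("can", "VB"): 1.0, ("can", "NN"): 0.5,
            ("rusts", "VB"): 2.0}

def score(prev, label, t):
    return transition.get((prev, label), 0.0) + emission.get((words[t], label), 0.0)

tags = viterbi(len(words), ["DET", "NN", "VB"], score)
```

The cost is $O(T \cdot |Y|^2)$ instead of $|Y|^T$. Although "can" alone scores higher as `VB` in this toy setup, the transition feature after `DET` makes `["DET", "NN", "VB"]` the best overall sequence.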

Training

Task: How to find the best weight $\theta$ ?

💡 Maximum (Log-)Likelihood estimation

Maximize probability of the training data

Given: $M$ sequences with labels $(x^1, y^1), \dots, (x^M, y^M)$

Maximize
$$ l(\theta)=\sum\_{k=1}^{M} \log P\left(y^{k} | x^{k}, \theta\right) $$

Regularization

Prevent overfitting by preferring lower weights
$$ \sum\_{k=1}^{M} \log \left(P\left(y^{k} | x^{k}, \theta\right)\right)-\frac{1}{2} C\|\theta\|^{2} $$

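Concave objective (its negative is a convex function)

$\Rightarrow$ Can use gradient-based optimization (e.g. gradient descent on the negative objective) to find the optimal value 👏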

Neural Network

🔴 Data sparseness Problem

Many words are rarely seen in training $\Rightarrow$ Hard to estimate probabilities 🤪

CRFs:

Use many features to represent the word

Problem: A lot of engineering!

Neural networks

Able to learn hidden representation

Learn representation of words based on letters, E.g.:

Words ending in -ness will be nouns

Words ending in -phobia will be nouns

Words ending in -ly are often adverbs

Structure

First layer: Word representation

CNN

Learn mapping: Word $\to$ continuous vector

Second layer:

Use several words to predict POS tag

Feed forward net

RNN: contains the complete history

Training

Train both layers together using backpropagation

Named Entity Recognition

Tue, 15 Sep 2020 00:00:00 +0000

Introduction

Definition

Named Entity: some entity represented by a name

Named Entity Recognition: Find and classify named entities in text

Why useful?

Create indices & hyperlinks

Information extraction

Establish relationships between named entities, build knowledge base

Question answering: answers often NEs

Machine translation: NEs require special care

NEs often unknown words, usually passed through without translation.

Why difficult?

World knowledge

Non-local decisions

Domain specificity

Labeled data is very expensive

Label Representation

IO

I: Inside

O: Outside (indicates that a token belongs to no chunk)

BIO

B: Begin

Wiki

The IOB format (short for inside, outside, beginning), a synonym for BIO format, is a common tagging format for tagging tokens in a chunking task in computational linguistics.

B-prefix before a tag: indicates that the tag is the beginning of a chunk

I-prefix before a tag: indicates that the tag is the inside of a chunk

O tag: indicates that a token belongs to NO chunk

BIOES

E: Ending character

S: single element

BILOU

L: Last character

U: Unit length

Example:

Fred showed Sue Mengqiu Huang's new painting
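The tagging schemes above can be compared on this sentence with a small sketch. The tokenization and the entity spans (Fred, Sue, and Mengqiu Huang as persons) are assumptions made for illustration, and `encode` is a hypothetical helper, not part of any library.

```python
def encode(tokens, spans, scheme="BIO"):
    """Turn entity spans (start, end_exclusive, label) into per-token tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        for i in range(start, end):
            if scheme == "IO":
                prefix = "I"
            elif scheme == "BIO":
                prefix = "B" if i == start else "I"
            elif scheme == "BIOES":
                if end - start == 1:
                    prefix = "S"          # single-token entity
                elif i == start:
                    prefix = "B"
                elif i == end - 1:
                    prefix = "E"          # last token of the entity
                else:
                    prefix = "I"
            else:
                raise ValueError(f"unknown scheme: {scheme}")
            tags[i] = f"{prefix}-{label}"
    return tags

# Assumed tokenization and gold spans for the example sentence.
tokens = ["Fred", "showed", "Sue", "Mengqiu", "Huang", "'s", "new", "painting"]
spans = [(0, 1, "PER"), (2, 3, "PER"), (3, 5, "PER")]

io_tags = encode(tokens, spans, "IO")
bio_tags = encode(tokens, spans, "BIO")
bioes_tags = encode(tokens, spans, "BIOES")
```

With IO, "Sue" and "Mengqiu Huang" collapse into one run of `I-PER` tags, so the boundary between two adjacent entities is lost; BIO and BIOES keep it.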

Data

CoNLL03 shared task data

MUC7 dataset

Guideline examples for special cases:

Tokenization

Elision

Evaluation

Precision and Recall
$$ \text{Precision} = \frac{\\# \text { correct labels }}{\\# \text { hypothesized labels }} = \frac{TP}{TP + FP} $$ $$ \text{Recall} = \frac{\\# \text { correct labels }}{\\# \text { reference labels }} = \frac{TP}{TP + FN} $$

Phrase-level counting

System 1:

“$\text{\\$200,000,000}$" is correctly recognized as NE $\Rightarrow$ TP =1

“First Bank of Chicago” is incorrectly recognised as non-NE (i.e., O) $\Rightarrow$ FN = 1

Therefore:

$\text{Precision} = \frac{1}{1 + 0} = 1$

$\text{Recall} = \frac{1}{1 + 1} = \frac{1}{2}$

System 2:

“$\text{\\$200,000,000}$" is correctly recognized as NE $\Rightarrow$ TP =1

For “First Bank of Chicago”

| Word | Actual label | Predicted label |
|---|---|---|
| First | ORG | O |
| Bank of Chicago | ORG | ORG |

There’s a boundary error (since we consider the whole phrase):

FN = 1

FP = 1

Therefore:

$\text{Precision} = \frac{1}{1 + 1} = \frac{1}{2}$

$\text{Recall} = \frac{1}{1 + 1} = \frac{1}{2}$

Problems

Punish partial overlaps

Ignore true negatives

Token-level

In token-level, we consider these tokens: “First”, “Bank”, “of”, “Chicago”, and “$200,000,000”

System 1

“$\text{\\$200,000,000}$" is correctly recognized as NE $\Rightarrow$ TP =1

“First”, “Bank”, “of”, “Chicago” are incorrectly recognised as non-NE (i.e., O) $\Rightarrow$ FN = 4

Therefore:

$\text{Precision} = \frac{1}{1 + 0} = 1$

$\text{Recall} = \frac{1}{1 + 4} = \frac{1}{5}$

Partial overlaps rewarded!

But

longer entities weighted more strongly

True negatives still ignored 🤪

$F\_1$ score (harmonic mean of precision and recall)
$$ F\_1 = \frac{2 \times \text { precision } \times \text { recall }}{\text { precision }+\text { recall }} $$
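As a quick check of the numbers above, the three metrics can be computed directly from the TP/FP/FN counts. This is a minimal sketch; the function name is illustrative, and returning 0 for an empty denominator is an assumption.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts (0.0 if undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts taken from the examples above.
phrase_system_1 = precision_recall_f1(tp=1, fp=0, fn=1)
phrase_system_2 = precision_recall_f1(tp=1, fp=1, fn=1)
token_system_1 = precision_recall_f1(tp=1, fp=0, fn=4)
```

Under these counts, System 1 has $F\_1 = 2/3$ at phrase level but only $1/3$ at token level, which shows how strongly the long entity weighs in token-level counting.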

Text Representation

Local features

Previous two predictions (tri-gram feature)

$y\_{i-1}$ and $y\_{i-2}$

Current word $x\_i$

Word type $x\_i$

all-capitalized, is-capitalized, all-digits, alphanumeric, …

Word shape

lower case - ‘x’

upper case - ‘X’

numbers - ‘d’

retain punctuation

Word substrings

Tokens in window

Word shapes in window

…
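The word shape feature listed above can be sketched in a few lines. The function name is illustrative; the mapping follows the rules given in the notes (lower case to `x`, upper case to `X`, digits to `d`, punctuation retained).

```python
def word_shape(word):
    """Map each character to its shape class, keeping punctuation as-is."""
    shape = []
    for ch in word:
        if ch.isdigit():
            shape.append("d")
        elif ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        else:
            shape.append(ch)  # retain punctuation and other symbols
    return "".join(shape)

shapes = [word_shape(w) for w in ["CoNLL03", "$200,000,000", "Chicago"]]
```

Some systems additionally collapse repeated shape characters (e.g. `Xxxxxxx` to `Xx`); that variant is not shown here.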

Non-local features

Identify tokens that should have same labels

Type:

Context aggregation

Derived from all words in the document

No dependencies on predictions, usable with any inference algorithm

Prediction aggregation

Derived from predictions of the whole document

Global dependencies; Inference:

first apply baseline without non-local features

then apply second system conditioned on output of first system

Extended prediction history

Condition only on past predictions –> greedy / beam search

💡 Intuition: Beginning of document often easier, later in document terms often get abbreviated

Sequence Model

HMMs

Generative model

Generative story:

Choose document length $N$

For each word $t = 1, \dots, N$:

Draw NE label $\sim P(y\_t | y\_{t-1})$

Draw word $\sim P\left(x\_{t} | y\_{t}\right)$
$$ P(\mathbf{y}, \mathbf{x})=\prod\_{t=1}^{N} P\left(y\_{t} | y\_{t-1}\right) P\left(x\_{t} | y\_{t}\right) $$

Example

👍 Pros

intuitive model

Works with unknown label sequences

Fast inference

👎 Cons

Strong limitation on textual features (conditional independence)

Model overly simplistic (can improve the generative story but would lose fast inference)

Max. Entropy

Discriminative model $P(y\_t|x)$

Don’t care about generation process or input distribution

Only model conditional output probabilities

👍 Pros: Flexible feature design

👎 Cons: local classifier -> disregard sequence information

CRF

Discriminative model

👍 Pros:

Flexible feature design

Condition on local sequence context

Training as easy as MaxEnt

👎 Cons: Still no long-range dependencies possible

Modelling

Difference to POS

Long-range dependencies

Alternative resources can be very helpful

Many named entities are more than one word long

Example

Inference

Viterbi

Finds exact solution

Efficient algorithmic solution using dynamic programming

Complexity exponential in order of Markov model

Only feasible for small order
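A minimal sketch of Viterbi decoding for a first-order HMM tagger follows. The two-tag model and all probabilities are made-up toy values, chosen only to make the dynamic programming visible.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence and its joint probability."""
    # best[t][s]: probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s].get(obs[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] * trans_p[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score * emit_p[s].get(obs[t], 0.0)
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1], best[-1][last]

# Toy model (assumed values, not estimated from data).
states = ["O", "PER"]
start_p = {"O": 0.6, "PER": 0.4}
trans_p = {"O": {"O": 0.7, "PER": 0.3}, "PER": {"O": 0.6, "PER": 0.4}}
emit_p = {
    "O": {"showed": 0.5, "Fred": 0.05, "Sue": 0.05},
    "PER": {"Fred": 0.4, "Sue": 0.4, "showed": 0.01},
}

path, prob = viterbi(["Fred", "showed", "Sue"], states, start_p, trans_p, emit_p)
```

The table has one cell per (position, state) pair, so the cost is linear in sentence length and quadratic in the number of states. For a higher-order model the "state" becomes a tuple of labels, which is where the exponential growth in the model order comes from.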

Greedy

At each timestep, choose locally best label

Fast, support conditioning on global history (not future)

No possibility for “label revision”

Beam

Keep a beam of the $n$ best greedy solutions, expand and prune

Limited room for label revisions

Gibbs Sampling

Stochastic method

Easy way to sample from multivariate distribution

Normally used to approximate joint distributions or integrals
$$ P\left(y^{(t)} | y^{(t-1)}\right)=P\left(y\_{i}^{(t)} | y\_{-i}^{(t-1)}, x\right) $$

$-i$ means all states except $i$

💡 Intuitively:

Sample one variable at a time, conditioned on current assignment of all other variables

Keep checkpoints (e.g. after each sweep through all variables) to approximate distribution

In our case:

Initialize NER tags (e.g. random or via Viterbi baseline model)

Re-sample one tag at a time, conditioned on input and all other tags

After sampling for a long time, we can estimate the joint distribution over outputs $P(y|x)$

However, it’s slow, and we may only be interested in the best output 🤪

Could choose best instead of sampling
$$ y^{(t)}=y\_{-i}^{(t-1)} \cup \underset{y\_{i}^{(t)}}{\operatorname{argmax}}\left(P\left(y\_{i}^{(t)} | y\_{-i}^{(t-1)}, x\right)\right) $$

will get stuck in local optima 😭

Better: Simulated annealing

Gradually move from sampling to argmax $$ P\left(y^{(t)} | y^{(t-1)}\right)=\frac{P\left(y\_{i}^{(t)} | y\_{-i}^{(t-1)}, x\right)^{1 / c\_{t}}}{\displaystyle\sum\_{j} P\left(y\_{j}^{(t)} | y\_{-j}^{(t-1)}, x\right)^{1 / c\_{t}}} $$

External Knowledge

Data

Supervised learning:

Label Data:

Text

NE Annotation

Unsupervised learning

Unlabeled Data: Text

Problem: Hard to directly learn NER

Semi-supervised: Labeled and Unlabeled Data

Word Clustering

Problem: Data Sparsity

Idea

Find lower-dimensional representation of words

real vector /probabilities have natural measure of similarity

Which words are similar?

Distributional notion

if they appear in similar context, e.g.

“president” and “chairman” are similar

“cut” and “knife” not

Words in same cluster should be similar

Brown clusters

Bottom-up agglomerative word clustering

Input: Sequence of words $w\_1, \dots, w\_n$

Output

binary tree

Cluster: subtree (according to desired #clusters)

💡 Intuition: put syntactically “exchangeable” words in the same cluster. E.g.:

Similar words: president/chairman, Saturday/Monday

Not similar: cut/knife

Algorithm:

Initialization: Every word is its own cluster

While there is more than one cluster

Merge the two clusters whose merge maximizes the quality of the clustering

Result:

Hard clustering: each word belongs to exactly one cluster

Quality of the clustering

Use class-based bigram language model

Quality: logarithm of the probability of the training text normalized by the length of the text
$$ \begin{aligned} \text { Quality }(C) &=\frac{1}{n} \log P\left(w\_{1}, \ldots, w\_{n}\right) \\\\ &=\frac{1}{n} \log P\left(w\_{1}, \ldots, w\_{n}, C\left(w\_{1}\right), \ldots, C\left(w\_{n}\right)\right) \\\\ &=\frac{1}{n} \log \prod\_{i=1}^{n} P\left(C\left(w\_{i}\right) | C\left(w\_{i-1}\right)\right) P\left(w\_{i} | C\left(w\_{i}\right)\right) \end{aligned} $$

Parameters: estimated using maximum-likelihood

Parsing

Tue, 15 Sep 2020 00:00:00 +0000

TL;DR

Represent and analyze sentence structure

Phrase structure grammar

Context free grammar

Problems:

Ambiguities : PP Attachment

Traditional Approaches

Statistical parsing

Probabilistic Context Free Grammar

CYK Algorithm

Transition-based parsing

Grammaticality

Common approach in statistical natural language processing: n-gram Language Model

E.g., tri-gram
$$ \begin{array}{l} P\left(w_{1}, \ldots, w_{n}\right) \\ =P\left(w_{1}\right) * P\left(w_{2} \mid w_{1}\right) * P\left(w_{3} \mid w_{1} w_{2}\right) * \ldots * P\left(w_{n} \mid w_{1} \ldots w_{n-1}\right) \\ \approx \prod_{i=1}^{n} P\left(w_{i} \mid w_{i-2} w_{i-1}\right) \end{array} $$
Problems of Language Models

Generalization: even with a very long context there are sentences you cannot model with an n-gram language model

Overall sentence structure

How can we model what a grammatically correct sentence is?

Need arbitrary context

Use grammar describing generation of the sentence

Phrase structure grammar

Describe sentence structure by grammar (Constituency relation)

Phrase structure organizes words into nested constituents (can represent the grammar with CFG rules)

Units in the grammar: Constituents

Can be moved around

I saw you today

Today, I saw you

expand/contract

I saw the boy

I saw him

I saw the old boy

Wiki

In syntactic analysis, a constituent is a word or a group of words that function as a single unit within a hierarchical structure. The constituent structure of sentences is identified using tests for constituents.

A phrase is a sequence of one or more words (in some theories two or more) built around a head lexical item and working as a unit within a sentence. A word sequence is shown to be a phrase/constituent if it exhibits one or more of the behaviors discussed below.

Phrase structure rules

Describe syntax of language

Example

S –> NP VP (Sentence consists of a noun phrase and a verb phrase)

NP –> Det N (A noun phrase consists of a determiner and a noun)

Only looking at the syntax

No semantics

Wiki:

In linguistics, phrase structure grammars are all those grammars that are based on the constituency relation, as opposed to the dependency relation associated with dependency grammars; hence, phrase structure grammars are also known as constituency grammars

The fundamental trait that these frameworks all share is that they view sentence structure in terms of the constituency relation.

Example: Constituency relation Vs. Dependency relation

Context Free Grammar

Constituency = phrase structure grammar = context-free grammars (CFGs)

Introduced by Chomsky

4-tuple:
$$ G = (V, \Sigma, R, S) $$

$V$: finite set of non-terminals

variables describing the phrases (NP, VP, …)

$\Sigma$: finite set of terminals

content of the sentence

all words in the grammar

$R$: finite relation from $V$ to $(V \cup \Sigma)^{\*}$

Rules defining how non-terminals can be replaced

E.g.: S –> NP VP

$S$: start symbol

Example

Dependency Structure

Different approach to describe sentence structure

Identify semantic relations!

Idea:

Which words depend on which words

Which word modifies which word

Example:

Wiki

The (finite) verb is taken to be the structural center of clause structure. All other syntactic units (words) are either directly or indirectly connected to the verb in terms of the directed links, which are called dependencies.

A dependency structure is determined by the relation between a word (a head) and its dependents. Dependency structures are flatter than phrase structures in part because they lack a finite verb phrase constituent, and they are thus well suited for the analysis of languages with free word order, such as Czech or Warlpiri.

Difficulties

Ambiguities!!!

E.g.: Prepositional phrase attachment ambiguity

Parsing

Automatically generate parse tree for sentence

Given:

Grammar

Sentence

Find: hidden structure

Idea: Search for different parses

Applications

Question – Answering

Named Entity extraction

Sentiment analysis

Sentence Compression

Traditional approaches

Hand-defined rules: restrict rules by hand to have at best only one possible parse tree

🔴 Problems

Many parses for the same sentence

Coverage Problem (Many sentences could not be parsed)

Time and cost intensive

Statistical parsing

Use machine learning techniques to distinguish probable and less probable trees

Automatically learn rules from training data

Hand-annotated text with parse trees

still many parse trees for one sentence 🤪

But weights define most probable

Tasks

Training: learn possible rules and their probabilities

Search: find most probable parse tree for sentence

Annotated Data

Treebank:

human annotated sentence with structure

Words

POS Tags

Phrase structure

👍 Advantages:

Reusable

High coverage

Evaluation

Example

Probabilistic Context Free Grammar

Extension to Context Free Grammar

Formal definition: 5-tuple
$$ G = (V, \Sigma, R, S, P) $$

$V, \Sigma, R, S$: same as Context Free Grammar

$P$: set of Probabilities on production rules

E.g.: S –> NP VP 0.5

Properties

Probability of derivation is product over all rules
$$ P(D)=\prod_{r \in D} P(r) $$

Probabilities of all rules that rewrite the same non-terminal sum to one
$$ \sum_{\alpha} P(A \rightarrow \alpha)=1 \quad \text{for every non-terminal } A \in V $$

Sum over all derivations is one
$$ \sum_{D \in S} P(D)=1 $$

Training

Input: Annotated training data (E.g.: Treebank)

Training

Rule extraction: Extract possible rules from the trees of the training data

Probability estimation

Assign probabilities of the rules

Maximum-likelihood estimation
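The two training steps can be sketched on a toy treebank. The trees, the nested-tuple encoding `(label, child, ...)` and the function names are assumptions for illustration; the estimate is the relative frequency $P(A \rightarrow \alpha) = \\#(A \rightarrow \alpha) / \\#(A)$.

```python
import math
from collections import Counter

def extract_rules(tree):
    """Yield (lhs, rhs) for every internal node of a tree (label, child, ...)."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        yield label, (children[0],)              # lexical rule, e.g. N -> fish
        return
    yield label, tuple(child[0] for child in children)
    for child in children:
        yield from extract_rules(child)

def estimate_pcfg(treebank):
    """Maximum-likelihood rule probabilities: count(A -> rhs) / count(A)."""
    rule_counts = Counter(rule for tree in treebank for rule in extract_rules(tree))
    lhs_counts = Counter()
    for (lhs, _), count in rule_counts.items():
        lhs_counts[lhs] += count
    return {rule: count / lhs_counts[rule[0]] for rule, count in rule_counts.items()}

def derivation_probability(tree, probs):
    """P(D) as the product of the probabilities of all rules used in the tree."""
    return math.prod(probs[rule] for rule in extract_rules(tree))

# Toy treebank (assumed, two hand-written trees).
t1 = ("S", ("NP", "she"),
      ("VP", ("V", "eats"), ("NP", ("Det", "a"), ("N", "fish"))))
t2 = ("S", ("NP", "she"), ("VP", ("V", "eats")))

probs = estimate_pcfg([t1, t2])
p_t1 = derivation_probability(t1, probs)
```

Here `NP -> she` is seen twice and `NP -> Det N` once, so they receive 2/3 and 1/3, and the probabilities of all rules with the same left-hand side sum to one by construction.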

Example:

Search

Find possible parses of the sentence

Statistical approach: Find all/many possible parse trees

Return most probable one

Strategies:

Top-Down

Bottom up

Shift reduce algorithm

Shift: advances in the input stream by one symbol. That shifted symbol becomes a new single-node parse tree.

Reduce: applies a completed grammar rule to some of the recent parse trees, joining them together as one tree with a new root symbol.

Example

Shift-reduce algorithm example

Dynamic Programming

CYK Parsing

Avoid repeated work

Use Dynamic Programming

Transform grammar into Chomsky normal form

Store best trees for subphrases

Combine tree from best trees of subphrases

In Chomsky normal form, all rules must have one of the following forms

A –> BC

A, B, C non-terminals

B, C not the start symbol

A –> a

A non-terminal

a terminal

S –> $\epsilon$

Create empty string if it is in the grammar

Every context-free grammar can be transformed into one in Chomsky normal form

Binarization

Only rules with two non-terminals

Idea:

Introduce additional non-terminal

Replace one rule with three non-terminals by two rules with two non-terminals each

Example

Remove unaries

Remove intermediate rules

Problems

Very strong independence assumption

Label is bottleneck

Example

Grammar

Analyse the sentence “she eats a fish with a fork” with the CYK algorithm:

result:
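Since the grammar figure is not reproduced in the text, the sketch below assumes a small grammar in Chomsky normal form that covers the sentence (`S -> NP VP`, `VP -> VP PP | V NP | eats`, `PP -> P NP`, `NP -> Det N | she`, plus lexical rules). It is a plain CYK recognizer without probabilities; a PCFG version would store the best score per non-terminal in each cell instead of a set.

```python
from itertools import product

# Assumed CNF grammar: binary rules (B, C) -> {A} and lexical rules word -> {A}.
BINARY = {
    ("NP", "VP"): {"S"},
    ("VP", "PP"): {"VP"},
    ("V", "NP"): {"VP"},
    ("P", "NP"): {"PP"},
    ("Det", "N"): {"NP"},
}
LEXICAL = {
    "she": {"NP"}, "eats": {"V", "VP"}, "with": {"P"},
    "fish": {"N"}, "fork": {"N"}, "a": {"Det"},
}

def cyk(words):
    """Fill table[i][j] with all non-terminals that derive words[i..j]."""
    n = len(words)
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = set(LEXICAL.get(word, ()))
    for span in range(2, n + 1):               # length of the sub-phrase
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):              # split point
                for b, c in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= BINARY.get((b, c), set())
    return table

sentence = "she eats a fish with a fork".split()
table = cyk(sentence)
is_sentence = "S" in table[0][len(sentence) - 1]
```

Each cell is computed once from smaller cells, which is the "avoid repeated work" idea; the three nested loops over span, start and split point give cubic time in the sentence length.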

Transition-based Dependency Parsing

Model Dependency structure

Predict transition sequence: Transition between configuration

Arc-standard System

Configuration

Stack

Buffer

Set of Dependency Arcs

Initial configuration: [Root], $w_1,\dots, w_n$, {}

All words are in the buffer

The stack contains only the root

The dependency graph is empty

Terminal configuration

The buffer is empty

The stack contains a single element (the root)

Example

Transitions

Left-arc
$$ ([\sigma|i| j], B, A) \Rightarrow([\sigma \mid j], B, A \cup\{(j, l, i)\}) $$

Add dependency between top and second top element of the stack with label l to the arcs

Remove second top element from the stack

Right-arc
$$ ([\sigma|i| j], B, A) \Rightarrow([\sigma \mid i], B, A \cup\{(i, l, j)\}) $$

Add dependency between second top and top element of the stack with label l to the arcs

Remove top element from the stack

Shift: Move first element of the buffer to the stack
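The three transitions can be written as pure functions on a configuration (stack, buffer, arcs). This is a minimal sketch: the sentence, the transition sequence and the labels (`nsubj`, `obj`, `root`) are made up for illustration, and no preconditions (e.g. not attaching the root as a dependent) are checked.

```python
def shift(stack, buffer, arcs):
    """Move the first buffer element onto the stack."""
    return stack + [buffer[0]], buffer[1:], arcs

def left_arc(stack, buffer, arcs, label):
    """Top j becomes head of second-top i; i is removed from the stack."""
    *rest, i, j = stack
    return rest + [j], buffer, arcs | {(j, label, i)}

def right_arc(stack, buffer, arcs, label):
    """Second-top i becomes head of top j; j is removed from the stack."""
    *rest, i, j = stack
    return rest + [i], buffer, arcs | {(i, label, j)}

# Initial configuration: only the root on the stack, all words in the buffer.
config = (["ROOT"], ["I", "ate", "fish"], set())
config = shift(*config)
config = shift(*config)
config = left_arc(*config, "nsubj")   # ate -> I
config = shift(*config)
config = right_arc(*config, "obj")    # ate -> fish
config = right_arc(*config, "root")   # ROOT -> ate
stack, buffer, arcs = config
```

A sentence with $n$ words needs exactly $n$ shifts and $n$ arc transitions, so a parse is a sequence of $2n$ decisions; the parser's job is to predict the next transition from the current configuration.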

Example

Initial configuration

Shift

Shift

Left arc

Shift

Shift

Left arc

Shift

Right arc

Right arc

Problems

Sparsity

Incompleteness

Expensive computation

Neural Network-based prediction

Feed forward neural network to predict operation

Input

Set of words $S^w$, POS tags $S^t$ and labels $S^l$

Fixed number

Map to continuous space

Output

Operation

$2N_l + 1$ classes: left-arc and right-arc for each of the $N_l$ dependency labels, plus shift

Example structure

Evaluation

Label precision/recall

Describe tree as a set of triples (Label, start, end)

Calculate precision/recall/f-score of reference and hypothesis

Summarization

Wed, 16 Sep 2020 00:00:00 +0000

TL;DR

Text summarization

Most important technique

Extraction

Tasks:

Key word extraction

Sentence extraction

Algorithms:

Supervised

Unsupervised

Abstract summarization still an open problem

Introduction

What is Summarization?

Reduce natural language text document

Goal: Compress text by extracting the most important/relevant parts 💪

Applications

Articles, news: Outlines or abstracts

Email / Email threads

Health information

Meeting summarization

Dimensions

Single vs. multiple

Single-document summarization

Given single document

Produce abstract, outline, headline, etc.

Multiple-document summarization

Given a group of documents

A series of news stories on the same event

A set of web pages about some topic or question

Generic vs. Query-focused summarization

Generic summarization: summarize the content of the document

Query-focused summarization: kind of complex question answering 🤪

Summarize a document with respect to an information need expressed in a user query

Longer, descriptive, more informative answers

Answer a question by summarizing a document that has information to construct the answer

Techniques

Extraction

Select subset of existing text segments

e.g.:

Sentence extraction

Key-phrase extraction

Simpler, most focus in research

Abstraction

Use natural language generation to create summary

More human like

Extractive summarization

Three main components

Content selection ("Which parts are important to be in the summary?")

Information ordering ("How to order summaries?")

Sentence realization (Clean up/Simplify sentences)

Supervised approaches

Key-word extraction

Given: Text (e.g. abstract of an article, …)

Task: Find most important key phrases

| Computer | Human |
|---|---|
| Select key phrases from the text | Abstraction of the text |
| No new wordings | New words |

Key-phrase extraction using Supervised approaches

Given: Collection of text with key-words

Algorithm

Extract all uni-grams, bi-grams and tri-grams

Example

Extraction: Compatibility, Compatibility of, Compatibility of systems, of, of systems, of systems of, systems, systems of, systems of linear, of linear, of linear constraints, linear, linear constraints, linear constraints over, …

Annotate each example with features

Annotate training examples with class:

1 if sequence is part of the key words

0 if sequence is not part of the key words

Train classifier

Create test examples and classify
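The first step of the algorithm, candidate generation, can be sketched as follows. The function name and the whitespace tokenization are assumptions; filtering punctuation or restricting to POS patterns would be added on top.

```python
def candidate_phrases(tokens, max_n=3):
    """List all uni-, bi- and tri-grams, position by position."""
    candidates = []
    for i in range(len(tokens)):
        for n in range(1, max_n + 1):
            if i + n <= len(tokens):
                candidates.append(" ".join(tokens[i:i + n]))
    return candidates

tokens = "Compatibility of systems of linear constraints".split()
candidates = candidate_phrases(tokens)
```

Six tokens already yield 15 candidates, most of them useless ("of", "of systems of"), which illustrates why the example set is usually restricted before training the classifier.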

Examples set

All uni-, bi-, and trigrams (except punctuation)

restrict to certain POS sets

🔴 Problem:

Enough examples to generate all/most key phrases

Too many examples -> low performance of classifier

Features

Term frequency

TF-IDF (Term Frequency–Inverse Document Frequency)

Reflect importance of a word in a document
$$ \text{TF-IDF} = tf * idf $$

$tf(w, D)$:

Term Frequency, measures how frequently a term occurs in a document.

Number of occurrences of word $w$ in document $D$ divided by the maximum frequency of any word in $D$

$$ t f(w, D)=\frac{\\#(w, D)}{\max\_{w^{\prime} \in D} \\#\left(w^{\prime}, D\right)} $$

Alternative definition:
$$ tf(w, D) = \frac{\text{count of } w \text{ in } D}{\text{number of words in } D} $$

$idf(w)$:

Inverse Document Frequency, measures how important a term is

Idea: Words which occur in fewer documents are more important

Logarithm of the number of documents divided by the number of documents which contain $w$

$$ i d f(w)=\log \frac{|D|}{|\\{d \in D: w \in d\\}|} $$

Example:

Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4. Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.

Length of the example

Relative position of the first occurrence

Boolean syntactic features

contains all caps
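The TF-IDF example for "cat" above can be reproduced with a few lines. This sketch uses the alternative tf definition (count divided by document length) and a base-10 logarithm, because that is what the worked example uses; the function names are illustrative.

```python
import math

def tf(count, doc_length):
    """Term frequency: occurrences of the term divided by the document length."""
    return count / doc_length

def idf(n_docs, n_docs_with_term):
    """Inverse document frequency with a base-10 logarithm."""
    return math.log10(n_docs / n_docs_with_term)

tf_cat = tf(3, 100)                      # 3 occurrences in a 100-word document
idf_cat = idf(10_000_000, 1_000)         # term appears in 1,000 of 10 million docs
tfidf_cat = tf_cat * idf_cat
```

The base of the logarithm only rescales all idf values by a constant, so it does not change the ranking of terms.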

Learning algorithm

Decision trees,

Naive Bayes classifier

…

Evaluation

Compare results to reference

Test set: Text

Human generated Key words

Metrics:

Precision

Recall

F-Score

🔴 Problems

Humans do not only extract key words, but also abstract

Normally not all key words are reachable

Sentence extraction

Use statistical heuristics to select sentences

Do not change content and meaning

💡 Idea

Use measure to determine importance of sentence

TF-IDF

Supervised trained combination of several features

Rank sentence according to metric

Output sentences with highest scores:

Fixed number

All sentences above a threshold

Limitations: Do NOT change text (e.g. add phrases, delete parts of the text)

Evaluation

Idea: Compare automatic summary to abstract of text

Problem: Different sentences –> Nearly no exact match 😭

ROUGE - Recall-Oriented Understudy for Gisting Evaluation

Use also approximate matches

Compare automatic summary to human generated text

Given a document D and an automatic summary X

M humans produce a set of reference summaries of D

What percentage of the n-grams from the reference summaries appear in X?

ROUGE-N: Overlap of N-grams between the system and reference summaries
$$ \text{ROUGE-N} = \frac{\sum\_{S \in \\{\text{Reference Summaries}\\}} \sum\_{gram\_n \in S}\operatorname{Count}\_{match}(gram\_n)}{\sum\_{S \in \\{\text{Reference Summaries}\\}} \sum\_{gram\_n \in S}\operatorname{Count}(gram\_n)} $$

Example:

Auto-generated summary ($Y$)

the cat was found under the bed
Gold standard (human produced) ($X1$)

the cat was under the bed
1-gram and 2-gram summary:

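| # | 1-gram | reference 1-gram | 2-gram | reference 2-gram |
|---|---|---|---|---|
| 1 | the | the | the cat | the cat |
| 2 | cat | cat | cat was | cat was |
| 3 | was | was | was found | was under |
| 4 | found | under | found under | under the |
| 5 | under | the | under the | the bed |
| 6 | the | bed | the bed | |
| 7 | bed | | | |
| count | 7 | 6 | 6 | 5 |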

$\operatorname{ROUGE}-1(X1, Y) = \frac{6}{6} = 1$

$\operatorname{ROUGE}-2(X1, Y) = \frac{4}{5}$
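The two scores above can be checked with a short sketch of ROUGE-N. It assumes whitespace tokenization and clips each n-gram match at the candidate's count, which is the usual reading of $\operatorname{Count}\_{match}$; the function names are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(references, candidate, n):
    """Recall of reference n-grams that also occur in the candidate summary."""
    cand_counts = Counter(ngrams(candidate, n))
    matched = total = 0
    for reference in references:
        ref_counts = Counter(ngrams(reference, n))
        total += sum(ref_counts.values())
        matched += sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
    return matched / total if total else 0.0

candidate = "the cat was found under the bed".split()
reference = "the cat was under the bed".split()

rouge_1 = rouge_n([reference], candidate, 1)
rouge_2 = rouge_n([reference], candidate, 2)
```

Because the denominator counts reference n-grams only, the measure is recall-oriented: adding extra words to the automatic summary (here "found") never lowers ROUGE-1.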

Unsupervised approaches

Problems of supervised approaches: Hard to acquire training data

We try to use unsupervised learning to find key phrases / sentences which are most important. But which sentences are most important?

💡 Idea: Sentences which are most similar to the other sentences in the text

Graph-based approaches

Map text into a graph

Nodes:

Text segments: Words

Edges: Similarity

Find most important/central vertices

Algorithms: TextRank / LexRank

Graph-based approaches : Key-phrase extraction

Graph

Nodes

Text segments: Words

Restriction:

Nouns

Adjective

Edges

items co-occur in a window of N words

Calculate most important nodes

Build Multi-word expression in post-processing

Mark selected items in original text

If two adjacent words are marked –> Collapse to one multi-words expression

Example

Graph-based approaches : Sentence extraction

Graph:

Nodes: Sentences

Edges: Fully connected with weights

Weights:

TextRank: Word overlap normalized to sentence length
$$ \text {Similarity}\left(S\_{i}, S\_{j}\right)=\frac{\left|\left\\{w\_{k} \mid w\_{k} \in S\_{i} \text{ and } w\_{k} \in S\_{j}\right\\}\right|}{\log \left(\left|S\_{i}\right|\right)+\log \left(\left|S\_{j}\right|\right)} $$

LexRank: Cosine Similarity of TF-IDF vectors
$$ \text { idf-modified-cosine }(x, y)=\frac{\sum\_{w \in x, y} \mathrm{tf}\_{w, x} \mathrm{tf}\_{w, y}\left(\mathrm{idf}\_{w}\right)^{2}}{\sqrt{\sum\_{x\_{i} \in x}\left(\mathrm{tf}\_{x\_{i}, x} \mathrm{idf}\_{x\_{i}}\right)^{2}} \times \sqrt{\sum\_{y\_{i} \in y}\left(\mathrm{tf}\_{y\_{i}, y} \mathrm{idf}\_{y\_{i}}\right)^{2}}} $$
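The TextRank edge weight can be sketched directly from its formula. Whitespace tokenization, taking $|S|$ as the number of tokens, and returning 0 when the denominator vanishes (two one-word sentences) are assumptions of this sketch.

```python
import math

def textrank_similarity(s_i, s_j):
    """Number of shared words, normalized by the log sentence lengths."""
    if not s_i or not s_j:
        return 0.0
    overlap = len(set(s_i) & set(s_j))
    denom = math.log(len(s_i)) + math.log(len(s_j))
    return overlap / denom if denom > 0 else 0.0

s1 = "the cat sat on the mat".split()
s2 = "the dog sat on the rug".split()
s3 = "stock prices fell sharply".split()

sim_12 = textrank_similarity(s1, s2)   # shares "the", "sat", "on"
sim_13 = textrank_similarity(s1, s3)   # no shared words
```

These weights fill the fully connected sentence graph; a PageRank-style iteration over that graph (not shown) then yields the centrality score used to rank sentences.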

Abstract summarization

Sequence to Sequence task

Input: Document

Output: Summary

Several NLP tasks can be modeled like this (ASR, MT,…)

Successful deep learning approach: Encoder-Decoder Model

Sequence-to-Sequence Model

Predict words based on

previous target words and

source sentence

Encoder: Read in source sentence

Decoder: Generate target sentence word by word

Encoder

Read in input: Represent content as hidden vector with fixed dimension

LSTM-based model

Fixed-size sentence representation

Details:

One–hot encoding

Word embedding

RNN layer(s)

Decoder

Generate output: Use output of encoder as input

LSTM-based model

Input last target word

Attention-based Encoder-Decoder

Attention-based Encoder : copy mechanism

Calculate probability “better to generate one word from vocabulary than to copy a word from source sentence“
$$ p\_{g e n}=\sigma\left(w\_{c}^{T} c\_{t}+w\_{s}^{T} s\_{t}+w\_{x}^{T} x\_{t}+b\_{p t r}\right) $$
Word with the highest probability should be the output word
$$ P(w)=p\_{g e n} P\_{v o c a b}(w)+\left(1-p\_{g e n}\right) \sum\_{j: w\_{j}=w} \alpha\_{i j} $$
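The mixture in the last formula can be illustrated with plain dictionaries. All numbers below are made up; in a real model $p\_{gen}$, $P\_{vocab}$ and the attention weights $\alpha$ come from the decoder at each step.

```python
def final_distribution(p_gen, p_vocab, source_tokens, attention):
    """Mix the vocabulary distribution with attention mass copied from the source."""
    dist = {word: p_gen * p for word, p in p_vocab.items()}
    for token, alpha in zip(source_tokens, attention):
        # Every source position that holds `token` contributes its attention weight.
        dist[token] = dist.get(token, 0.0) + (1 - p_gen) * alpha
    return dist

p_vocab = {"the": 0.5, "cat": 0.3, "sat": 0.2}      # toy decoder output
source_tokens = ["cat", "zebra", "cat"]             # "zebra" is out of vocabulary
attention = [0.5, 0.3, 0.2]                         # toy attention weights

dist = final_distribution(0.6, p_vocab, source_tokens, attention)
best_word = max(dist, key=dist.get)
```

The out-of-vocabulary word "zebra" receives probability only through the copy term, which is the main benefit of the mechanism for names and rare words in summaries.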
Data

Training data

Documents and summary

DUC data set

News article

Around 14 word summary

Gigaword

News articles

Headline generation

CNN/Daily Mail corpus

Article

Predict bullet points

Question Answering

Fri, 18 Sep 2020 00:00:00 +0000

Definition

Question Answering

Automatically answer questions posed by humans in natural language

Give user short answer to their question

Gather and consult necessary information

Related topics

Information Retrieval

Reading Comprehension

Database Access

Dialog

Text Summarization

Problem Dimensions

Questions

Question class

Almost universally factoid questions E.g.: “What does the Peugeot company manufacture?”

More open in dialog context

Question domain

Topic of the content

Open-Domain: Any topic

Closed-Domain: Specific topic, e.g. movies, sports, etc

Context

How much context is provided?

Is search necessary?

Answer types

Factual Answers

Opinion

Summary

Kind of questions

Yes/No

“wh”-questions

Indirect requests (I would like to…)

Commands

Applications

Knowledge source types

Structured data (database)

Semi-structured data (e.g. Wikipedia tables)

Free text (e.g. Wikipedia text)

Knowledge source origins

Search over the web

Search of a collection

Single text

Domain

Domain-independent

Domain-specific system

Users

First time/casual users

Explain limitations

Power users

Emphasize novel information

Omit previously provided information

Answers

Long

Short

Lists

Narrative

Creation

Extraction

Generation

Evaluation

What is a good answer?

Should the answer be short or long?

Easier to have the answer in longer segments

Less concise, more comprehensive

Presentation

Underspecified question

Feedback

Too many documents

Text or speech input

Examples

TREC

SQuAD (Stanford Question Answering Dataset)

IBM Watson

Motivation

Vast amounts of information written by humans for humans

Computers are good at searching vast amounts of information

Natural interaction with computers 💪

System Approaches

Text-based system

Use information retrieval to search for matching documents

Knowledge-based approaches

Build semantic representation of the query

Retrieve answer from semantic databases (Ontologies)

Knowledge-rich / hybrid approaches

Combine both

QA System Overview

Components

Information Retrieval

Need to find good text segments

Answer Extraction

Given some context and the question, produce an answer

Either part may be supplemented by other NLP tools

Common Components

Preprocessing

Question Analysis

Input: Natural language question

Implicit input

Dialog state

User information

Derived inputs

POS-tags, NER, dependency graph, syntax tree, etc.

Output: Representation for Information Retrieval and Answer Extraction

For IR: Weighted vector or search term collection

For answer extraction

Lexical answer type (person/company/acronym/…)

Additional constraints (e.g. relations)

Answer Type Classification

Classical approach: Question word (who, what, where,…)

When: date

Who: person

Where: location

Examples

Regular expressions

Who {is | was | are | were } – Person

Question head word (First noun phrase after the question word)

Which city in China has the largest number of foreign financial companies?

What is the state flower of California?
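The classical question-word approach can be sketched with a small rule table. The patterns and type names below are illustrative assumptions, not a complete rule set.

```python
import re

# Hypothetical rules: (pattern on the question start, expected answer type).
RULES = [
    (re.compile(r"^who\s+(is|was|are|were)\b", re.IGNORECASE), "PERSON"),
    (re.compile(r"^when\b", re.IGNORECASE), "DATE"),
    (re.compile(r"^where\b", re.IGNORECASE), "LOCATION"),
]

def answer_type(question):
    """Return the answer type of the first matching rule, or None."""
    text = question.strip()
    for pattern, label in RULES:
        if pattern.search(text):
            return label
    return None

types = [
    answer_type("Who was the first president of the United States?"),
    answer_type("When did the Berlin Wall fall?"),
    answer_type("What is the state flower of California?"),
]
```

"What" and "Which" questions fall through to `None`, so they need the head-word heuristic or a learned classifier instead.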

🔴 Problems

“Who” questions could refer to e.g. companies

E.g. “Who makes the Beetle?”

Which / What is not clear

E.g. “What was the Beatles’ first hit single?”

Approaches

Manually created question type hierarchy

Machine learning classification

(Current ML systems often do NOT use Answer Type Classification 😂)

Constraints

Keyword extraction

Expand keywords using synonyms

Statistical parsing

Identify semantic constraints

Example

Represent a question as bag-of-words

“What was the monetary value of the Nobel Peace Prize in 1989?”

monetary, value, Nobel, Peace, Prize, 1989

“What does the Peugeot company manufacture?”

Peugeot, company, manufacture

“How much did Mercury spend on advertising in 1993?”

Mercury, spend, advertising, 1993
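The bag-of-words representations above correspond to tokenizing the question and dropping function words. The stop-word list here is a tiny hand-picked assumption that happens to cover these examples; real systems use a longer list or term weighting.

```python
import re

# Hypothetical minimal stop-word list for the example questions.
STOPWORDS = {"what", "was", "the", "of", "in", "does", "how", "much", "did", "on"}

def keywords(question):
    """Keep the content words of a question in their original order."""
    return [t for t in re.findall(r"\w+", question) if t.lower() not in STOPWORDS]

kw_peugeot = keywords("What does the Peugeot company manufacture?")
kw_mercury = keywords("How much did Mercury spend on advertising in 1993?")
```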

Retrieval: Candidate Document Selection

Most common approach:

Conventional Information Retrieval search

Using search indices

Lucene

TF-IDF

Several stages: Coarse-to-fine search

Result: Small set of documents for detailed analysis

Decisions: Boolean vs. rank-based engines

Retrieve only part of the document

Mostly only part of the document is important

Passage retrieval

Return only subsets of the document

Segment document into coherent text segments

Combine results from multiple search engines

Text-based system

Use only syntactic information such as n-grams

Example: TF-IDF (Term Frequency, Inverse Document frequency)

Weighted bag-of-words vector

One component per word in vocabulary

Term frequency: Number of times term appears in the document

Document frequency: Number of documents the term appears in

$$ \begin{array}{l} T F^{\prime}(d, t)=\log (1+T F(d, t)) \\\\ I D F(t)=\log \frac{n_{d}-D F(t)}{D F(t)} \\\\ T F I D F(d, t)=T F^{\prime}(d, t) I D F(t) \end{array} $$

Knowledge-based / semantic-based system

Build semantic representation by extracting information from the question

Construct structured query for semantic database

Not raw or indexed text corpus

Examples

WordNet

Wikipedia Infoboxes

FreeBase

Candidate Document Analysis

Named entity tagging

Often including subclasses (towns, cities, provinces, …)

Sentence splitting, tagging, chunk parsing

Identify multi-word terms and their variants

Represent relation constraints of the text

Answer Extraction

Input

Representations for candidate text segments and question

Rank set of candidate sentences

Expected answer type(s)

Find answer strings that match the answer type(s) based on documents

Extractive: Answers are substrings in the documents

Generative: Answers are free text (NLG)

Rank the candidate answers

E.g. overlap between answer and question

Return result(s) with best overall score

Example

Response Generation

Rephrase text segment

E.g. resolve anaphors

Provide longer or shorter answer

Add some part of context into the answer

If answer is too complex

Truncate answer

Start dialog

Neural Network Approach

Neural models struggle with Information Retrieval 🤪

Excellent results on answer extraction 😍

Given: Question and Context (document, paragraph, nugget, etc.)

Result: Answer as substring from context

Predict most likely start and end index as classification task

Combines:

Question Analysis

Retrieved Document Analysis

Answer Extraction

Response Generation

Neural Answer Extraction

Encoder-decoder model

Encoder

Answer prediction

Softmax output $i$ is probability that answer starts at token $i$

Mirrored setup for end probability
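Given the two softmax outputs, the answer span still has to be chosen. A common decoding choice, assumed here and not stated in the notes, is to take the pair with the highest product of start and end probability under the constraint start ≤ end, optionally with a maximum span length.

```python
def best_span(start_probs, end_probs, max_len=None):
    """Return ((start, end), score) maximizing p_start * p_end with start <= end."""
    best, best_score = None, -1.0
    for i, p_start in enumerate(start_probs):
        last = len(end_probs) if max_len is None else min(len(end_probs), i + max_len)
        for j in range(i, last):
            score = p_start * end_probs[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score

# Toy probabilities: the independent argmax would give start=1, end=0 (invalid).
start_probs = [0.1, 0.7, 0.2]
end_probs = [0.6, 0.1, 0.3]

span, score = best_span(start_probs, end_probs)
short_span, _ = best_span(start_probs, end_probs, max_len=1)
```

Picking the start and end independently can produce an end before the start, which is why the joint search over valid pairs is used in this sketch.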

🔴 Problem: Relying on single vector for question encoding

Long range dependencies

Feedback at end of sequence

Vanishing gradients

Solution: Use MORE information from the question

–> Attention mechanism

Calculates weighted sum of question encodings

Weight is based on similarity between question encoding and context encoding

Different similarity metrics
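The weighted sum described above can be sketched as follows, assuming dot-product similarity (only one of the possible similarity metrics) and tiny hand-written vectors in place of real encoder outputs. The name `attend` is a hypothetical helper.

```python
import math


def attend(context_vec, question_vecs):
    """Dot-product attention: weighted sum of question encodings for one context token."""
    # Similarity between the context encoding and every question encoding.
    scores = [sum(c * q for c, q in zip(context_vec, qv)) for qv in question_vecs]
    # Softmax turns the similarities into weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the question encodings.
    dim = len(question_vecs[0])
    summary = [sum(w * qv[k] for w, qv in zip(weights, question_vecs)) for k in range(dim)]
    return weights, summary


question_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up question encodings
weights, summary = attend([2.0, 0.0], question_vecs)
```

Question tokens whose encoding is more similar to the context encoding receive a larger weight, so the model no longer depends on a single question vector.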

Review of models see:

Natural/Spoken Language Understanding

Fri, 18 Sep 2020 00:00:00 +0000

Definition

Natural language understanding

Representing the semantics of natural language

Possible view: Translation from natural language to representation of meaning

Difficulties

Ambiguities

Lexical

Syntax

Referential

Vagueness

E.g., “I had a late lunch.”

Dimensions

Depth: Shallow vs Deep

Domain: Narrow vs Open

Examples

Siri (2011)

Dialog Modeling

Dialog system / Conversational agent

Computer system that converses with a human

Coherent structure

Different modalities:

Text, speech, graphics, haptics, gestures

Components

Input recognition

Different modalities

Automatic speech recognition (ASR)

Gesture recognition

Transformation

Input modality (e.g. speech) –> text

May introduce first errors

High influence on the performance of a dialog system

Natural language understanding (NLU)

Semantic interpretation of written text

Transformation from natural language to semantic representation

Representations:

Deep vs Shallow

Domain-dependent vs. domain independent

Dialog manager (DM)

Manage flow of conversation

Input: Semantic representation of the input

Output: Semantic representation of the output

Utilize additional knowledge

User information

Dialog History

Task-specific information

Natural language generation (NLG)

Generate natural language from semantic representation

Input: Semantic output representation of the dialog manager

Output: Natural language text for the user

Output rendering

Generate correct output

e.g. Text-to-Speech (TTS) for Spoken Dialog Systems

Natural Language understanding

Approaches

Output representation

Relation instances

(Larry Page, founder, Google)

Logical forms

Love(Mary, John)

Scalar

Positive/Negative 0.9

Vector

Hidden representation/ Word embeddings

Algorithms

Rule-based / Template

Machine learning

Conditional random fields

Support Vector Machine

Neural Networks / Deep learning

Semantic Parsing

Parse natural language sentence into semantic representation

Machine learning approaches most successful 👏

Most common approach:

Shallow Semantic Parsing / Semantic Role Labeling

Most important resources:

PropBank

FrameNet

PropBank

Proposition Bank (PropBank)

Labels for all sentences in the English Penn TreeBank

Defines semantics based on the verbs of the sentence

Verbs: Define different senses of the verbs

Sense: Number of Arguments important to this sense (Often only numbers)

Arg0: Proto-Agent

Arg1: Proto-Patient

Arg2: mostly benefactive, instrument, attribute, or end state

Arg3: start point, benefactive, instrument, or attribute

Example: “agree”

Example: “fall”

PropBank ArgM

TMP : when? yesterday evening, now

LOC : where? at the museum, in San Francisco

DIR : where to/from? down, to Bangkok

MNR : how? clearly, with much enthusiasm

PRP/CAU : why? because … , in response to the ruling

REC : themselves, each other

ADV : miscellaneous

PRD : secondary predication …ate the meat raw

🔴 Problem

Different words, Predicate expressed by noun

Example

More see: SRL数据集(1): Proposition Bank 数据集介绍

FrameNet

Roles based on Frames

Frame: holistic background knowledge that unites these words

Frame-elements: Frame-specific semantic roles

Example 1

Example 2

Semantic Role labeling

Task: Automatically finding semantic roles for each argument of each predicate

Approach: Machine learning

High level algorithm:

Parse sentence (Syntax tree)

Find Predicates

For every node in tree: Decide semantic role

Spoken Language understanding

Natural language processing for spoken input

Difficulties

Less grammatical speech

Partial Sentences

Disfluencies (Self correction, hesitations, repetitions)

Robust to noise

ASR errors

Techniques:

Confidence

Multiple hypothesis

No Structure information

Punctuation

Text segmentation

Approach

Transform text into task-specific semantic representation of the user’s intention

Subtasks

Domain detection

Intention determination

Slot filling

Domain Detection

Motivated by Call Centers

Many agents with specialization on one topic (Billing inquiries, technical support requests, sales inquiries, etc.)

First techniques: Menus to find appropriate agent

Automatic task:

Given the utterance, find the correct agent

Utterance classification task

Input: Utterance

Output: Topic

Intention determination

Domain-dependent utterance classes

e.g. Find_Flight

Task: Assign class to Utterance

Use similar technique

Slot filling

Sequence labeling task: Assign semantic class label to every word and history

History: previous words and labels

Example:

Success of deep learning in other approaches:

RNN-based approach

Find most probable label given word and history

Dialog Management

Sun, 20 Sep 2020 00:00:00 +0000

Dialog Modeling

Dialog manager

Manage flow of conversation

Input: Semantic representation of the input

Output: Semantic representation of the output

Utilize additional knowledge

User information

Dialog History

Task-specific information

🔴 Challenges

Consisting of many different components

Each component has errors

More components –> less robust

Should be modular

Need to find unambiguous representation

Hard to train from data

Dialog Types

Goal-oriented Dialog

Follows a fixed (set of) goals

Ticket vending machines

Restaurant reservation

Car SDS

Aim: Reach goal as fast as possible

Main focus of SDS research

Social Dialog

Social Dialog / Conversational Bots / Chit-Chat Setting

Most human

Small talk conversation

Aims:

Generate interesting, coherent, meaningful responses

Carry on as long as possible

Be a companion

Dialog Systems

Initiative

System Initiative

Command & control

Example (U: User, S: System)

Mixed Initiative

Most natural

Example

User Initiative

User most powerful

Error-prone

Example

Confirmation

Explicit verification

Implicit verification

Alternative verification

Development

Rule-based

Create management by templates/rules

Statistical

Train model to predict answer given input

POMDP

End-to-End Neural Models

No separation into NLU/DM/NLG

Components

Dialog Model: contains information about

whether system, user or mixed initiative?

whether explicit or implicit confirmation?

what kind of speech acts needed?

User Model: contains the system’s beliefs about

what the user knows

the user’s expertise, experience and ability to understand the system’s utterances

Knowledge Base: contains information about

the world and the domain

Discourse Context: contains information about

the dialog history and the current discourse

Reference Resolver

performs reference resolution and handles ellipsis

Plan Recognizer and Grounding Module

interprets the user’s utterance given the current context

reasons about the user’s goals and beliefs

Domain Reasoner/Planner

generates plans to achieve the shared goals

Discourse Manager

manages all information of dialog flow

Error Handling

errors or misunderstandings detection and recovery

Rule-based Systems

Finite State-based

💡 Idea: Iterate through states that define actions

Dialog flow:

specified as a set of dialog states (stages)

transitions denoting various alternative paths through the dialog graph

Nodes = dialogue states (prompts)

Arcs = actions based on the recognized response

Example

👍 Advantages

Simple to construct due to simple dialog control

The required vocabulary and grammar for each state can be specified in advance

Results in more constrained ASR and SLU

👎 Disadvantages

Restrict the user’s input to predetermined words/phrases

Makes the correction of misrecognized items difficult

Inhibits the user’s opportunity to take the initiative and ask questions or introduce new topics
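A minimal sketch of a finite state-based dialog flow, assuming a hypothetical ticket-booking graph. The state names, prompts and vocabulary are invented for illustration; nodes carry the prompts and arcs are keyed by the recognized response.

```python
# Each state has a prompt and a transition table keyed by the recognized response.
STATES = {
    "ask_destination": {"prompt": "Where do you want to go?",
                        "next": {"berlin": "ask_confirm", "munich": "ask_confirm"}},
    "ask_confirm": {"prompt": "Shall I book the ticket?",
                    "next": {"yes": "done", "no": "ask_destination"}},
    "done": {"prompt": "Ticket booked.", "next": {}},
}


def run_dialog(user_inputs, start="ask_destination"):
    """Walk through the dialog graph; unrecognized input repeats the current state."""
    state, visited = start, [start]
    for utterance in user_inputs:
        transitions = STATES[state]["next"]
        # Input outside the state's vocabulary keeps the system in the same state.
        state = transitions.get(utterance.lower(), state)
        visited.append(state)
        if not STATES[state]["next"]:  # terminal state reached
            break
    return visited


path = run_dialog(["Paris", "Berlin", "yes"])
```

The out-of-vocabulary input "Paris" shows the main restriction: the user can only move forward with words the current state expects.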

Frame-based

💡 Idea: Fill slots in a frame that defines the goal

Dialog flow:

is NOT predetermined, but depends on

the contents of the user’s input

the information that the system has to elicit

Example

Eg1

Eg2

Slot(/Form/Template) filling

One slot per piece of information

Takes a particular action based on the current state of affairs

Questions and other prompts

List of possibilities

conditions that have to be true for that particular question or prompt

👍 Advantages

User can provide over-informative answers

Allows more natural dialogues

👎 Disadvantages

Cannot handle complex dialogues
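A minimal sketch of frame-based slot filling, assuming a hypothetical flight-booking frame and an SLU component that already returns slot-value pairs. The slot names, prompts and helper functions are illustrative assumptions.

```python
# Hypothetical frame for a flight-booking task: one slot per piece of information.
PROMPTS = {
    "origin": "Where are you leaving from?",
    "destination": "Where are you flying to?",
    "date": "On which day?",
}


def next_prompt(frame):
    """Ask for the first slot that is still empty; None means the frame is complete."""
    for slot, prompt in PROMPTS.items():
        if frame.get(slot) is None:
            return prompt
    return None


def update_frame(frame, parsed_slots):
    """Merge whatever slots the (assumed) SLU component extracted from the user input."""
    filled = dict(frame)
    filled.update({k: v for k, v in parsed_slots.items() if k in PROMPTS})
    return filled


frame = {"origin": None, "destination": None, "date": None}
# An over-informative answer fills two slots at once, so the system skips a question.
frame = update_frame(frame, {"origin": "Karlsruhe", "destination": "Berlin"})
prompt = next_prompt(frame)
```

The dialog flow is not fixed in advance: which question comes next depends only on which slots are still empty.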

Agent-based

💡 Idea:

Communication viewed as interaction between two agents

Each capable of reasoning about its own actions and beliefs

also about other’s actions and beliefs

Use of “contexts”

Example

Allow complex communication between the system, the user and the underlying application to solve some problem/task

Many variants, depending on which aspects of intelligent behavior are included

Tends to be mixed-initiative

User can control the dialog, introduce new topics, or make contribution

👍 Advantages

Allow natural dialogue in complex domains

👎 Disadvantages

Such agents are usually very complex

Hard to build 😢

Limitations of Rule-based DM

Expensive to build (manual work)

Fragile to ASR errors

No self-improvement over time

Statistical DM

Motivation

User intention can ONLY be imperfectly known

Incompleteness – user may not specify full intention initially

Noisiness – errors from ASR/SLU

Automatic learning of dialog strategies

Rule-based design is time-consuming

👍 Advantages

Maintain a distribution over multiple hypotheses for the correct dialog state

Not a single hypothesis for the dialog state

Choose actions through an automatic optimization process

Technology is not domain dependent

same technology can be applied to other domain by learning new domain data

Markov Decision Process (MDP)

A model for sequential decision making problems

Solved using dynamic programming and reinforcement learning

MDP based SDM: dialog evolves as a Markov process

Specified by a tuple $(S, A, T, R)$

$S$: a set of possible world states $s \in S$

$A$: a set of possible actions $a\in A$

$R$: a local real-valued reward function

$$ R: S \times A \to \mathbb{R} $$

$T$: a transition model
$$ T(s\_{t-1}, a\_{t-1}, s\_t) = P(s\_t | s\_{t-1}, a\_{t-1}) $$

🎯 Goal of MDP based SDM: Maximize its expected cumulative (discounted) reward
$$ E\left(\sum\_{t=0}^{\infty} \gamma^{t} R\left(s\_{t}, a\_{t}\right)\right) $$

Requires complete knowledge of $S$ !!!
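As a worked example of the discounted cumulative reward, the sketch below sums the discounted rewards of one finite dialog (a single sample of the expectation). The reward scheme (-1 per turn, +20 on success) and the discount factor 0.9 are assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t over one finite dialog."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Assumed reward scheme: -1 per turn, +20 when the task succeeds in the last turn.
short_dialog = [-1, -1, 20]
long_dialog = [-1, -1, -1, -1, 20]

g_short = discounted_return(short_dialog)
g_long = discounted_return(long_dialog)
```

With these numbers the shorter dialog yields the higher return, which is how a per-turn penalty and discounting push the dialog manager towards reaching the goal quickly.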

Reinforcement Learning

“Learning through trial-and-error” (reward/penalty)

🔴 Problem

No direct feedback

Only feedback at the end of dialog

🎯 Goal: Learn evaluation function from feedback

💡 Idea

Initially, all operations have equal probability

If the dialog was successful –> all operations are rated positive

If the dialog was unsuccessful –> all operations are rated negative

How RL works?

There is an agent with the capacity to act

Each action influences the agent’s future state

Success is measured by a scalar reward signal

In a nutshell:

Select actions to maximize future reward

Ideally, a single agent could learn to solve any task 💪

Sequential Decision Making

🎯 Goal: select actions to maximize total future reward

Actions may have long term consequences

Reward may be delayed

It may be better to sacrifice immediate reward to gain more long-term reward 🤔

Agent and Environment

At each step $t$

Agent:

Receives state $s\_t$

Receives scalar reward $r\_t$

Executes action $a\_t$

The environment:

Receives action $a\_t$

Emits next state $s\_{t+1}$

Emits scalar reward $r\_{t+1}$

The evolution of this process is called a Markov Decision Process (MDP)

Supervised Learning Vs. Reinforcement Learning

Supervised Learning:

Label is given: we can compute gradients given label and update our parameters

Reinforcement Learning

NO label given: instead we have feedback from the environment

Not an absolute label / error. We can compute gradients, but do not yet know if our action choice is good. 🤪

More see: Deep Reinforcement Learning: Pong from Pixels

Policy and Value Functions

Policy $\pi$ : selects an action given a state (in general a probability distribution over actions; written here in its deterministic form)
$$ a = \pi(s) $$

Value function $Q^\pi(s, a)$ : the expected total reward from state $s$ and action $a$ under policy $\pi$
$$ Q^{\pi}(s, a)=\mathbb{E}\left[r\_{t+1}+\gamma r\_{t+2}+\gamma^{2} r\_{t+3}+\cdots \mid s, a\right] $$

“How good is action $a$ in state $s$?”

Same reward for two actions, but different consequences down the road

Want to update our value function accordingly

Approaches to RL

Policy-based RL

Search directly for the optimal policy $\pi^\*$

(policy achieving maximum future reward)

Value-based RL

Estimate the optimal value function $Q^{∗}(s,a)$ (maximum value achievable under any policy)

Q-Learning: Learn Q-Function that approximates $Q^{∗}(s,a)$

Maximum reward when taking action $a$ in $s$

Policy: Select action with maximal $Q$ value

Algorithm:

Initialize $Q$ randomly

$Q(s, a) \leftarrow(1-\alpha) Q(s, a)+\alpha\left(r\_{t}+\gamma \cdot \underset{a'}{\max} Q\left(s\_{t+1}, a'\right)\right)$
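A minimal tabular sketch of the Q-learning update, assuming a toy dialog with hand-picked states, actions and rewards. For reproducibility the table is initialized with zeros rather than randomly; `q_update` and `greedy_action` are hypothetical helper names.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step following the update rule above."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
    return Q[(s, a)]


def greedy_action(Q, s, actions):
    """Policy: select the action with maximal Q value."""
    return max(actions, key=lambda a: Q[(s, a)])


actions = ["ask", "confirm"]
Q = defaultdict(float)  # assumption: zero initialization instead of random values

# Made-up experience: asking leads to a filled slot, confirming then earns reward 10.
q_update(Q, "start", "ask", r=0.0, s_next="filled", actions=actions)
q_update(Q, "filled", "confirm", r=10.0, s_next="end", actions=actions)
q_update(Q, "start", "ask", r=0.0, s_next="filled", actions=actions)
```

The third update shows how the reward received at the end of the dialog propagates back to the earlier state-action pair through the max term.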

Goal-oriented Dialogs: Statistical POMDP

POMDP : Partially Observable Markov Decision Process

MDP –> POMDP: the state $s$ cannot be observed directly

POMDP based SDM –> reinforcement learning + belief state tracking

dialog evolves as a Markov process $P(s\_t | s\_{t-1}, a\_{t-1})$

$s\_t$ is NOT directly observable

–> belief state $b(s\_t)$: prob. distribution of all states

SLU outputs a noisy observation $o\_t$ of the user input with prob. $P(o\_t|s\_t)$

Specified by tuple $(S, A, T, R, O, Z)$

$S, A, T, R$ constitute an MDP

$O$: a finite set of observations received from the environment

$Z$: the observation function s.t.
$$ Z(o\_t,s\_t,a\_{t-1}) = P(o\_t|s\_t,a\_{t-1}) $$

Local reward is the expected reward $\rho$ over belief states
$$ \rho(b, a)=\sum\_{s \in S} R(s, a) \cdot b(s) $$

Goal: maximize the expected cumulative reward.

Operation (at each time step)
World is in unobserved state $s\_t$

Maintain distribution over all possible states with $b\_t$
$$ b\_t(s\_t) = \text{Probability of being in state } s\_t $$

DM selects action $a\_t$ based on $b\_t$

Receive reward $r\_t$

Transition to unobserved state $s\_{t+1}$ ONLY depending on $s\_t$ and $a\_t$

Receive observation $o\_{t+1}$ ONLY depending on $a\_t$ and $s\_{t+1}$

Update of belief state
$$ b\_{t+1}\left(s\_{t+1}\right)=\eta P\left(o\_{t+1} \mid s\_{t+1}, a\_{t}\right) \sum\_{s\_{t}} P\left(s\_{t+1} \mid s\_{t}, a\_{t}\right) b\_{t}\left(s\_{t}\right) $$

Policy $\pi$:
$$ \pi(b) \in A $$

Value function:
$$ V^{\pi}\left(b\_{t}\right)=\mathbb{E}\left[r\_{t}+\gamma r\_{t+1}+\gamma^{2} r\_{t+2}+\ldots\right] $$
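The belief update and the expected reward over belief states can be sketched as below. The dictionaries `T`, `Z` and `R` stand in for the transition, observation and reward models; the two-state restaurant example and all its numbers are made up for illustration.

```python
def belief_update(belief, action, observation, T, Z):
    """b'(s') = eta * P(o | s', a) * sum_s P(s' | s, a) * b(s)."""
    unnormalized = {}
    for s_next in belief:
        predicted = sum(T[(s, action)].get(s_next, 0.0) * p for s, p in belief.items())
        unnormalized[s_next] = Z[(s_next, action)].get(observation, 0.0) * predicted
    # eta normalizes the result (assumes the observation has non-zero probability).
    eta = 1.0 / sum(unnormalized.values())
    return {s: eta * p for s, p in unnormalized.items()}


def expected_reward(belief, action, R):
    """rho(b, a) = sum_s R(s, a) * b(s)."""
    return sum(R[(s, action)] * p for s, p in belief.items())


# Toy model: the user wants either "pizza" or "sushi"; the goal does not change.
states = ["pizza", "sushi"]
T = {(s, "ask"): {s: 1.0} for s in states}
Z = {("pizza", "ask"): {"heard_pizza": 0.8, "heard_sushi": 0.2},
     ("sushi", "ask"): {"heard_pizza": 0.3, "heard_sushi": 0.7}}
R = {("pizza", "ask"): -1.0, ("sushi", "ask"): -1.0}

b0 = {"pizza": 0.5, "sushi": 0.5}
b1 = belief_update(b0, "ask", "heard_pizza", T, Z)
```

After one noisy observation the belief shifts towards "pizza" without discarding "sushi", which is what lets the dialog manager recover from ASR/SLU errors later.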

POMDP model

Two stochastic models

Dialogue model $M$

Transition and observation probability model

In what state is the dialogue at the moment

Policy Model $\mathcal{P}$

What is the best next action

Both models are optimized jointly

Maximize the expected accumulated sum of rewards

Online: Interaction with user

Offline: Training with corpus

Key ideas

Belief tracking

Represent uncertainty

Pursuing all possible dialogue paths in parallel

Reinforcement learning

Use machine learning to learn parameters

🔴 Challenges

Belief tracking

Policy learning

User simulation

Belief state

Information encoded in the state
$$ \begin{aligned} b\_{t+1}\left(g\_{t+1}, u\_{t+1}, h\_{t+1}\right)=& \eta P\left(o\_{t+1} \mid u\_{t+1}\right) \\\\ \cdot & P\left(u\_{t+1} \mid g\_{t+1}, a\_{t}\right) \\\\ \cdot & \sum_{g\_{t}} P\left(g\_{t+1} \mid g\_{t}, a\_{t}\right) \\\\ \cdot & \sum_{h\_{t}} P\left(h\_{t+1} \mid g\_{t+1}, u\_{t+1}, h\_{t}, a\_{t}\right) \\\\ \cdot & b\_{t}\left(g\_{t}, h\_{t}\right) \end{aligned} $$

User goal $g\_t$: Information from the user necessary to fulfill the task

User utterance $u\_t$

What was said

Not what was recognized

Dialogue history $h\_t$

Using independence assumptions

Observation model: Probability of observation $o$ given $u$

Reflect speech understanding errors

User model: Probability of the utterance given previous output and new state

Goal transition model

History model

Model still too complex 🤪

Solution

n-best approach

Factored approach

Combination is possible

Policy

Mapping between belief states and system actions

🎯 Goal: Find optimal policy $\pi^\*$

Problem: State and action space very large

But:

Only a small part of the belief space is visited

Plausible actions at every point very restricted

Summary space: Simplified representation

🔴 Disadvantages

Predefine structure of the dialog states

Location

Price range

Type of cuisine

Limited to very narrow domain

Cannot encode all features/slots that might be useful

Neural Dialog Models

End-to-End training

Optimize all parameters jointly

Continuous representations

No early decision

No propagation of errors

Challenges

Representation of history/context

Policy learning

Interactive learning

Integration of knowledge sources

Datasets

Goal oriented

bAbI task

Synthetic data – created by templates

DSTC (Dialog State tracking challenge)

Restaurant reservation

Collected using 3 dialog managers

Annotated with dialog states

Social dialog

Learn from human-human communication

Architecture

Memory Networks

Neural network model

Writing and reading from a memory component

Store dialog history

Learn to focus on important parts

Sequence-to-Sequence Models: Encoder-Decoder

Encoder

Read in Input

Represent content in a fixed-dimension hidden vector

LSTM-based model

Decoder

Generate Output

Use the fixed-dimension vector as input

LSTM-based model

EOS symbol to start outputting

Example

Recurrent-based Encoder-Decoder Architecture

Trained end-to-end.

Encoder

Decoder

Dedicated Dialog Architecture

Training

Supervised learning

Supervised: Learning from corpus

Algorithm:

Input user utterance

Calculate system output

Measure error

Backpropagate error

Update weights

Problem:

Errors lead to a different dialogue state

Compounding errors

Imitation learning

Imitation learning

Interactive learning

Correct mistakes and demonstrate expected actions

Algorithm: same as supervised learning

Problem: costly

Deep reinforcement learning

Imitation learning

Interactive learning

Feedback only at end of the dialogue

Successful/ Failed task

Additional reward for fewer steps 👏

Challenge:

Sampling of different actions

Huge action space

Natural Language Generation

Sat, 19 Sep 2020 00:00:00 +0000

Motivation

🎯 Goal: generate natural language from semantic representation (or other data)

Examples

Pollen Forecast

Pollen Forecast for Scotland

Taking six numbers as input, a simple NLG system generates a short textual summary of pollen levels

“Grass pollen levels for Friday have increased from the moderate to high levels of yesterday with values of around 6 to 7 across most parts of the country. However, in Northern areas, pollen levels will be moderate with values of 4.”

The actual forecast (written by a human meteorologist) from the data

“Pollen counts are expected to remain high at level 6 over most of Scotland, and even level 7 in the south east. The only relief is in the Northern Isles and far northeast of mainland Scotland with medium levels of pollen count.”

Weather Forecast

Function: Produces textual weather reports in English and French

Input: Numerical weather simulation data annotated by human forecaster

Difficulties/Challenges

Making choices

Content to be included/omitted

Organization of content into coherent structure

Style (formality, opinion, genre, personality…)

Packaging into sentences

Syntactic constructions

How to refer to entities (referring expression generation)

What words to use (lexical choice)

Rule-based methods

Six basic activities in NLG:

Content determination

Deciding what information to mention in the text

Discourse planning

Imposing ordering and structure over the information to convey

Sentence aggregation

Merging of similar sentences to improve readability and naturalness

Lexicalization

Deciding the specific words and phrases to express the concepts and relations

Referring expression generation

Selecting words or phrases to identify domain entities

Linguistic realization

Creating the actual text, which is correct according to the grammar rules of syntax, morphology and orthography

3-stage pipelined architecture:

Text planning (Act 1 and 2)

Sentence planning (Act 3, 4, and 5)

Linguistic realization (Act 6)

Intermediate representations: Text plans

Represented as trees whose leaf nodes specify individual messages and internal nodes show how messages are conceptually grouped

Sentence plans

Template representation, possibly with some linguistic processing → Represent sentences as boilerplate text and parameters that need to be inserted into the boilerplate text

abstract sentential representation → Specify the content words (nouns, verbs, adjectives and adverbs) of a sentence, and how they are related

Text/Document planner

Determine

what information to communicate

how to structure information into a coherent text

Common Approaches:

methods based on observations about common text structures (Schemas)

methods based on reasoning about the purpose of the text and discourse coherence (Rhetorical Structure Theory, planning)

Content Selection

Text is sequence of MESSAGES, predefined data structures:

correspond to informational units in the text

collect together underlying data in ways that are convenient for linguistic expression

How to devise MESSAGE types?

Rhetorical predicates: generalizations made by linguists

From corpus analysis, identify agglomerations of informational elements

Application dependent

Rhetorical predicates

Attribute

E.g. Mary has a pink coat.

Equivalence

E.g. Wines described as ‘great’ are fine wines from an especially good village.

Specification

E.g. [The machine is heavy.] It weighs 2 tons.

Constituency

E.g. [This is an octopus.] There is his eye, these are his legs, and he has these suction cups.

Evidence

E.g. [The audience recognized the difference.] They started laughing right from the very first frames of that film.

…

Corpus-based content selection

(Take weather forecast as example)

Routine messages: always included

E.g.

MonthlyRainFallMsg

MonthlyTemperatureMsg

RainSoFarMsg

MonthlyRainyDaysMsg

Significant Event messages: Only constructed if the data warrants it

E.g. if rain occurs on more than a specified number of days in a row

RainEventMsg

RainSpellMsg

TemperatureEventMsg

Example

Define Schemas

Produces a text/document plan

a tree structure populated by messages at its leaf nodes

Aggregation

Deciding how messages should be composed together to produce specifications for sentences or other linguistic units

On the basis of

Information content

Possible forms of realization

Semantics

Some possibilities:

Simple conjunction

Ellipsis

Embedding

Set introduction

Example

Without aggregation:

Heavy rain fell on the 27th. Heavy rain fell on the 28th.

Aggregation via simple conjunction:

Heavy rain fell on the 27th and heavy rain fell on the 28th.

Aggregation via ellipsis:

Heavy rain fell on the 27th and [] on the 28th.

Aggregation via set introduction:

Heavy rain fell on the 27th and 28th.

Lexicalization

Choose words and syntactic structures to express content selected

If several lexicalizations are possible, consider:

user knowledge and preferences

consistency with previous usage

Pragmatics: emphasis, level of formality, personality, …

interaction with other aspects of micro planning

Example

S: rainfall was very poor

NP: a much worse than average rainfall

ADJP: much drier than average

Generating Referring Expressions (GRE)

Identify specific domain objects and entities

GRE produces description of object or event that allows hearer to distinguish it from distractors

Issues

Initial introduction of an object

Subsequent references to an already salient object

Example

Referring to months:

June 1999

June

the month

next June

Referring to temporal intervals

8 days starting from the 11th

From the 11th to the 18th

(Relatively simple, so can be hardcoded in document planning)

Realization

🎯 Goal: to convert text specifications into actual text

Purpose: hide the peculiarities of the target language from the rest of the NLG system

Example

Evaluation

Task-based (extrinsic) evaluation

how well the generated text helps to perform a task

Human ratings

quality and usefulness of the text

Metrics

e.g. BLEU (Bilingual Evaluation Understudy)

Quality is considered to be the correspondence between machine’s output and that of a human

Statistical methods

Problems of conventional NLG components

expensive to build

need lots of handcrafting or a well-labeled dataset to be trained on

kind and amount of available data severely limits the development 😢

makes cross-domain, multi-lingual SDSs (Spoken Dialogue Systems) intractable 😢

Motivation

human languages are context-aware

natural responses should be learned directly from data rather than depend on predefined syntax or rules

Deep Learning NLG

Significant progress in applying statistical methods for SLU and DM in the past decade

including making them more easily extensible to other application/domains

Data-driven NLG for SDSs relatively unexplored due to mentioned difficulty of collecting semantically-annotated corpora

rule-based NLG remains the norm for most systems

Goal of the NLG component of an SDS:

map an abstract dialog act consisting of an act type and a set of attribute(slot)-value pairs into an appropriate surface text

(RNN-based) Generation

Conditional text generation

Text has different length

Use RNN-based neural network

Decoding

Initialize RNN with input

Hidden state or first input

Generate output probability for first word

Sample first word/Select most probable word

Insert selected word into RNN

Continue till <eos>

🔴 Challenges

Large vocabulary

Names of all restaurants

Delexicalization: Replace slot values by slot names

Vanishing gradient

Repeated input

Gating of input vector

Problem: Output NAME several times

Remove NAME from S when it has been output

Only backward dependencies

Rerank output with different models

N-Best list reranking

Cannot look at all possible output

But: Generate several good outputs (e.g. top 10; top 100)

Then we can also use other models to evaluate them

Possible to select different one

But if a good output is not in the N-best list, we cannot find it 🤪

N-Best generation

Beam search

Select top $k$ words at timestep 1

Independently insert all of them at timestep 2

Select top $k$ words

$k \cdot k$ possible outputs at timestep 2

Filter top $k$

Continue with top $k$ at timestep 3
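The beam search steps above can be sketched as follows. The function `toy_model` is a hypothetical stand-in for an RNN decoder that returns a next-word distribution for a given prefix; its vocabulary and probabilities are invented.

```python
import math


def beam_search(next_probs, k=2, max_len=3, eos="<eos>"):
    """Keep the k best partial outputs per timestep, scored by log-probability."""
    beams = [([], 0.0)]
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            if words and words[-1] == eos:
                candidates.append((words, score))  # finished hypotheses are kept as-is
                continue
            # Expand each hypothesis independently with its k best continuations.
            top = sorted(next_probs(words).items(), key=lambda kv: -kv[1])[:k]
            candidates.extend((words + [w], score + math.log(p)) for w, p in top)
        # Up to k*k candidates are filtered back down to the top k.
        beams = sorted(candidates, key=lambda c: -c[1])[:k]
    return beams


def toy_model(prefix):
    """Hypothetical decoder: next-word distribution given the words generated so far."""
    table = {
        (): {"the": 0.6, "a": 0.4},
        ("the",): {"restaurant": 0.5, "bar": 0.5},
        ("a",): {"restaurant": 0.9, "bar": 0.1},
    }
    return table.get(tuple(prefix), {"<eos>": 1.0})


best_words, best_score = beam_search(toy_model)[0]
```

In this toy example greedy decoding would start with "the" (probability 0.6) and end at a total probability of 0.3, while the beam keeps "a" alive and finds the better sequence with probability 0.36.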

Right to left

Rescoring

Inverse direction

Left-to-right decoding

RNN allows generation from left-to-right

👍 Advantages

Do not need to generate all possible output and then evaluate

Possible for most tasks

👎 Disadvantages

No global view

Word probability only on previous words

Non-optimal modeling if all slots have been filled

Generating long sequence

RNN prefers short sequences –> Hard to train long sequences 😢

Incoherent E.g. The sun is the center of the sun

Redundant E.g. I like cake and cake

Contradictory E.g. I don’t own a gun, but I do own a gun

💡 Idea:

Generate only fixed-length segments

Condition on input and previous target sequence

Generating by editing

Similar sentence should be in the training data

Edit this sentence instead of generating new sentence

💡Idea

Find similar sentence

Combine edit vector and input sentence

Generate output sentence

Use sequence to sequence model

Again RNN

But easier to copy than to generate

Information Retrieval

Sun, 20 Sep 2020 00:00:00 +0000

Overview

Information Retrieval (IR):

finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Use case / applications

web search (most common)

E-mail search

Searching your laptop

Corporate knowledge bases

Legal information retrieval

Basic idea

Collection: A set of documents

Assume it is a static collection for the moment

🎯 Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task

Main idea

Compare document and query to estimate relevance

Components

Representation: How to represent the document and query

Metric: How to compare document and query

Evaluation of retrieved docs

“How good are the retrieved docs?”

Precision: Fraction of retrieved docs that are relevant to the user’s information need

Recall: Fraction of relevant docs in collection that are retrieved
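A short worked example of both measures, assuming documents are identified by integer docIDs. The retrieved and relevant sets are made up.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set with respect to the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# 4 documents retrieved, 5 relevant documents in the collection, 2 in common.
precision, recall = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
```

Here half of the retrieved documents are relevant (precision 0.5), but only two of the five relevant documents were found (recall 0.4).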

Logic-based IR

Find all text containing words

Allow boolean operations between words

Representation: Words occurring in the document

Metric: Matching (with Boolean operations)

Limitations

Only exact matches

No relevance metric 🤪

Primary commercial retrieval tool for 3 decades.

Many search systems you still use are Boolean:

Email

library catalog

Mac OSX Spotlight

Example

“Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?”

One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia

But this is not the answer 😢

Slow (for large corpora)

NOT Calpurnia is non-trivial

Other operations (e.g., find the word Romans near countrymen) not feasible

Incidence vectors

0/1 vector for each term

To answer the query in the example above:

take the vectors for Brutus, Caesar and Calpurnia (complemented), then bitwise AND.

Brutus: 110100 AND

Caesar: 110111 AND

complemented Calpurnia: 101111

= 100100

However, this is not feasible for large collection! 😭
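The bitwise computation above can be reproduced with Python integers, assuming six documents and the incidence vector 010000 for Calpurnia (the complement of the vector shown above).

```python
N_DOCS = 6

# One bit per document, leftmost bit = first document.
brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000  # assumed: complement of 101111 from the example

# Python integers are unbounded, so mask the complement down to N_DOCS bits.
mask = (1 << N_DOCS) - 1
result = brutus & caesar & (~calpurnia & mask)
matches = format(result, "06b")
```

The two set bits in `matches` mark the documents that contain Brutus and Caesar but not Calpurnia.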

More see: An example information retrieval problem

Inverted index

For each term $t$, store a list of all documents that contain $t$.

Identify each doc by a docID, a document serial number

Construction

Collect the documents to be indexed

Tokenize the text, turning each document into a list of tokens

Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms

Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings
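The four construction steps can be sketched as below. The regular-expression tokenizer and lowercasing are crude stand-ins for real tokenization and linguistic preprocessing, and the two short documents are only illustrative.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each normalized term to a sorted list of docIDs (its postings)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Tokenize and normalize: lowercase, keep alphabetic runs only (an assumption).
        for token in re.findall(r"[a-z]+", text.lower()):
            index[token].add(doc_id)
    # Dictionary sorted by term, postings sorted by docID.
    return {term: sorted(ids) for term, ids in sorted(index.items())}


docs = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}
index = build_inverted_index(docs)
```

Each term appears once in the dictionary no matter how often it occurs, and its postings list records only which documents contain it.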

Example

More see: A first take at building an inverted index

Initial stages of text processing

Tokenization: Cut character sequence into word tokens

Normalization: Map text and query term to same form

E.g. We want U.S.A and USA to match

Stemming: different forms of a root to match

E.g. authorize and authorization should match

Stop words: we may omit very common words (or not)

E.g. the, a, to, of…

Query processing: AND

For example, consider processing the query: Brutus AND Caesar

Locate Brutus in the Dictionary

Retrieve its postings

Locate Caesar in the Dictionary

Retrieve its postings

“Merge” the two postings (intersect the document sets)

Walk through the two postings simultaneously, in time linear in the total number of postings entries

(If the list lengths are $x$ and $y$, the merge takes $O(x+y)$ operations.)

‼️Crucial: postings sorted by docID
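A minimal sketch of the linear-time merge, assuming both postings lists are already sorted by docID. The docIDs are illustrative values.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(x + y) steps."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer at the smaller docID
        else:
            j += 1
    return answer


brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
common = intersect(brutus, caesar)
```

Because each step advances at least one pointer, the loop runs at most $x + y$ times; this only works if both lists are sorted by docID.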

Phrase queries

E.g. We want to be able to answer queries such as “stanford university” as a phrase

–> The sentence “I went to university at stanford” is not a match.

Implementation:

Multi-words

Position index

Rank-based IR

Motivation

Boolean queries: Documents either match or don’t.

Good for:

expert users with precise understanding of their needs and the collection.

applications: Applications can easily consume 1000s of results.

NOT good for the majority of users

Most users incapable of writing Boolean queries (or they are, but they think it’s too much work).

Most users don’t want to wade through 1000s of results.

🔴 Problem: feast or famine

Often result in either too few (=0) or too many (1000s) results.

It takes a lot of skill to come up with a query that produces a manageable number of hits.

AND gives too few;

OR gives too many

Ranked retrieval models

Returns an ordering over the (top) documents in the collection for a query

Free text queries: Rather than a query language of operators and expressions, the user’s query is just one or more words in a human language

Large result sets are not an issue

Indeed, the size of the result set is not an issue

We just show the top $k$ (≈10) results

We don’t overwhelm the user

Premise: the ranking algorithm works

Representation:

Term weights (TF-IDF)

Word embeddings

Char Embeddings

Metric

Cosine similarity

Supervised trained classifier using clickthrough logs

Document similarity

Query-document matching scores

Assigning a score to a query/document pair

One-term query

If the query term does not occur in the document: score should be 0

The more frequent the query term in the document, the higher the score (should be)

Binary term-document incidence matrix

Each document is represented by a binary vector $\in \\{0, 1\\}^{|V|}$

Term-document count matrices

Consider the number of occurrences of a term in a document

Each document is a count vector in $\mathbb{N}^{|V|}$ (a column of the term-document count matrix)

Term frequency tf

Term frequency of term $t$ in document $d$
$$ \text{tf}\_{t,d}:= \text{number of times that } t \text{ occurs in } d $$

We want to use tf when computing query-document match scores

Log-frequency weighting

Log frequency weight of term $t$ in $d$

$$ w\_{t,d} = \begin{cases} 1 + \log\_{10}\text{tf}\_{t, d}& \text{if } \text{tf}\_{t,d}>0 \\\\ 0 & \text {otherwise }\end{cases} $$

Score for a document-query pair: sum over terms $t$ in both $q$ and $d$
$$ \text{score} = \sum\_{t \in q \cap d}(1 + \log\_{10} \text{tf}\_{t,d}) $$
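
A small sketch of this score, assuming base-10 logarithms as in the weight definition and plain whitespace tokenization (both choices are assumptions for the example):

```python
import math
from collections import Counter

def log_tf_weight(tf):
    """Log-frequency weight: 1 + log10(tf) if tf > 0, otherwise 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0

def score(query, document):
    """Sum the log-frequency weights over the terms that occur in both query and document."""
    tf = Counter(document.lower().split())
    query_terms = set(query.lower().split())
    return sum(log_tf_weight(tf[t]) for t in query_terms if tf[t] > 0)

# Made-up document: "caesar" occurs 10 times, "brutus" once.
doc = " ".join(["caesar"] * 10 + ["brutus"])
print(score("brutus caesar", doc))     # 3.0, i.e. (1 + log10 1) + (1 + log10 10)
print(score("calpurnia", doc))         # 0, no query term occurs in the document
```

Ten occurrences contribute a weight of 2 rather than 10, which is the dampening effect of the logarithm.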

Document frequency

💡 Rare terms are more informative than frequent terms

$\text{df}\_t$: Document frequency of $t$

The number of documents that contain $t$

Inverse measure of the informativeness of $t$

$\text{df}\_t \leq N$

$\text{idf}\_t$: inverse document frequency of $t$
$$ \text{idf}\_t = \log\_{10}(\frac{N}{\text{df}\_t}) $$
(use $\log (N/\text{df}\_t)$ instead of $N/\text{df}\_t$ to “dampen” the effect of $\text{idf}$)

Collection frequency of $t$: the number of occurrences of $t$ in the collection, counting multiple occurrences.

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight
$$ \mathrm{w}\_{t, d}=\left(1+\log \_{10} \mathrm{tf}\_{t, d}\right) \times \log \_{10}\left(N / \mathrm{df}\_{t}\right) $$
(for $\mathrm{tf}\_{t, d}>0$; otherwise the weight is 0)

Best known weighting scheme in information retrieval

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection

Example

Each document is now represented by a real-valued vector of tf-idf weights $\in \mathbb{R}^{|V|}$
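
As a rough sketch, tf-idf vectors can be computed as below, using the log-frequency tf weight $1 + \log\_{10}\text{tf}\_{t,d}$ and $\text{idf}\_t = \log\_{10}(N/\text{df}\_t)$ defined earlier. The three-document collection and whitespace tokenization are assumptions for illustration; vectors are stored sparsely as dictionaries, so terms missing from a document implicitly have weight 0.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # df[t] = number of documents that contain t (each document counted once).
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (1.0 + math.log10(count)) * math.log10(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

docs = ["caesar caesar brutus", "caesar calpurnia", "caesar"]
vectors = tf_idf_vectors(docs)
print(vectors[0]["caesar"])            # 0.0, "caesar" occurs in every document
print(round(vectors[0]["brutus"], 3))  # 0.477, i.e. log10(3 / 1)
```

A term that appears in every document gets idf 0 and therefore carries no weight, however often it occurs.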

Documents as vectors

$|V|$-dimensional vector space

Terms are axes of the space

Documents are points or vectors in this space

Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine! 😱

Very sparse vectors (most entries are zero)

Distributional similarity based representations

Get a lot of value by representing a word by means of its neighbors

“You shall know a word by the company it keeps”

Low dimensional vectors

The number of topics that people talk about is small

💡Idea: store “most” of the important information in a fixed, small number of dimensions: a dense vector (Usually 25 – 1000 dimensions)

Reduce the dimensionality: Go from big, sparse co-occurrence count vector to low dimensional “word embedding”

Traditional Way: Latent Semantic Indexing/Analysis

Use Singular Value Decomposition (SVD)

Similarity is preserved as much as possible

DL methods

Word representation in neural networks:

1-hot vector

Sparse representation

NN learn continuous dense representation

Word embeddings

End-to-End learning

Pre-training using other task

Word embeddings

Predict surrounding words

E.g. Word2Vec, GloVe

Document representation:

TF-IDF Vectors: Sum of word vectors

Word embeddings: Sum or average of word vectors

🔴 Problems

High dimension

Unseen words: Not possible to represent words not seen in training

Morphology: No modelling of spelling similarity

Letter n-grams

Mark begin and ending

E.g. #good#

Letter tri-grams

E.g. #go, goo, ood, od#

🔴 Problem:

Collision: Different words may be represented by same trigrams
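
Letter n-gram extraction is short enough to sketch directly. The collision check at the end uses a made-up string and assumes words are compared as sets of trigrams, which is only one possible representation.

```python
def letter_ngrams(word, n=3):
    """Return the letter n-grams of `word` after marking its beginning and end with '#'."""
    marked = f"#{word}#"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

print(letter_ngrams("good"))  # ['#go', 'goo', 'ood', 'od#']

# Collision: two different strings with the same set of trigrams.
print(set(letter_ngrams("banana")) == set(letter_ngrams("bananana")))  # True
```

Because the n-grams are built from characters, unseen words still get a representation, and words with similar spelling share many n-grams.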

Measure similarity

Rank documents according to their proximity to the query in this space

proximity = similarity of vectors

proximity ≈ inverse of distance

(Euclidean) Distance is a bad idea!

Euclidean distance is large for vectors of different lengths

Use angle instead of distance

💡 Key idea: Rank documents according to angle with query.

From angles to cosines

As cosine is a monotonically decreasing function on the interval $[0^{\circ}, 180^{\circ}]$, the following two notions are equivalent:

Rank documents in increasing order of the angle between query and document (smallest angle first)

Rank documents in decreasing order of $\operatorname{cosine}(\text{query},\text{document})$ (largest cosine first)

Length normalization

Dividing a vector by its $L\_2$ norm makes it a unit (length) vector (on the surface of the unit hypersphere)
$$ \|\vec{x}\|\_{2}=\sqrt{\sum x\_{i}^{2}} $$
–> Long and short documents now have comparable weights

$\operatorname{cosine}(\text{query},\text{document})$
$$ \cos (\vec{q}, \vec{d})=\frac{\vec{q} \cdot \vec{d}}{|\vec{q}||\vec{d}|}=\frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|}=\frac{\sum_{i=1}^{|V|} q_{i} d_{i}}{\sqrt{\sum_{i=1}^{|V|} q_{i}^{2}} \sqrt{\sum_{i=1}^{|V|} d_{i}^{2}}} $$

$q\_i$: e.g. the tf-idf weight of term i in the query

$d\_i$: e.g. the tf-idf weight of term i in the document
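
The cosine formula translates directly into code. In this sketch vectors are sparse `{term: weight}` dictionaries and the weights are made-up numbers; returning 0 for an all-zero vector is a convention chosen here, not something the formula defines.

```python
import math

def cosine(q, d):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * d.get(term, 0.0) for term, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_q * norm_d)

query = {"brutus": 1.0, "caesar": 1.0}
long_doc = {"brutus": 10.0, "caesar": 10.0}  # same direction, ten times longer
other_doc = {"calpurnia": 3.0}               # shares no term with the query

print(round(cosine(query, long_doc), 6))  # 1.0
print(cosine(query, other_doc))           # 0.0
```

The long document points in the same direction as the query, so its cosine is 1 even though its Euclidean distance to the query is large.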

Illustration example

Link information

Hypertext and links

Questions

Do the links represent a conferral of authority to some pages? Is this useful for ranking?

Application

The Web

Email

Social networks

Links

The Good, The Bad and The Unknown

Good nodes won’t point to Bad nodes

All other combinations plausible

If you point to a Bad node, you’re Bad

If a Good node points to you, you’re Good

Web as a Directed Graph

Hypothesis 1: A hyperlink between pages denotes a conferral of authority (quality signal)

Hypothesis 2: The text in the anchor of the hyperlink on page A describes the target page B

Anchor Text

Assumptions

reputed sites

annotation of target

Indexing: When indexing a document D, include (with some weight) anchor text from links pointing to D.
- Can sometimes have unexpected effects, e.g., spam, **miserable failure** 🤪
- Solution: score anchor text with weight depending on the **authority** of the anchor page’s website
- *E.g., if we were to assume that content from cnn.com or yahoo.com is authoritative, then trust (more) the anchor text from them*

Link analysis: Pagerank

Citation Analysis

Citation frequency

Bibliographic coupling frequency: Articles that co-cite the same articles are related

Citation indexing

Pagerank scoring

Imagine a user doing a random walk on web pages:

Start at a random page

At each step, go out of the current page along one of the links on that page, equiprobably

“In the long run” each page has a long-term visit rate - use this as the page’s score.

But the web is full of dead-ends.

Random walk can get stuck in dead-ends. 😢

Makes no sense to talk about long-term visit rates.

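Teleporting: At a dead end, jump to a random web page.

At any non-dead end, with probability 10%, jump to a random web page; with the remaining probability (90%), follow one of the page's outgoing links, equiprobably.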

Result of teleporting

Now cannot get stuck locally.

There is a long-term rate at which any page is visited
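
The random walk with teleporting can be simulated by repeatedly propagating visit probabilities, which is a minimal power-iteration sketch rather than a production PageRank. The five-page graph `web`, the function name, and the fixed number of iterations are assumptions for illustration; the 10% teleport probability is the one stated above.

```python
def pagerank(links, teleport=0.10, iterations=100):
    """Estimate long-term visit rates of the random walk with teleporting.

    `links` maps each page to the list of pages it links to; every link
    target must also be a key of `links`.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start at a random page
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:
                # Dead end: jump to a random page.
                for q in pages:
                    new_rank[q] += rank[p] / n
            else:
                # With probability `teleport`, jump to a random page ...
                for q in pages:
                    new_rank[q] += teleport * rank[p] / n
                # ... otherwise follow one of the outgoing links, equiprobably.
                for q in out:
                    new_rank[q] += (1.0 - teleport) * rank[p] / len(out)
        rank = new_rank
    return rank

# Hypothetical graph: E is a dead end, D and E have no incoming links.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"], "E": []}
rank = pagerank(web)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

Every page ends up with a non-zero score because teleporting can reach it, and the scores sum to 1 since they are probabilities.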

Language and Vision

Sun, 20 Sep 2020 00:00:00 +0000

Motivation

Humans interact with their environment multimodally

Modalities

Text

Audio

Vision

Other modalities can be used to disambiguate text

Jointly using different modalities

Image description

Generation

Generate description/caption of image

Verbalize the most salient aspects of the image

Typically one sentence

Example

Joint use of

Computer vision

Natural language processing

🔴 Challenges

Cover any visual aspect of the image:

Objects and their attributes

Features of the scene

Interaction of objects

Reference to objects not in the image:

E.g. people waiting for a train

Background knowledge necessary

E.g. Picture of Mona Lisa

Task

Input: Image

Generate representation

Output: Text

Related to Natural language generation

Content selection

Organizing of content

Surface realization

Generation from Visual Input

Standard pipeline:

Computer vision: Recognize

Scene

Objects

Spatial relationship

Actions

Natural language generation

Combine words/phrases from first step using

Templates

N-grams

Grammar rules

Example

End-to-End approaches (Show, Attend, Tell)

CNN Encoder of the image

LSTM-based Decoder generating the sentences

Attention mechanism to attend to different parts of the image

Examples

Retrieval

💡 Idea: Use description of similar image

Algorithm:

Extract visual feature

Retrieve most similar images using similarity function

Re-rank images

Combine retrieved descriptions

Example

Description retrieval

Visual question answering

Given:

Image

Question related to the image

Example

Output: Answer

Most common model: Joint neural network

🔴 Challenges: Multi-step reasoning

Steps

Locate objects (bike, window, street, basket and dogs)

Identify concept (sitting)

Rule out irrelevant objects

Image model

CNN:

Often pretrained models used

Global features: Fixed size representation of the whole image

Local features: Representation of different regions of the image

Text model

Read question word by word

Answer generation

One word or free text

Input: Image features and text features

Output: Most probable word

Models:

Fully connected NN

Attention mechanism

	Lookup Speed	Memory Requirement	Update
Binary trie	$O(W)$	$O(NW)$	$O(W)$
Path compression	$O(W)$	$O(N)$	$O(W)$
Multibit trie

Random variable	Discrete	Continuous
Example	Die roll	Time, temperature
Probability for	a specific point: $P(X=x) \in [0, 1]$	ONLY an interval ($P(X=x) = 0$)
Probability mass function / density function	Probability mass function $f(x): \Omega \rightarrow[0,1], x \in \mathbb{N}_{0}$ $f(x) = P(X=x)$ $\sum_{x \in \Omega} f(x)=1$	Density function $f(x): \mathbf{\Omega} \rightarrow \mathbb{R}^{+}$ $f$ is integrable $f(x) \geq 0 \quad \forall x \in \mathbb{R}$ $\displaystyle \int_{-\infty}^{+\infty} f(x) \mathrm{d} x=1$
Cumulative distribution function	$F(x): \boldsymbol{\Omega} \rightarrow[\mathbf{0}, \mathbf{1}], x \in \mathbb{N}_{\mathbf{0}}$ $F(x)= P(X \leq x) = \sum_{x_{i} \leq x} f\left(x_{i}\right)$	$F(x): \Omega \rightarrow[0,1], x \in \mathbb{R}$ $F(x)=\int_{-\infty}^{x} f(t) \mathrm{d} t, \quad f(x)=\frac{\mathrm{d} F(x)}{\mathrm{d} x}$

Random variable	Discrete	Continuous
Expected value ($\mu$, $E(X)$)	$\sum_{i \in \Omega} x_{i} \cdot p_{i}$	$\int_{-\infty}^{+\infty} x \cdot f(x) \mathrm{d} x$
Variance ($\sigma^2$, $Var(X)$)	$\sum_{i \in \Omega}\left(x_{i}-\mu\right)^{2} \cdot p_{i}$	$\int_{-\infty}^{+\infty}(x-\mu)^{2} \cdot f(x) \mathrm{d} x$
Standard deviation ($\sigma$)	$\sqrt{Var(X)}$	$\sqrt{Var(X)}$
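
As a worked instance of the discrete column, the sketch below evaluates the three formulas for a single die roll. The die is assumed to be fair, so every outcome has probability $1/6$; exact fractions are used to avoid rounding.

```python
import math
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)  # assumption: fair die, all outcomes equally likely

mean = sum(x * p for x in outcomes)                     # sum of x_i * p_i
variance = sum((x - mean) ** 2 * p for x in outcomes)   # sum of (x_i - mu)^2 * p_i
std = math.sqrt(variance)                               # square root of the variance

print(mean)           # 7/2
print(variance)       # 35/12
print(round(std, 3))  # 1.708
```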

Flow Programming	Characteristics	Delay?	Loss of controller connectivity
Proactive	coarse grained, pre-defined	No	Does not disrupt traffic
Reactive	fine grained, on demand	Yes	New flows cannot be installed

Ethernet variants	Data rate	Topology	Medium access	Evaluation	Layers	Flow control
Original	10 Mbit/s	bus	CSMA/CD: check medium, 1-persistent sending, collision detection by sender, exponential backoff	Utilization	1 and 2a
Fast Ethernet	100 Mbit/s	star	CSMA/CD			Implicit / Explicit
Gigabit Ethernet	1 Gbit/s	star	Carrier extension, frame bursting

Component	number
pod	$k$
edge switch	$\frac{k^2}{2}$
aggregation switch	$\frac{k^2}{2}$
core switch	$(\frac{k}{2})^2$
server	$\frac{k^3}{4}$
links between switches	$\frac{k^3}{2}$
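
The counts in this table can be evaluated for a concrete $k$. The sketch assumes the table describes a fat-tree built from $k$-port switches (the table itself does not name the topology) and therefore requires $k$ to be even; the function name is hypothetical.

```python
def fat_tree_counts(k):
    """Evaluate the component-count formulas of the table for a given even k."""
    if k <= 0 or k % 2:
        raise ValueError("k must be a positive even integer")
    return {
        "pods": k,
        "edge_switches": k * k // 2,
        "aggregation_switches": k * k // 2,
        "core_switches": (k // 2) ** 2,
        "servers": k ** 3 // 4,
        "switch_links": k ** 3 // 2,
    }

counts = fat_tree_counts(4)
print(counts["servers"], counts["core_switches"], counts["switch_links"])  # 16 4 32
```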

	Linear	Nonlinear
System model	$\underline{x}_{k+1} = \mathbf{A}_k \underline{x}_k + \mathbf{B}_k (\underline{u}_k + \underline{w}_k)$	$\underline{x}_{k+1} = \underline{a}_k(\underline{x}_k, \underline{u}_k, \underline{w}_k)$
Measurement model	$\underline{y}_{k} = \mathbf{H}_k \underline{x}_k + \underline{v}_k$	$\underline{y}_k = \underline{h}_k (\underline{x}_k, \underline{v}_k)$

	(Linear) KF	EKF
Prediction	$\underline{\hat{x}}_k^p = \mathbf{A}_{k-1}\underline{\hat{x}}_{k-1}^e + \mathbf{B}_{k-1} \underline{\hat{u}}_{k-1}$ $\mathbf{C}_k^p = \mathbf{A}_{k-1} \mathbf{C}_{k-1}^e \mathbf{A}_{k-1}^\top + \mathbf{B}_{k-1} \mathbf{C}_{k-1}^w \mathbf{B}_{k-1}^\top$	$\underline{\hat{x}}_{k+1}^{p}=\underline{a}_{k}\left(\underline{\hat{x}}_{k}^{e}, \hat{\underline{u}}_{k}\right)$ $\mathbf{C}_{k+1}^{p} \approx \mathbf{A}_{k} \mathbf{C}_{k}^{e} \mathbf{A}_{k}^{\top}+\mathbf{C}_{k}^{w^{\prime}}=\mathbf{A}_{k} \mathbf{C}_{k}^{e} \mathbf{A}_{k}^{\top}+\mathbf{B}_{k} \mathbf{C}_{k}^{w} \mathbf{B}_{k}^{\top}$
Filtering	$\mathbf{K}_k = \mathbf{C}_k^p \mathbf{H}_k^\top (\mathbf{C}_k^v + \mathbf{H}_k \mathbf{C}_k^p \mathbf{H}_k ^\top)^{-1}$ $\underline{\hat{x}}_k^e = \underline{\hat{x}}_k^p + \mathbf{K}_k(\underline{\hat{y}}_k - \mathbf{H}_k \underline{\hat{x}}_k^p)$ $\mathbf{C}_k^e = (\mathbf{I} - \mathbf{K}_k\mathbf{H}_k)\mathbf{C}_k^p$	$\begin{aligned} \mathbf{K}_{k}&=\mathbf{C}_{k}^{p} \mathbf{H}_{k}^{\top}\left(\mathbf{L}_{k} \mathbf{C}_{k}^{v} \mathbf{L}_{k}^{\top}+\mathbf{H}_{k} \mathbf{C}_{k}^{p} \mathbf{H}_{k}^{\top}\right)^{-1} \\ \hat{\underline{x}}_{k}^{e}&=\hat{\underline{x}}_{k}^{p}+\mathbf{K}_{k}\left[\hat{\underline{y}}_{k}-\underline{h}_{k}\left(\hat{\underline{x}}_{k}^{p}, \hat{\underline{v}}_{k}\right)\right] \\ \mathbf{C}_{k}^{e}&=\mathbf{C}_{k}^{p}-\mathbf{K}_{k} \mathbf{H}_{k} \mathbf{C}_{k}^{p} = (\mathbf{I} - \mathbf{K}_{k} \mathbf{H}_{k})\mathbf{C}_{k}^{p} \end{aligned}$
Auxiliary		$\mathbf{A}_k = \left.\frac{\partial \underline{a}_{k}\left(\underline{x}_{k}, \underline{u}_{k}\right)}{\partial \underline{x}_{k}^{\top}}\right\|_{\underline{x}_{k}=\underline{\hat{x}}_{k-1}^{e}, \underline{u}_{k}=\hat{\underline{u}}_{k}}$ $\mathbf{B}_k = \left.\frac{\partial \underline{a}_{k}\left(\underline{x}_{k}, \underline{u}_{k}\right)}{\partial \underline{u}_{k}^{\top}}\right\|_{\underline{x}_{k}=\underline{\hat{x}}_{k-1}^{e}, \underline{u}_{k}=\hat{\underline{u}}_{k}}$ $\mathbf{H}_{k}=\left.\frac{\partial \underline{h}_{k}\left(\underline{x}_{k}, \underline{v}_{k}\right)}{\partial \underline{x}_{k}^{\top}}\right\|_{\underline{x}_{k}=\underline{\hat{x}}_{k}^{p}, \underline{v}_{k}=\underline{\hat{v}}_{k}}$ $\mathbf{L}_{k}=\left.\frac{\partial \underline{h}_{k}\left(\underline{x}_{k}, \underline{v}_{k}\right)}{\partial \underline{v}_{k}^{\top}}\right\|_{\underline{x}_{k}=\underline{\hat{x}}_{k}^{p}, \underline{v}_{k}=\underline{\hat{v}}_{k}}$

Model type	Input	Output	Example task
Classification	Fixed input size (e.g. a word and its surrounding k words)	Label	Word sense disambiguation
Sequence classification	Sequence with variable length	Label	Sentiment analysis
Sequence labelling	Sequence with variable length	Label sequence with same length	Named entity recognition
Sequence to Sequence model	Sequence with variable length	Sequence with variable length	Summarization
Structure prediction	Sequence with variable length	Complex structure	Parsing

Computer	Human
Select key phrases from the text	Abstraction of the text
No new wordings	New words

#	1-gram	reference 1-gram	2-gram	reference 2-gram
1	the	the	the cat	the cat
2	cat	cat	cat was	cat was
3	was	was	was found	was under
4	found	under	found under	under the
5	under	the	under the	the bed
6	the	bed	the bed
7	bed
count	7	6	6	5
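
The n-gram columns and the counts in the last row can be reproduced with a few lines. The two sentences are read off the table; the `overlap` helper, which counts candidate n-grams that also occur in the reference (clipped by the reference count, as in BLEU-style metrics), is an added assumption about how such a table is typically used.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as space-joined strings) in `tokens`."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap(candidate_ngrams, reference_ngrams):
    """Count candidate n-grams that also occur in the reference, clipped by reference count."""
    return sum((Counter(candidate_ngrams) & Counter(reference_ngrams)).values())

candidate = "the cat was found under the bed".split()
reference = "the cat was under the bed".split()

for n in (1, 2):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    # n, number of candidate n-grams, number of reference n-grams, matches
    print(n, len(cand), len(ref), overlap(cand, ref))
# 1 7 6 6
# 2 6 5 4
```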