Pattern Recognition

Why pattern recognition and what is it?

What is machine learning?

  • Motivation: Some problems are very hard to solve by writing a computer program by hand
  • Learn common patterns based on either
    • a priori knowledge or

    • statistical information

      • Important for the adaptability to different tasks/domains
      • Try to mimic human learning / better understand human learning
  • Machine learning is concerned with developing generic algorithms that are able to solve problems by learning from example data

Classifiers

  • Given an input pattern $\mathbf{x}$, assign it to a class $\omega_i$

    • Example: Given an image, assign the label "face" or "non-face"
    • $\mathbf{x}$: can be an image, a video, or (more commonly) any feature vector that can be extracted from them
    • $\omega_i$: desired (discrete) class label
      • If the "class label" is a real number or a vector → regression task
  • ML: Use example patterns with given class labels to automatically learn a classifier


  • Classification process: extract a feature vector $\mathbf{x}$ from the raw input (e.g. an image), then apply the classifier to obtain a class label $\omega_i$

Bayes Classification

  • Given a feature vector $\mathbf{x}$, want to know which class $\omega_i$ is most likely

  • Use Bayes' rule: Decide for the class $\omega_i$ with maximum posterior probability (a minimal sketch follows after this list) $$ P(\omega_i | \mathbf{x}) = \frac{p(\mathbf{x} | \omega_i)\, P(\omega_i)}{p(\mathbf{x})} = \frac{p(\mathbf{x} | \omega_i)\, P(\omega_i)}{\sum_j p(\mathbf{x} | \omega_j)\, P(\omega_j)} $$

  • 🔴 Problem: $p(\mathbf{x}|\omega_i)$ (and to a lesser degree $P(\omega_i)$) is usually unknown and often hard to estimate from data

  • Priors describe what we know about the classes before observing anything

    • Can be used to model prior knowledge

    • Sometimes easy to estimate (counting)

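A minimal sketch of the Bayes decision rule. The class names, priors, and the 1-D Gaussian class-conditional densities below are made-up illustration values, not part of the original notes:

```python
from scipy.stats import norm

# Hypothetical 1-D example: two classes with assumed Gaussian class-conditional
# densities p(x | w_i) and known priors P(w_i).
priors = {"face": 0.3, "non-face": 0.7}
likelihoods = {
    "face": norm(loc=2.0, scale=1.0),       # assumed p(x | face)
    "non-face": norm(loc=-1.0, scale=1.5),  # assumed p(x | non-face)
}

def bayes_decide(x):
    """Decide for the class with maximum posterior: P(w_i | x) is proportional to p(x | w_i) * P(w_i)."""
    scores = {w: likelihoods[w].pdf(x) * priors[w] for w in priors}
    return max(scores, key=scores.get)

print(bayes_decide(1.5))   # -> 'face'
print(bayes_decide(-2.0))  # -> 'non-face'
```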

Gaussian Mixture Models

Gaussian classification

  • Assumption: $$ \mathrm{p}\left(\mathbf{x} | \omega_{\mathrm{i}}\right) \sim \mathrm{N}(\boldsymbol{\mu}, \mathbf{\Sigma})= \frac{1}{(2 \pi)^{d / 2}|\Sigma|^{1 / 2}} \exp \left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right] $$

    • 👆 This makes estimation easier
      • Only $\boldsymbol{\mu}, \mathbf{\Sigma}$ need to be estimated
      • To reduce the number of parameters, the covariance matrix can be restricted
        • Diagonal matrix → dimensions are uncorrelated

        • Multiple of the unit matrix → dimensions are uncorrelated and share the same variance

  • 🔴 Problem: if the assumption(s) do not hold, the model does not represent reality well 😒

  • Estimation of $\boldsymbol{\mu}, \mathbf{\Sigma}$ with Maximum (Log-)Likelihood

    • Use the parameters that best explain the data (highest likelihood); assuming independent training samples: $$ \begin{aligned} \operatorname{Lik}(\boldsymbol{\mu}, \mathbf{\Sigma}) &= p(\text{data}|\boldsymbol{\mu}, \mathbf{\Sigma}) \\ &= p(\mathbf{x}_0, \mathbf{x}_1, \dots, \mathbf{x}_n|\boldsymbol{\mu}, \mathbf{\Sigma}) \\ &= p\left(\mathbf{x}_{0} | \boldsymbol{\mu}, \mathbf{\Sigma}\right) \cdot p\left(\mathbf{x}_{1} | \boldsymbol{\mu}, \mathbf{\Sigma}\right) \cdot \ldots \cdot p\left(\mathbf{x}_{n} | \boldsymbol{\mu}, \mathbf{\Sigma}\right) \end{aligned} $$

      $$ \operatorname{LogLik}(\boldsymbol{\mu}, \mathbf{\Sigma}) = \log(\operatorname{Lik}(\boldsymbol{\mu}, \mathbf{\Sigma})) = \sum_{i=0}^n \log p(\mathbf{x}_i | \boldsymbol{\mu}, \mathbf{\Sigma}) $$

      → Maximize $\log(\operatorname{Lik}(\boldsymbol{\mu}, \mathbf{\Sigma}))$ over $\boldsymbol{\mu}, \mathbf{\Sigma}$ (see the sketch below)
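
A minimal sketch of the maximum-likelihood estimates for a single Gaussian; the training data below is randomly generated purely for illustration. For a Gaussian, the ML solution is simply the sample mean and the sample covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))   # hypothetical training vectors of one class, n = 500, d = 3

# ML estimates for a single Gaussian: sample mean and sample covariance
mu_ml = X.mean(axis=0)
Sigma_ml = np.cov(X, rowvar=False, bias=True)   # bias=True: normalize by n (the ML estimate)

# Restricted covariances to reduce the number of parameters
Sigma_diag = np.diag(np.diag(Sigma_ml))                             # dimensions uncorrelated
Sigma_iso = (np.trace(Sigma_ml) / X.shape[1]) * np.eye(X.shape[1])  # multiple of the unit matrix
```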

Gaussian Mixture Models (GMMs)

  • Approximate the true density function using a weighted sum of several Gaussians, each with its own mean and covariance $$ p(\mathbf{x})=\sum_{i} w_{i} \frac{1}{(2 \pi)^{d/2}|\boldsymbol{\Sigma}_i|^{1 / 2}} \exp \left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^{\top} \boldsymbol{\Sigma}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)\right] \qquad \text{with } \sum_i w_i = 1 $$

  • Any density can be approximated this way with arbitrary precision

    • But might need many Gaussians
    • Difficult to estimate many parameters
  • Use the Expectation Maximization (EM) algorithm to estimate the parameters of the Gaussians as well as the weights (a minimal sketch follows after the steps below)

    1. Initialize parameters of GMM randomly

    2. Repeat until convergence

      • Expectation (E) step:

        Compute the probability $p_{ij}$ that data point $i$ belongs to Gaussian $j$

        • Evaluate each weighted Gaussian at data point $\mathbf{x}_i$ and normalize so the values sum to one
      • Maximization (M) step:

        Compute new GMM parameters using soft assignments $p_{ij}$

        • Maximum Likelihood with data weighted according to $p_{ij}$
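
A minimal EM sketch for a GMM with full covariances. The function name, the fixed iteration count, and the small regularization term added to the covariances are my own choices, not from the notes:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, seed=0):
    """Minimal EM sketch for a GMM with k Gaussian components."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # 1. Initialize the GMM parameters (here: random data points as means)
    w = np.full(k, 1.0 / k)
    mu = X[rng.choice(n, k, replace=False)]
    Sigma = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iter):
        # E step: responsibilities p_ij that data point i belongs to Gaussian j
        dens = np.column_stack(
            [w[j] * multivariate_normal(mu[j], Sigma[j]).pdf(X) for j in range(k)]
        )
        p = dens / dens.sum(axis=1, keepdims=True)
        # M step: weighted maximum likelihood using the soft assignments p_ij
        nk = p.sum(axis=0)
        w = nk / n
        mu = (p.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - mu[j]
            Sigma[j] = (p[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(d)
    return w, mu, Sigma
```

In practice one would monitor the log-likelihood to detect convergence instead of running a fixed number of iterations.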

Parametric vs. non-parametric

  • Parametric classifiers

    • assume a specific form of probability distribution with some parameters
    • only the parameters need to be estimated
    • πŸ‘ Advantage: Need less training data because less parameters to estimate
    • πŸ‘Ž disadvantage: Only work well if model fits data
    • Examples: Gaussian and GMMs
  • Non-parametric classifiers

    • Do NOT assume a specific form of probability distribution

    • πŸ‘ Advantage: Work well for all types of distributions

    • πŸ‘Ž disadvantage: Need more data to correctly estimate distribution

    • Examples: Parzen windows, k-nearest neighbors

Generative vs. discriminative

  • A method that models $P(\omega_i)$ and $p(\mathbf{x}|\omega_i)$ explicitly is called a generative model
    • $p(\mathbf{x}|\omega_i)$ makes it possible to generate new samples of class $\omega_i$
  • The other common approach is called a discriminative model
    • directly models $P(\omega_i|\mathbf{x})$ or just outputs a decision $\omega_i$ given an input pattern $\mathbf{x}$
    • easier to train because it solves a simpler problem 👍

Linear Discriminant Functions

  • Separate two classes $\omega_1, \omega_2$ with a linear hyperplane $$ y(\mathbf{x})=\mathbf{w}^{\top} \mathbf{x}+w_{0} $$

    • Decide $\omega_1$ if $y(\mathbf{x}) > 0$, else $\omega_2$ (see the sketch below)
    • $\mathbf{w}$: normal vector of the hyperplane
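
A minimal sketch of a linear discriminant function $y(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + w_0$; the weight vector and bias below are made-up values:

```python
import numpy as np

w = np.array([1.0, -2.0])   # normal vector of the hyperplane (hypothetical values)
w0 = 0.5                    # bias / offset

def classify(x):
    """Decide omega_1 if y(x) = w^T x + w0 > 0, else omega_2."""
    return "omega_1" if w @ x + w0 > 0 else "omega_2"

print(classify(np.array([2.0, 0.0])))   # y = 2.5  -> omega_1
print(classify(np.array([0.0, 1.0])))   # y = -1.5 -> omega_2
```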

Support Vector Machines

See: SVM

Linear SVMs

  • If the input space is already high-dimensional, linear SVMs can often perform well too (a usage sketch follows after this list)
  • πŸ‘ Advantages:
    • Speed: Only one scalar product for classification
    • Memory: Only one vector w needs to be stored
    • Training: Training is much faster
    • Model selection: Only one parameter to optimize
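
A minimal usage sketch with scikit-learn's `LinearSVC` on synthetic high-dimensional data; the dataset and parameter values are arbitrary illustration choices:

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Hypothetical high-dimensional data: 200 samples with 100 features
X, y = make_classification(n_samples=200, n_features=100, random_state=0)

clf = LinearSVC(C=1.0)   # C is essentially the only parameter to tune
clf.fit(X, y)

# Classification needs only one scalar product per sample: sign(w^T x + b)
print(clf.coef_.shape)    # (1, 100): only a single weight vector w has to be stored
print(clf.predict(X[:5]))
```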

K-nearest Neighbours

  • 💡 Look at the $k$ closest training samples and assign the most frequent label among them

  • Model consists of all training samples

    • Pro: No information is lost

    • Con: A lot of data to manage

  • Naïve implementation: compute the distance to each training sample every time (see the sketch below)

    • Distance metric is needed (Important design parameter!)
      • $L_1$, $L_2$, $L_{\infty}$, Mahalanobis, … or
      • Problem-specific distances
  • kNN is often a good classifier, but:

    • Needs enough data
    • Scalability issues

See also: k-NN
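
A naive k-NN sketch using the $L_2$ distance; the function name and the toy data are made up for illustration:

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=3):
    """Naive k-NN: compute the distance to every training sample, then majority vote."""
    dists = np.linalg.norm(X_train - x, axis=1)     # distance metric: L2
    nearest = np.argsort(dists)[:k]                 # indices of the k closest samples
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                # most frequent label among them

# Toy data: two samples per class
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(np.array([0.2, 0.1]), X_train, y_train, k=3))  # -> 0
```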

Clustering

  • New problem setting
    • Only data points are given, NO class labels
    • Find structures in given data
  • In general, no single correct solution is possible

K-means

  • Algorithm (a minimal sketch follows after this list)

    1. Randomly initialize $k$ cluster centers

    2. Repeat until convergence:

      • Assign all data points to closest cluster center

      • Compute new cluster center as mean of assigned data points

  • πŸ‘ Pros: Simple and efficient

  • 👎 Cons:

    • $k$ needs to be known in advance
    • Results depend on initialization
    • Does not work well for clusters that are not hyperspherical (round) or clusters that overlap
  • Very similar to the EM algorithm

    • Uses hard assignments instead of probabilistic assignments (EM)
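
A minimal k-means sketch; the function name and the simple convergence check are my own choices, and a real implementation would also handle empty clusters:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch with hard assignments."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # 1. randomly initialize k centers
    for _ in range(n_iter):                             # 2. repeat until convergence
        # assign every data point to the closest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # compute new cluster centers as the mean of the assigned data points
        new_centers = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign
```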

Agglomerative Hierarchical Clustering

  • Algorithm

    1. Start with one cluster for each data point

    2. Repeat

      • Merge the two closest clusters
  • Several possibilities to measure cluster distance

    • Min: minimal distance between elements

    • Max: maximal distance between elements

    • Avg: average distance between elements

    • Mean: distance between cluster means

  • The result is a tree called a dendrogram (see the sketch below)

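A minimal sketch using SciPy's hierarchical clustering; the toy points are made up, and the `method` argument selects the cluster-distance measure:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Hypothetical 2-D points
X = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [1.1, 0.9], [5.0, 5.0]])

# 'single' = min, 'complete' = max, 'average' = avg, 'centroid' = distance of means
Z = linkage(X, method="average")

labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)
# dendrogram(Z) plots the resulting tree (requires matplotlib)
```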

Curse of dimensionality

  • In computer vision, the extracted feature vectors are often high-dimensional

  • Many intuitions about linear algebra are no longer valid in high-dimensional spaces 🤪

    • Classifiers often work better in low-dimensional spaces
  • These problems are called the "curse of dimensionality" 👿


Dimensionality reduction

  • PCA: Drop dimensions while minimizing the reconstruction error (a minimal sketch follows below)

  • LDA: Maximize class separability

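A minimal sketch of both reductions with scikit-learn; the Iris dataset is used only as an example:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: unsupervised, keeps the directions of largest variance (minimizes the reconstruction error)
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: supervised, projects onto directions that maximize class separability
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)   # (150, 2) (150, 2)
```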