Pattern Recognition


Why pattern recognition and what is it?

What is machine learning?

  • Motivation: Some problems are very hard to solve by writing a computer program by hand
  • Learn common patterns based on either
    • a priori knowledge or

    • statistical information

      • Important for the adaptability to different tasks/domains
      • Try to mimic human learning / better understand human learning
  • Machine learning is concerned with developing generic algorithms that are able to solve problems by learning from example data

Classifiers

  • Given an input pattern $\mathbf{x}$, assign it to a class $\omega_i$

    • Example: Given an image, assign label “face” or “non-face”
    • $\mathbf{x}$: can be an image, a video, or (more commonly) any feature vector extracted from them
    • $\omega_i$: desired (discrete) class label
      • If the “class label” is a real number or vector –> regression task
  • ML: Use example patterns with given class labels to automatically learn a classifier

    [Screenshot 2020-11-07 12.50.02]
  • Example

    [Screenshot 2020-11-07 12.50.58]

  • Classification process

    [Screenshot 2020-11-07 12.52.02]

Bayes Classification

  • Given a feature vector $\mathbf{x}$, we want to know which class $\omega_i$ is most likely

  • Use Bayes’ rule: Decide for the class $\omega_i$ with maximum posterior probability

    [Screenshot 2020-11-07 12.53.45]

  • 🔴 Problem: $p(\mathbf{x}|\omega_i)$ (and to a lesser degree $P(\omega_i)$) is usually unknown and often hard to estimate from data

  • Priors describe what we know about the classes before observing anything

    • Can be used to model prior knowledge

    • Sometimes easy to estimate (counting)

  • Example

    [Screenshot 2020-11-07 12.56.58]
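
As a minimal illustration of the Bayes decision rule above (my own sketch, not from the lecture: the class names, priors, and Gaussian class-conditional densities are made-up assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-class problem with assumed, known parameters.
priors = {"face": 0.3, "non-face": 0.7}                      # P(omega_i)
likelihoods = {                                              # p(x | omega_i)
    "face":     multivariate_normal(mean=[1.0, 1.0], cov=[[1.0, 0.2], [0.2, 1.0]]),
    "non-face": multivariate_normal(mean=[-1.0, -0.5], cov=[[2.0, 0.0], [0.0, 2.0]]),
}

def bayes_classify(x):
    """Decide for the class with maximum posterior P(omega_i | x).

    p(x) is the same for every class, so comparing p(x | omega_i) * P(omega_i) is enough.
    """
    scores = {c: likelihoods[c].pdf(x) * priors[c] for c in priors}
    return max(scores, key=scores.get)

print(bayes_classify(np.array([0.8, 1.2])))    # -> "face"
print(bayes_classify(np.array([-2.0, -1.0])))  # -> "non-face"
```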

Gaussian Mixture Models

Gaussian classification

  • Assumption:

    p(\mathbf{x} | \omega_{i}) \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2 \pi)^{d / 2}|\boldsymbol{\Sigma}|^{1 / 2}} \exp \left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right]
    • 👆 This makes estimation easier
      • Only $\boldsymbol{\mu}, \boldsymbol{\Sigma}$ need to be estimated
      • To reduce parameters, the covariance matrix can be restricted
        • Diagonal matrix –> Dimensions uncorrelated

        • Multiple of unit matrix –> Dimensions uncorrelated with same variance

  • 🔴 Problem: if the assumption(s) do not hold, the model does not represent reality well 😢

  • Estimation of $\boldsymbol{\mu}, \boldsymbol{\Sigma}$ with Maximum (Log-)Likelihood

    • Use the parameters that best explain the data (highest likelihood):

      \begin{aligned} \operatorname{Lik}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) &= p(\text{data}|\boldsymbol{\mu}, \boldsymbol{\Sigma}) \\ &= p(\mathbf{x}_0, \mathbf{x}_1, \dots, \mathbf{x}_n|\boldsymbol{\mu}, \boldsymbol{\Sigma}) \\ &= p\left(\mathbf{x}_{0} | \boldsymbol{\mu}, \boldsymbol{\Sigma}\right) \cdot p\left(\mathbf{x}_{1} | \boldsymbol{\mu}, \boldsymbol{\Sigma}\right) \cdot \ldots \cdot p\left(\mathbf{x}_{n} | \boldsymbol{\mu}, \boldsymbol{\Sigma}\right) \end{aligned}

      \operatorname{LogLik}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \log(\operatorname{Lik}(\boldsymbol{\mu}, \boldsymbol{\Sigma})) = \sum_{i=0}^{n} \log p(\mathbf{x}_{i} | \boldsymbol{\mu}, \boldsymbol{\Sigma})

      –> Maximize $\log(\operatorname{Lik}(\boldsymbol{\mu}, \boldsymbol{\Sigma}))$ over $\boldsymbol{\mu}, \boldsymbol{\Sigma}$
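
In this Gaussian case the maximum of the log-likelihood has a closed form: the sample mean and the sample covariance (divided by n, not n−1). A small NumPy sketch (my own illustration, assuming the samples are stored row-wise):

```python
import numpy as np

def gaussian_ml_estimate(X):
    """Maximum-likelihood estimates of a Gaussian from X (n samples x d dims).

    mu_hat    = (1/n) * sum_i x_i
    Sigma_hat = (1/n) * sum_i (x_i - mu_hat)(x_i - mu_hat)^T
    """
    n, d = X.shape
    mu = X.mean(axis=0)
    centered = X - mu
    Sigma = centered.T @ centered / n      # note: ML divides by n, not n - 1
    return mu, Sigma

# Usage: recover the parameters from 1000 samples of a known Gaussian.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, 2.0], cov=[[1.0, 0.3], [0.3, 0.5]], size=1000)
mu_hat, Sigma_hat = gaussian_ml_estimate(X)
print(mu_hat)
print(Sigma_hat)
```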

Gaussian Mixture Models (GMMs)

  • Approximate true density function using a weighted sum of several Gaussians

    p(\mathbf{x}) = \sum_{i} w_{i} \frac{1}{(2 \pi)^{d/2}|\boldsymbol{\Sigma}_{i}|^{1 / 2}} \exp \left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_{i})^{\top} \boldsymbol{\Sigma}_{i}^{-1}(\mathbf{x}-\boldsymbol{\mu}_{i})\right] \qquad \text{with } \sum_i w_i = 1
  • Any density can be approximated this way with arbitrary precision

    • But might need many Gaussians
    • Difficult to estimate many parameters
  • Use the Expectation Maximization (EM) algorithm to estimate the parameters of the Gaussians as well as the weights (see the code sketch after this list)

    1. Initialize parameters of GMM randomly

    2. Repeat until convergence

      • Expectation (E) step:

        Compute the probability $p_{ij}$ that data point $i$ belongs to Gaussian $j$

        • Take the value of each Gaussian at point $i$ and normalize so they sum up to one
      • Maximization (M) step:

        Compute new GMM parameters using the soft assignments $p_{ij}$

        • Maximum Likelihood with data weighted according to $p_{ij}$
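
The following is a minimal NumPy sketch of this EM loop for a GMM (my own illustration, not from the lecture; it assumes full covariance matrices and runs a fixed number of iterations instead of a proper convergence check):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, seed=0):
    """Fit a k-component GMM to X (n samples x d dims) with EM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # 1. Random initialization: means from random data points, identity covariances.
    means = X[rng.choice(n, size=k, replace=False)]
    covs = np.stack([np.eye(d)] * k)
    weights = np.full(k, 1.0 / k)

    for _ in range(n_iter):
        # E-step: responsibility p_ij that data point i belongs to Gaussian j.
        resp = np.column_stack([
            weights[j] * multivariate_normal(means[j], covs[j]).pdf(X)
            for j in range(k)
        ])
        resp /= resp.sum(axis=1, keepdims=True)        # normalize so each row sums to one

        # M-step: weighted maximum-likelihood updates.
        nk = resp.sum(axis=0)                          # effective number of points per Gaussian
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / nk[j]

    return weights, means, covs

# Usage on synthetic data drawn from two Gaussians.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.multivariate_normal([0, 0], np.eye(2), size=200),
    rng.multivariate_normal([4, 4], 0.5 * np.eye(2), size=200),
])
weights, means, covs = em_gmm(X, k=2)
print(weights, means, sep="\n")
```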

parametric vs. non-parametric

  • Parametric classifiers

    • assume a specific form of probability distribution with some parameters
    • only the parameters need to be estimated
    • 👍 Advantage: Need less training data because there are fewer parameters to estimate
    • 👎 Disadvantage: Only work well if the model fits the data
    • Examples: Gaussian and GMMs
  • Non-parametric classifiers

    • Do NOT assume a specific form of probability distribution

    • 👍 Advantage: Work well for all types of distributions

    • 👎 Disadvantage: Need more data to correctly estimate the distribution

    • Examples: Parzen windows, k-nearest neighbors

generative vs. discriminative

  • A method that models $P(\omega_i)$ and $p(\mathbf{x}|\omega_i)$ explicitly is called a generative model
    • $p(\mathbf{x}|\omega_i)$ allows generating new samples of class $\omega_i$
  • The other common approach is called a discriminative model
    • directly model $P(\omega_i|\mathbf{x})$ or just output a decision $\omega_i$ given an input pattern $\mathbf{x}$
    • easier to train because they solve a simpler problem 👏

Linear Discriminant Functions

  • Separate two classes $\omega_1, \omega_2$ with a linear hyperplane

    y(\mathbf{x}) = \mathbf{w}^{T} \mathbf{x} + w_{0}

    [Screenshot 2020-11-07 16.45.44]

    • Decide $\omega_1$ if $y(\mathbf{x}) > 0$, else $\omega_2$
    • $\mathbf{w}$: normal vector of the hyperplane
  • Example:
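
A minimal sketch of this linear decision rule (my own illustration; the values of $\mathbf{w}$ and $w_0$ are made up):

```python
import numpy as np

w = np.array([1.0, -2.0])   # normal vector of the hyperplane (assumed values)
w0 = 0.5                    # offset

def classify(x):
    """Decide omega_1 if y(x) = w^T x + w0 > 0, else omega_2."""
    y = w @ x + w0
    return "omega_1" if y > 0 else "omega_2"

print(classify(np.array([3.0, 0.0])))   # -> omega_1
print(classify(np.array([0.0, 2.0])))   # -> omega_2
```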

Support Vector Machines

See: SVM

Linear SVMs

  • If the input space is already high-dimensional, linear SVMs can often perform well too
  • 👍 Advantages:
    • Speed: Only one scalar product for classification
    • Memory: Only one vector w needs to be stored
    • Training: Training is much faster
    • Model selection: Only one parameter to optimize

K-nearest Neighbours

  • 💡 Look at the $k$ closest training samples and assign the most frequent label among them (a code sketch follows at the end of this section)

  • Model consists of all training samples

    • Pro: No information is lost

    • Con: A lot of data to manage

  • Naïve implementation: compute distance to each training sample every time

    • Distance metric is needed (Important design parameter!)
      • $L_1$, $L_2$, $L_{\infty}$, Mahalanobis, … or
      • Problem-specific distances
  • kNN is often a good classifier, but:

    • Needs enough data
    • Scalability issues
See also: k-NN
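
A naïve NumPy sketch of the k-NN rule described above (my own illustration, using the L2 distance and a simple majority vote):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Assign the most frequent label among the k closest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training sample
    nearest = np.argsort(dists)[:k]               # indices of the k closest samples
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Usage with a tiny toy training set.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_classify(np.array([0.2, 0.1]), X_train, y_train, k=3))   # -> "A"
```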

Clustering

  • New problem setting
    • Only data points are given, NO class labels
    • Find structures in given data
  • Generally no single correct solution possible

K-means

  • Algorithm

    1. Randomly initialize k cluster centers

    2. Repeat until convergence:

      • Assign all data points to closest cluster center

      • Compute new cluster center as mean of assigned data points

  • 👍 Pros: Simple and efficient

  • 👎 Cons:

    • $k$ needs to be known in advance
    • Results depend on initialization
    • Does not work well for clusters that are not hyperspherical (round) or clusters that overlap
  • Very similar to the EM algorithm

    • Uses hard assignments instead of probabilistic assignments (EM)
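
A compact NumPy sketch of these two alternating steps (my own illustration; it runs a fixed number of iterations instead of testing for convergence and does not handle empty clusters):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Simple k-means: alternate hard assignment and mean update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # 1. random initialization

    for _ in range(n_iter):
        # Assign every data point to its closest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned points.
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

    return centers, labels

# Usage on two well-separated blobs.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(6, 1, size=(100, 2))])
centers, labels = kmeans(X, k=2)
print(centers)
```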

Agglomerative Hierarchical Clustering

  • Algorithm

    1. Start with one cluster for each data point

    2. Repeat

      • Merge the two closest clusters
  • Several possibilities to measure cluster distance

    • Min: minimal distance between elements

    • Max: maximal distance between elements

    • Avg: average distance between elements

    • Mean: distance between cluster means

  • Result is a tree called a dendrogram

  • Example:

    [Screenshot 2020-11-07 17.02.40]
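
For illustration, SciPy's hierarchical clustering offers exactly these linkage criteria ('single' = min, 'complete' = max, 'average' = avg, 'centroid' = cluster means); a minimal sketch on toy data (my own, not from the lecture):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.5, size=(10, 2)), rng.normal(5, 0.5, size=(10, 2))])

# Repeatedly merge the two closest clusters; 'average' = average distance between elements.
Z = linkage(X, method="average")

# Cut the resulting dendrogram into 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# dendrogram(Z) would plot the merge tree (requires matplotlib).
```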

Curse of dimensionality

  • In computer vision, the extracted feature vectors are often high-dimensional

  • Many intuitions about linear algebra are no longer valid in high-dimensional spaces 🤪

    • Classifiers often work better in low-dimensional spaces
  • These problems are called the “curse of dimensionality” 👿

Example

[Screenshot 2020-11-07 17.05.38]

Dimensionality reduction

  • PCA: Leave out dimensions and minimize error made

    [Screenshot 2020-11-07 17.06.38]
  • LDA: Maximize class separability

    [Screenshot 2020-11-07 17.06.58]
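
A minimal NumPy sketch of PCA via the eigendecomposition of the sample covariance (my own illustration; LDA would additionally use the class labels to maximize separability):

```python
import numpy as np

def pca(X, n_components):
    """Project X (n samples x d dims) onto the top principal components."""
    mu = X.mean(axis=0)
    Xc = X - mu                                   # center the data
    cov = Xc.T @ Xc / len(X)                      # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :n_components]        # keep directions of largest variance
    return Xc @ W                                 # reduced representation

# Usage: reduce 5-dimensional data to 2 dimensions.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
X_reduced = pca(X, n_components=2)
print(X_reduced.shape)   # (100, 2)
```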