Gaussian Mixture Model

Gaussian Distribution

Univariate: The Probability Density Function (PDF) is:

$$
P(x | \theta)=\frac{1}{\sqrt{2 \pi \sigma^{2}}} \exp \left(-\frac{(x-\mu)^{2}}{2 \sigma^{2}}\right)
$$
  • $\mu$: mean
  • $\sigma$: standard deviation
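
As a quick sanity check, the univariate density can be evaluated directly; a minimal NumPy sketch (the function name and test value are just for illustration):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density P(x | mu, sigma)."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Standard normal at x = 0 should give 1/sqrt(2*pi) ≈ 0.3989
print(gaussian_pdf(0.0, mu=0.0, sigma=1.0))
```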

Multivariate: The Probability Density Function (PDF) is:

$$
P(x | \theta)=\frac{1}{(2 \pi)^{\frac{D}{2}}|\Sigma|^{\frac{1}{2}}} \exp \left(-\frac{(x-\mu)^{T} \Sigma^{-1}(x-\mu)}{2}\right)
$$
  • $\mu$: mean
  • $\Sigma$: covariance matrix
  • $D$: dimension of the data
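
A direct transcription of this density into NumPy might look like the sketch below (in practice, something like scipy.stats.multivariate_normal is more numerically robust):

```python
import numpy as np

def multivariate_gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density P(x | mu, Sigma) for a D-dimensional point x."""
    D = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

# 2-D standard normal at the origin should give 1/(2*pi) ≈ 0.1592
print(multivariate_gaussian_pdf(np.zeros(2), mu=np.zeros(2), Sigma=np.eye(2)))
```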

Learning

For the univariate Gaussian model, we can use Maximum Likelihood Estimation (MLE) to estimate the parameter $\theta$:

$$
\theta = \underset{\theta}{\operatorname{argmax}}\, L(\theta)
$$

Assuming the data are i.i.d., we have:

$$
L(\theta)=\prod_{j=1}^{N} P\left(x_{j} | \theta\right)
$$

For numerical stability, we usually maximize the log-likelihood instead:

$$
\begin{aligned}
\theta &= \underset{\theta}{\operatorname{argmax}}\, L(\theta) \\
&= \underset{\theta}{\operatorname{argmax}} \log L(\theta) \\
&= \underset{\theta}{\operatorname{argmax}} \sum_{j=1}^{N} \log P\left(x_{j} | \theta\right)
\end{aligned}
$$
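
For a single univariate Gaussian this maximization has a closed-form solution: the sample mean and the (biased) sample variance. A small sketch on synthetic data (the data and numbers are for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)  # synthetic i.i.d. observations

# MLE for a univariate Gaussian: sample mean and biased sample variance
mu_hat = x.mean()
sigma2_hat = ((x - mu_hat) ** 2).mean()

# Log-likelihood of the data under the fitted parameters
log_lik = np.sum(-0.5 * np.log(2 * np.pi * sigma2_hat) - (x - mu_hat) ** 2 / (2 * sigma2_hat))
print(mu_hat, np.sqrt(sigma2_hat), log_lik)
```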

Gaussian Mixture Model

A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. One can think of mixture models as generalizing k-means clustering to incorporate information about the covariance structure of the data as well as the centers of the latent Gaussians.

*Figure: a Gaussian mixture of three normal distributions.*
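
In practice, a GMM can be fitted with scikit-learn's GaussianMixture. A sketch assuming scikit-learn is installed (the synthetic data and hyperparameters are only illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data drawn from three well-separated 2-D Gaussians
X = np.vstack([
    rng.normal([0, 0], 0.5, size=(100, 2)),
    rng.normal([5, 5], 0.8, size=(100, 2)),
    rng.normal([0, 5], 0.6, size=(100, 2)),
])

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)

print(gmm.weights_)      # mixture weights alpha_k
print(gmm.means_)        # component means mu_k
labels = gmm.predict(X)  # hard assignment: argmax_k of the responsibilities
```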

Define:

  • $x_j$: the $j$-th observed data point, $j=1, 2, \dots, N$

  • $K$: number of Gaussian model components

  • $\alpha_k$: probability that an observed data point belongs to the $k$-th model component

    • $\alpha_k \geq 0$
    • $\displaystyle \sum_{k=1}^{K}\alpha_k=1$
  • $\phi(x | \theta_k)$: probability density function of the $k$-th model component

    • $\theta_k = (\mu_k, \sigma_k^2)$
  • $\gamma_{jk}$: probability that the $j$-th observed data point belongs to the $k$-th model component

  • Probability density function of Gaussian mixture model:

    $$
    P(x | \theta)=\sum_{k=1}^{K} \alpha_{k} \phi\left(x | \theta_{k}\right)
    $$

    For this model, the parameters are $\theta=\left(\tilde{\mu}_{k}, \tilde{\sigma}_{k}, \tilde{\alpha}_{k}\right)$, $k = 1, 2, \dots, K$.
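
The mixture density is simply a weighted sum of the component densities. A minimal one-dimensional sketch, assuming SciPy is available (the weights, means, and standard deviations are arbitrary illustration values):

```python
import numpy as np
from scipy.stats import norm

# A hypothetical 1-D mixture with K = 2 components
alphas = np.array([0.3, 0.7])   # mixture weights alpha_k, sum to 1
mus    = np.array([-1.0, 2.0])  # component means mu_k
sigmas = np.array([0.5, 1.0])   # component standard deviations sigma_k

def gmm_pdf(x):
    """P(x | theta) = sum_k alpha_k * phi(x | theta_k)."""
    return sum(a * norm.pdf(x, loc=m, scale=s) for a, m, s in zip(alphas, mus, sigmas))

print(gmm_pdf(0.0))
```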

Expectation-Maximization (EM)

Expectation-Maximization (EM) is an iterative algorithm for finding maximum-likelihood estimates of model parameters. We typically use EM when the data has missing values, or in other words, when the data is incomplete.

These missing variables are called latent variables.

  • NEVER observed
  • We do NOT know the correct values in advance

Since we do not have the values of the latent variables, Expectation-Maximization uses the observed data to estimate the optimal values for these variables and then finds the model parameters. Based on these model parameters, we go back and update the values of the latent variables, and so on.

The Expectation-Maximization algorithm has two steps:

  • E-step: In this step, the available data is used to estimate (guess) the values of the missing variables
  • M-step: Based on the estimated values generated in the E-step, the complete data is used to update the parameters

EM in Gaussian Mixture Model

  • Initialize the parameters ($K$ Gaussian distributions with means $\mu_1, \mu_2, \dots, \mu_K$ and covariances $\Sigma_1, \Sigma_2, \dots, \Sigma_K$)

  • Repeat

    • E-step: For each point $x_j$, calculate the probability that it belongs to cluster/distribution $k$
    $$
    \begin{aligned}
    \gamma_{jk} &= \frac{\text{Probability that } x_j \text{ belongs to cluster } k}{\text{Sum of probabilities that } x_j \text{ belongs to clusters } 1, 2, \dots, K} \\
    &= \frac{\alpha_{k} \phi\left(x_{j} | \theta_{k}\right)}{\sum_{i=1}^{K} \alpha_{i} \phi\left(x_{j} | \theta_{i}\right)} \qquad j=1,2, \ldots, N;\ k=1,2, \ldots, K
    \end{aligned}
    $$

    The value is high when the point is assigned to the right cluster and low otherwise.

    • M-step: update parameters
$$
\alpha_k = \frac{\text{Number of points assigned to cluster } k}{\text{Total number of points}} = \frac{\sum_{j=1}^{N} \gamma_{jk}}{N} \qquad k=1,2, \ldots, K
$$

$$
\mu_{k}=\frac{\sum_{j=1}^{N} \gamma_{jk} x_{j}}{\sum_{j=1}^{N} \gamma_{jk}} \qquad k=1,2, \ldots, K
$$

$$
\Sigma_{k}=\frac{\sum_{j=1}^{N} \gamma_{jk}\left(x_{j}-\mu_{k}\right)\left(x_{j}-\mu_{k}\right)^{T}}{\sum_{j=1}^{N} \gamma_{jk}} \qquad k=1,2, \ldots, K
$$

until convergence ($\left\|\theta_{i+1}-\theta_{i}\right\| < \varepsilon$); a NumPy sketch of this loop is given below.
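
Putting the E- and M-steps together, here is a compact NumPy sketch of the loop for the one-dimensional case (the initialization strategy, synthetic data, and tolerance are just illustrative choices):

```python
import numpy as np

def em_gmm_1d(x, K, max_iter=200, tol=1e-6, seed=0):
    """EM for a one-dimensional Gaussian mixture: returns weights, means, variances."""
    rng = np.random.default_rng(seed)
    N = len(x)

    # Initialization: uniform weights, means sampled from the data, overall variance
    alpha = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)
    sigma2 = np.full(K, x.var())

    for _ in range(max_iter):
        old = np.concatenate([alpha, mu, sigma2])

        # E-step: gamma_jk = alpha_k * phi(x_j | theta_k) / sum_i alpha_i * phi(x_j | theta_i)
        phi = np.exp(-((x[:, None] - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
        gamma = alpha * phi
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step: re-estimate alpha_k, mu_k, sigma2_k from the responsibilities
        Nk = gamma.sum(axis=0)
        alpha = Nk / N
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        sigma2 = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

        # Convergence check: ||theta_{i+1} - theta_i|| < epsilon
        if np.linalg.norm(np.concatenate([alpha, mu, sigma2]) - old) < tol:
            break

    return alpha, mu, sigma2

# Example on synthetic data drawn from two univariate Gaussians
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 0.8, 300), rng.normal(3.0, 1.2, 700)])
print(em_gmm_1d(x, K=2))
```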

Visualization:

*Figure: the EM algorithm updating the parameters of a two-component bivariate Gaussian mixture model.*
