Sklearn 源码阅读笔记（一）GMM

（一）GMM 模型简述

关于模型中_m_step更新公式推导，有两种方式

模型伪代码如下：（具体原理参考西瓜书第九章）

输入：样本集 $D = \{x_1, \cdots, x_m\}$ 高斯混合成分个数 $k$
过程：
- 初始化高斯混合分布的模型参数 $\{(a_i, \mu_i, \Sigma_i\mid 1\leq i\leq k \}$
- Repeat:
  - 计算 $x_j$ 由各个混合成分生成的后验概率 $\gamma_{ji} = p_M(z_j=i|x_j)$ (_e_step)
  - 根据后验概率计算新的模型参数。(_m_step)
- Until : 满足停止条件（收敛/max_iter）
Predict 阶段：根据 $\lambda_j = argmax_{i\in {1, 2,\cdots, k}}\gamma_{ji}$ 确定 $x_j$ 的簇标记 $\lambda_j$
输出：簇划分 $C = \{C_1, \cdots, C_k\}$

（二）Sklearn 实现源代码

源代码仓库：https://github.com/scikit-learn/scikit-learn

Sklearn 中的 GMM：sklearn/mixture/_gaussian_mixture.py

Summary

首先说一下为什么想到要写一篇blog 专门分析一下GMM在sklearn中实现的源代码：

首先，我们从上面的GMM模型伪代码中，我们可以看到：模型本身非常简洁清晰。在西瓜书上，非常全面的伪代码一共不超过20行。然而在sklearn中，为了得到更好的效率，以及得到更全面的功能，sklearn实现中一共洋洋洒洒写了400+行（不算注释）Python。可见其做的优化，并不想我们直观从西瓜书上看到的那样简单。

模型主要的优化体现在 precisions matrix 以及 precision matrix 的 cholesky 分解 用法上，主要用于提高西瓜书上 (9.28) 多元高斯分布公式的计算效率。

$p(x) = \frac{1}{(2\pi)^{\frac{n}{2}} |\Sigma|^{\frac{1}{2}}} e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)} \tag{1}$

事实上，$\Sigma$ 表示的是协方差矩阵，是对称正定矩阵, 其可用逆矩阵（precision matrix）等价代换（对称正定矩阵于其逆矩阵是一对一映射关系），使用 precision matrix 可以在 test time 加速计算后验概率 (1) 式。同时，sklearn 中采用了$\log p(x)$ 的方式计算/存储后验概率.

$\log p(x) = -\frac{n}{2} * \log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu) \tag{2}$

对比这来看sklearn 中计算log的后验函数如下：（我们故意忽略了除了’full’ 以外的方式）

def _estimate_log_gaussian_prob(X, means, precisions_chol, covariance_type):
    """Estimate the log Gaussian probability.

    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)

    means : array-like, shape (n_components, n_features)

    precisions_chol : array-like
        Cholesky decompositions of the precision matrices.
        'full' : shape of (n_components, n_features, n_features)
        'tied' : shape of (n_features, n_features)
        'diag' : shape of (n_components, n_features)
        'spherical' : shape of (n_components,)

    covariance_type : {'full', 'tied', 'diag', 'spherical'}

    Returns
    -------
    log_prob : array, shape (n_samples, n_components)
    """
    n_samples, n_features = X.shape
    n_components, _ = means.shape
    # det(precision_chol) is half of det(precision)
    log_det = _compute_log_det_cholesky(
        precisions_chol, covariance_type, n_features)

    if covariance_type == 'full':
        log_prob = np.empty((n_samples, n_components))
        for k, (mu, prec_chol) in enumerate(zip(means, precisions_chol)):
            y = np.dot(X, prec_chol) - np.dot(mu, prec_chol)
            log_prob[:, k] = np.sum(np.square(y), axis=1)
    elif ...

    return -.5 * (n_features * np.log(2 * np.pi) + log_prob) + log_det

从上面函数中，我们可以看到其调用了_compute_log_det_cholesky 函数来计算 $-\frac{1}{2}\log|\Sigma|$ ,我们稍微往上一翻就找到了这个函数的定义

计算 $\frac{1}{2}\log|\Sigma|$ 的方法：(事实上，用了cholesky 分解的性质 $\Sigma = L\cdot L^T$ 则 $log|L| = \frac{1}{2}log|\Sigma|$)

def _compute_log_det_cholesky(matrix_chol, covariance_type, n_features):
    """Compute the log-det of the cholesky decomposition of matrices.

    Parameters
    ----------
    matrix_chol : array-like
        Cholesky decompositions of the matrices.
        'full' : shape of (n_components, n_features, n_features)
        'tied' : shape of (n_features, n_features)
        'diag' : shape of (n_components, n_features)
        'spherical' : shape of (n_components,)

    covariance_type : {'full', 'tied', 'diag', 'spherical'}

    n_features : int
        Number of features.

    Returns
    -------
    log_det_precision_chol : array-like, shape (n_components,)
        The determinant of the precision matrix for each component.
    """
    if covariance_type == 'full':
        n_components, _, _ = matrix_chol.shape
        // 因为 L 矩阵是下三角矩阵，行列式等于对角线元素的积
        log_det_chol = (np.sum(np.log(
            matrix_chol.reshape(
                n_components, -1)[:, ::n_features + 1]), 1))

    elif ...

    return log_det_chol

M Step 中使用一下函数进行参数的更新

def _estimate_gaussian_parameters(X, resp, reg_covar, covariance_type):
    """Estimate the Gaussian distribution parameters.

    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)
        The input data array.

    resp : array-like, shape (n_samples, n_components)
        The responsibilities for each data sample in X.

    reg_covar : float
        The regularization added to the diagonal of the covariance matrices.

    covariance_type : {'full', 'tied', 'diag', 'spherical'}
        The type of precision matrices.

    Returns
    -------
    nk : array-like, shape (n_components,)
        The numbers of data samples in the current components.

    means : array-like, shape (n_components, n_features)
        The centers of the current components.

    covariances : array-like
        The covariance matrix of the current components.
        The shape depends of the covariance_type.
    """
    nk = resp.sum(axis=0) + 10 * np.finfo(resp.dtype).eps
    means = np.dot(resp.T, X) / nk[:, np.newaxis]
    covariances = {"full": _estimate_gaussian_covariances_full,
                   "tied": _estimate_gaussian_covariances_tied,
                   "diag": _estimate_gaussian_covariances_diag,
                   "spherical": _estimate_gaussian_covariances_spherical
                   }[covariance_type](resp, X, nk, means, reg_covar)
    return nk, means, covariances

可以从上面更新步骤中看出，$nk$ 代表的是西瓜书中的 $\sum_{j=1}^m\gamma_{ji}$ 表示的是所有样本中属于簇 i 的后验概率之和。计算 covariances 的时候使用的函数随着type不同变换。

计算 covariance_type 为 ‘full’ 的 covariance.

def _estimate_gaussian_covariances_full(resp, X, nk, means, reg_covar):
    """Estimate the full covariance matrices.

    Parameters
    ----------
    resp : array-like, shape (n_samples, n_components)

    X : array-like, shape (n_samples, n_features)

    nk : array-like, shape (n_components,)

    means : array-like, shape (n_components, n_features)

    reg_covar : float

    Returns
    -------
    covariances : array, shape (n_components, n_features, n_features)
        The covariance matrix of the current components.
    """
    n_components, n_features = means.shape
    covariances = np.empty((n_components, n_features, n_features))
    for k in range(n_components):
        diff = X - means[k]
        covariances[k] = np.dot(resp[:, k] * diff.T, diff) / nk[k]
        covariances[k].flat[::n_features + 1] += reg_covar
    return covariances

以上就是GMM 中Sklearn中的所有优化。关于后面模型的fit / predict 过程这里不多赘述。(后续如果有时间更新…)

Jason yuan

http://Columbine21.github.io/2020/05/24/sklearn-one/