Sklearn 源码阅读笔记(一)GMM
(一)GMM 模型简述
关于模型中_m_step更新公式推导,有两种方式
(推荐一:https://datawhalechina.github.io/pumpkin-book/#/chapter9/chapter9)
(推荐二:https://zhuanlan.zhihu.com/p/24709748)
模型伪代码如下:(具体原理参考西瓜书第九章)
- 输入: 样本集 $D = \{x_1, \cdots, x_m\}$ 高斯混合成分个数 $k$
- 过程:
- 初始化高斯混合分布的模型参数 $\{(a_i, \mu_i, \Sigma_i\mid 1\leq i\leq k \}$
- Repeat:
- 计算 $x_j$ 由各个混合成分生成的后验概率 $\gamma_{ji} = p_M(z_j=i|x_j)$ (_e_step)
- 根据后验概率计算新的模型参数。(_m_step)
- Until : 满足停止条件(收敛/max_iter)
- Predict 阶段:根据 $\lambda_j = argmax_{i\in {1, 2,\cdots, k}}\gamma_{ji}$ 确定 $x_j$ 的簇标记 $\lambda_j$
- 输出:簇划分 $C = \{C_1, \cdots, C_k\}$
(二)Sklearn 实现源代码
源代码仓库:https://github.com/scikit-learn/scikit-learn
Sklearn 中的 GMM:sklearn/mixture/_gaussian_mixture.py
Summary
首先说一下为什么想到要写一篇blog 专门分析一下GMM在sklearn中实现的源代码:
首先,我们从上面的GMM模型伪代码中,我们可以看到:模型本身非常简洁清晰。在西瓜书上,非常全面的伪代码一共不超过20行。然而在sklearn中,为了得到更好的效率,以及得到更全面的功能,sklearn实现中一共洋洋洒洒写了400+行(不算注释)Python。可见其做的优化,并不想我们直观从西瓜书上看到的那样简单。
模型主要的优化体现在 precisions matrix 以及 precision matrix 的 cholesky 分解 用法上,主要用于提高西瓜书上 (9.28) 多元高斯分布公式的计算效率。
事实上,$\Sigma$ 表示的是协方差矩阵,是对称正定矩阵, 其可用逆矩阵(precision matrix)等价代换(对称正定矩阵于其逆矩阵是一对一映射关系),使用 precision matrix 可以在 test time 加速计算后验概率 (1) 式。同时,sklearn 中采用了$\log p(x)$ 的方式计算/存储后验概率.
对比这来看sklearn 中计算log的后验函数如下:(我们故意忽略了除了’full’ 以外的方式)
def _estimate_log_gaussian_prob(X, means, precisions_chol, covariance_type): """Estimate the log Gaussian probability. Parameters ---------- X : array-like, shape (n_samples, n_features) means : array-like, shape (n_components, n_features) precisions_chol : array-like Cholesky decompositions of the precision matrices. 'full' : shape of (n_components, n_features, n_features) 'tied' : shape of (n_features, n_features) 'diag' : shape of (n_components, n_features) 'spherical' : shape of (n_components,) covariance_type : {'full', 'tied', 'diag', 'spherical'} Returns ------- log_prob : array, shape (n_samples, n_components) """ n_samples, n_features = X.shape n_components, _ = means.shape # det(precision_chol) is half of det(precision) log_det = _compute_log_det_cholesky( precisions_chol, covariance_type, n_features) if covariance_type == 'full': log_prob = np.empty((n_samples, n_components)) for k, (mu, prec_chol) in enumerate(zip(means, precisions_chol)): y = np.dot(X, prec_chol) - np.dot(mu, prec_chol) log_prob[:, k] = np.sum(np.square(y), axis=1) elif ... return -.5 * (n_features * np.log(2 * np.pi) + log_prob) + log_det
从上面函数中,我们可以看到其调用了_compute_log_det_cholesky 函数来计算 $-\frac{1}{2}\log|\Sigma|$ ,我们稍微往上一翻就找到了这个函数的定义
计算 $\frac{1}{2}\log|\Sigma|$ 的方法:(事实上,用了cholesky 分解的性质 $\Sigma = L\cdot L^T$ 则 $log|L| = \frac{1}{2}log|\Sigma|$)
def _compute_log_det_cholesky(matrix_chol, covariance_type, n_features): """Compute the log-det of the cholesky decomposition of matrices. Parameters ---------- matrix_chol : array-like Cholesky decompositions of the matrices. 'full' : shape of (n_components, n_features, n_features) 'tied' : shape of (n_features, n_features) 'diag' : shape of (n_components, n_features) 'spherical' : shape of (n_components,) covariance_type : {'full', 'tied', 'diag', 'spherical'} n_features : int Number of features. Returns ------- log_det_precision_chol : array-like, shape (n_components,) The determinant of the precision matrix for each component. """ if covariance_type == 'full': n_components, _, _ = matrix_chol.shape // 因为 L 矩阵是下三角矩阵,行列式等于对角线元素的积 log_det_chol = (np.sum(np.log( matrix_chol.reshape( n_components, -1)[:, ::n_features + 1]), 1)) elif ... return log_det_chol
M Step 中使用一下函数进行参数的更新
def _estimate_gaussian_parameters(X, resp, reg_covar, covariance_type): """Estimate the Gaussian distribution parameters. Parameters ---------- X : array-like, shape (n_samples, n_features) The input data array. resp : array-like, shape (n_samples, n_components) The responsibilities for each data sample in X. reg_covar : float The regularization added to the diagonal of the covariance matrices. covariance_type : {'full', 'tied', 'diag', 'spherical'} The type of precision matrices. Returns ------- nk : array-like, shape (n_components,) The numbers of data samples in the current components. means : array-like, shape (n_components, n_features) The centers of the current components. covariances : array-like The covariance matrix of the current components. The shape depends of the covariance_type. """ nk = resp.sum(axis=0) + 10 * np.finfo(resp.dtype).eps means = np.dot(resp.T, X) / nk[:, np.newaxis] covariances = {"full": _estimate_gaussian_covariances_full, "tied": _estimate_gaussian_covariances_tied, "diag": _estimate_gaussian_covariances_diag, "spherical": _estimate_gaussian_covariances_spherical }[covariance_type](resp, X, nk, means, reg_covar) return nk, means, covariances
可以从上面更新步骤中看出,$nk$ 代表的是 西瓜书中的 $\sum_{j=1}^m\gamma_{ji}$ 表示的是所有样本中属于簇 i 的后验概率之和。计算 covariances 的时候使用的函数随着type不同变换。
计算 covariance_type 为 ‘full’ 的 covariance.
def _estimate_gaussian_covariances_full(resp, X, nk, means, reg_covar): """Estimate the full covariance matrices. Parameters ---------- resp : array-like, shape (n_samples, n_components) X : array-like, shape (n_samples, n_features) nk : array-like, shape (n_components,) means : array-like, shape (n_components, n_features) reg_covar : float Returns ------- covariances : array, shape (n_components, n_features, n_features) The covariance matrix of the current components. """ n_components, n_features = means.shape covariances = np.empty((n_components, n_features, n_features)) for k in range(n_components): diff = X - means[k] covariances[k] = np.dot(resp[:, k] * diff.T, diff) / nk[k] covariances[k].flat[::n_features + 1] += reg_covar return covariances
以上就是GMM 中Sklearn中的所有优化。关于后面模型的fit / predict 过程这里不多赘述。(后续如果有时间更新…)