Journal of Frontiers of Computer Science and Technology, 2023, Vol. 17, Issue (2): 376-384. DOI: 10.3778/j.issn.1673-9418.2104091

• Theory · Algorithm •

Variational Deep Generative Clustering Model Under Entropy Regularizations

ZHANG Zhiyuan, CHEN Yarui, YANG Jianning, DING Wenqiang, YANG Jucheng   

  1. College of Artificial Intelligence, Tianjin University of Science & Technology, Tianjin 300457, China
  • Online: 2023-02-01 Published: 2023-02-01

Abstract: Deep-learning-based clustering methods learn latent feature representations of data automatically and apply readily to large-scale, high-dimensional datasets. Traditional deep clustering methods focus mainly on extracting latent features with deep neural networks to improve clustering accuracy. They pay less attention to the certainty of data categories in the clustering task, and they lack an analysis of the distribution of the discrete latent vector after constraints are imposed. This paper proposes a variational deep generative clustering model under entropy regularizations (VDGC-ER). The model uses the variational autoencoder as its basic framework, places a Gaussian mixture prior on the continuous latent vector, and takes the discrete latent vector of the Gaussian mixture as the category vector. A sample entropy regularization term on the discrete latent vector sharpens the distinction between predicted cluster categories. An aggregated sample entropy regularization term, also defined on the discrete latent vector, reduces clustering imbalance, helps avoid local optima, and improves the diversity of generated data. The optimization objective of VDGC-ER is estimated with Monte Carlo sampling and the reparameterization strategy, and the model parameters are solved by stochastic gradient descent. Comparison experiments on the MNIST, REUTERS, REUTERS-10K, and HHAR datasets show that VDGC-ER not only generates high-quality samples but also significantly improves clustering accuracy.

Key words: variational autoencoder, probabilistic generative model, variational inference, entropy regularization, clustering
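
The abstract names two entropy terms on the discrete latent (category) vector but does not give their formulas. The sketch below is one plausible reading, not the paper's definition: it assumes the per-sample entropy of the cluster posterior is minimized (confident assignments) and the entropy of the batch-averaged posterior is maximized (balanced clusters). The function names, the weights `alpha` and `beta`, and the sign convention are all hypothetical.

```python
import math

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of one discrete distribution."""
    return -sum(pi * math.log(pi + eps) for pi in p)

def sample_entropy(q):
    """Mean per-sample entropy of the cluster posteriors q[n][k]."""
    return sum(entropy(row) for row in q) / len(q)

def aggregated_entropy(q):
    """Entropy of the cluster posterior averaged over the batch."""
    n, k = len(q), len(q[0])
    mean = [sum(row[j] for row in q) / n for j in range(k)]
    return entropy(mean)

def entropy_regularizer(q, alpha=1.0, beta=1.0):
    """Hypothetical penalty to add to a loss that is minimized.

    Assumption: low per-sample entropy and high aggregated entropy
    are both desirable, so the first is added and the second subtracted.
    """
    return alpha * sample_entropy(q) - beta * aggregated_entropy(q)

# Toy cluster posteriors for two samples and two clusters.
confident_balanced = [[1.0, 0.0], [0.0, 1.0]]  # sharp and evenly used
collapsed = [[1.0, 0.0], [1.0, 0.0]]           # sharp but one cluster only
uncertain = [[0.5, 0.5], [0.5, 0.5]]           # balanced but not sharp
```

Under these assumptions, `confident_balanced` receives a lower penalty than either `collapsed` or `uncertain`, which matches the two goals stated in the abstract: distinct category predictions and reduced clustering imbalance.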