Journal of Frontiers of Computer Science and Technology ›› 2021, Vol. 15 ›› Issue (10): 1949-1957. DOI: 10.3778/j.issn.1673-9418.2007010

• Artificial Intelligence •

Improved Deep Embedding Clustering with Ensemble Learning

HUANG Yuxiang, HUANG Dong, WANG Changdong, LAI Jianhuang

  1. College of Mathematics and Informatics, South China Agricultural University, Guangzhou 510642, China
  2. Guangzhou Key Laboratory of Intelligent Agriculture, Guangzhou 510642, China
  3. School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou 510006, China
  • Online: 2021-10-01 Published: 2021-09-30

Abstract:

The rapid development of deep learning in recent years has provided a powerful tool for clustering research and has given rise to many deep neural network-based clustering methods. Among them, deep embedding clustering (DEC) has drawn increasing attention because it optimizes deep representation learning and cluster assignment simultaneously. However, one limitation of DEC is its sensitivity to the hyper-parameter λ, which often requires manual tuning. To address this problem, this paper presents an improved deep embedding clustering method with ensemble learning (IDECEL). Instead of searching for a single optimal hyper-parameter, IDECEL uses a set of diversified values of λ to construct an ensemble of diverse base clusterings. Drawing on entropy theory, it evaluates the uncertainty of the clusters in these base clusterings and weights them accordingly. It then constructs a locally weighted bipartite graph between clusters and data samples and partitions the graph efficiently to obtain a better clustering result. Experimental results on multiple datasets show that IDECEL not only alleviates the hyper-parameter sensitivity of DEC but also achieves more robust clustering performance than several other deep clustering and ensemble clustering methods.

Key words: data clustering, deep clustering, ensemble clustering, ensemble learning, sensitive hyper-parameters
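
The entropy-based weighting and the cluster-to-sample bipartite graph described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything here is an assumption: the uncertainty of a cluster is taken as the summed entropy of how its members are split across every base clustering, the weight is taken as `exp(-H / (theta * M))`, and the names `cluster_uncertainty_weights`, `weighted_bipartite_matrix`, and `theta` are hypothetical. The base clusterings would come from DEC runs with different values of λ; here they are simply label arrays, and the final graph partitioning step is not shown.

```python
import numpy as np


def cluster_uncertainty_weights(base_labels, theta=0.4):
    """Weight every cluster in an ensemble by its (assumed) entropy-based uncertainty.

    base_labels: (M, N) integer array; row m holds the m-th base clustering.
    theta: hypothetical smoothing parameter of the assumed weighting formula.
    Returns a list of (clustering_index, cluster_id, weight) tuples.
    """
    base_labels = np.asarray(base_labels)
    n_clusterings = base_labels.shape[0]
    weights = []
    for m in range(n_clusterings):
        for c in np.unique(base_labels[m]):
            members = base_labels[m] == c
            entropy = 0.0
            for k in range(n_clusterings):
                # How the members of this cluster are split by clustering k.
                _, counts = np.unique(base_labels[k][members], return_counts=True)
                p = counts / counts.sum()
                entropy -= np.sum(p * np.log2(p))
            # Assumed weighting: stable clusters (low entropy) get weights near 1.
            weights.append((m, c, float(np.exp(-entropy / (theta * n_clusterings)))))
    return weights


def weighted_bipartite_matrix(base_labels, theta=0.4):
    """Build a samples-by-clusters matrix whose nonzero entries are cluster weights."""
    base_labels = np.asarray(base_labels)
    weights = cluster_uncertainty_weights(base_labels, theta)
    n_samples = base_labels.shape[1]
    B = np.zeros((n_samples, len(weights)))
    for j, (m, c, w) in enumerate(weights):
        # Link each sample to the cluster it belongs to, weighted by cluster reliability.
        B[base_labels[m] == c, j] = w
    return B


if __name__ == "__main__":
    # Three toy base clusterings of four samples (stand-ins for DEC runs with different λ).
    ensemble = [[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 1, 1, 1]]
    print(weighted_bipartite_matrix(ensemble).round(3))
```

In this sketch each sample is linked to exactly one cluster per base clustering, so a row of the matrix has as many nonzero entries as there are base clusterings. A consensus clustering would then be obtained by partitioning this bipartite graph, for which the abstract names no specific algorithm.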