Journal of Frontiers of Computer Science and Technology (计算机科学与探索), 2021, Vol. 15, Issue 12: 2362-2373. DOI: 10.3778/j.issn.1673-9418.2007012

• Artificial Intelligence •

Ensemble Clustering Algorithm Based on Weighted Super Cluster (超簇加权的集成聚类算法)

XUE Hongyan (薛红艳), QIAN Xuezhong (钱雪忠), ZHOU Shibing (周世兵)

  1. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu 214122, China
  • Online: 2021-12-01 Published: 2021-12-09

Abstract:

Most ensemble clustering algorithms generate their base clusterings with K-means, and the resulting base clusterings are often of limited quality. When the base clusterings are combined through a co-association matrix, differences in their diversity are usually ignored, all base clusterings are treated equally, and the matrix is built with individual samples as the unit of operation, so the computational burden grows significantly when the number of samples or the ensemble size is large. To address these problems, an ensemble clustering algorithm based on weighted super clusters (ECWSC) is proposed. The algorithm obtains landmark points by combining random selection with K-means-based selection, applies spectral clustering to the landmarks, and then maps each sample to its nearest landmark to generate a base clustering. On this basis, the uncertainty of each base clustering is computed from information entropy and a corresponding weight is assigned; a co-association matrix over weighted super clusters is then built by weighting, and hierarchical clustering is applied to this matrix to obtain the ensemble result. Seven real datasets and four synthetic datasets are used to evaluate the algorithm in terms of accuracy, robustness, and time complexity. Comparative experiments show that the algorithm effectively improves ensemble clustering performance.
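The base-clustering step described in the abstract (select landmarks, cluster the landmarks, map each sample to its nearest landmark) can be illustrated with a minimal standard-library sketch. Everything here is an assumption for illustration rather than the authors' implementation: the function names `kmeans`, `select_landmarks`, and `propagate_labels` are hypothetical, the hybrid selection is read as "draw random candidates, then use K-means centroids of the candidates as landmarks", and the spectral clustering of the landmarks is replaced by a trivial stand-in so that the example stays self-contained.

```python
import math
import random


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's K-means; returns k centroids as tuples."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            groups[nearest].append(p)
        # Recompute each centroid; keep the old center if its group is empty.
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers


def select_landmarks(samples, n_candidates, n_landmarks, rng):
    """Assumed hybrid selection: random candidates, then K-means centroids."""
    candidates = rng.sample(samples, n_candidates)
    return kmeans(candidates, n_landmarks, rng)


def propagate_labels(samples, landmarks, landmark_labels):
    """Give each sample the cluster label of its nearest landmark."""
    labels = []
    for s in samples:
        nearest = min(range(len(landmarks)),
                      key=lambda j: math.dist(s, landmarks[j]))
        labels.append(landmark_labels[nearest])
    return labels


# Toy data: two well-separated 2-D blobs of 50 points each.
rng = random.Random(0)
samples = [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3))
           for cx, cy in [(0, 0), (5, 5)] for _ in range(50)]

landmarks = select_landmarks(samples, n_candidates=40, n_landmarks=6, rng=rng)

# Stand-in for spectral clustering of the landmarks (NOT spectral clustering):
# a simple threshold that separates the two blobs.
landmark_labels = [0 if x + y < 5 else 1 for x, y in landmarks]

base_clustering = propagate_labels(samples, landmarks, landmark_labels)
```

Because only the landmarks are clustered, the expensive clustering step runs on 6 points here instead of 100; repeating the procedure with different random draws would yield the diverse base clusterings that an ensemble needs.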

Key words: landmark sampling, spectral clustering, ensemble clustering, co-association matrix, weighting strategy
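The consensus stage outlined in the abstract (entropy-based weights for base clusterings, a weighted co-association matrix over super clusters, then hierarchical clustering) might look roughly like the sketch below. The abstract does not give the formulas, so several choices are assumptions: a super cluster is taken to be a group of samples that share the same label in every base clustering, the uncertainty of a base clustering is taken to be the mean entropy with which its clusters are split by the other base clusterings, the weight is taken to be exp(-entropy), and average linkage is used for the hierarchical step. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def clustering_entropy(labels, others):
    """Mean entropy of how each cluster in `labels` is split by `others`."""
    clusters = defaultdict(list)
    for i, lab in enumerate(labels):
        clusters[lab].append(i)
    total = 0.0
    for members in clusters.values():
        for other in others:
            counts = Counter(other[i] for i in members)
            total -= sum(c / len(members) * math.log2(c / len(members))
                         for c in counts.values())
    return total / (len(clusters) * max(len(others), 1))


def clustering_weights(base):
    """Assumed weighting: exp(-entropy), normalised to sum to 1."""
    raw = [math.exp(-clustering_entropy(b, base[:m] + base[m + 1:]))
           for m, b in enumerate(base)]
    return [r / sum(raw) for r in raw]


def super_clusters(base):
    """Group samples that share the same label in every base clustering."""
    groups = defaultdict(list)
    for i, key in enumerate(zip(*base)):
        groups[key].append(i)
    return list(groups.items())  # [(label tuple, sample indices), ...]


def weighted_coassociation(keys, weights):
    """Entry (a, b): total weight of base clusterings joining a and b."""
    n = len(keys)
    return [[sum(w for w, x, y in zip(weights, keys[a], keys[b]) if x == y)
             for b in range(n)] for a in range(n)]


def average_linkage(sim, sizes, k):
    """Merge the most similar groups of super clusters until k remain."""
    groups = [[a] for a in range(len(sim))]

    def link(g, h):
        # Size-weighted mean similarity, i.e. sample-level average linkage.
        num = sum(sim[a][b] * sizes[a] * sizes[b] for a in g for b in h)
        return num / (sum(sizes[a] for a in g) * sum(sizes[b] for b in h))

    while len(groups) > k:
        pairs = ((i, j) for i in range(len(groups))
                 for j in range(i + 1, len(groups)))
        i, j = max(pairs, key=lambda p: link(groups[p[0]], groups[p[1]]))
        groups[i] += groups.pop(j)
    return groups


def consensus(base, k):
    """Weighted super-cluster consensus; returns one label per sample."""
    weights = clustering_weights(base)
    sc = super_clusters(base)
    keys = [key for key, _ in sc]
    sizes = [len(members) for _, members in sc]
    groups = average_linkage(weighted_coassociation(keys, weights), sizes, k)
    labels = [None] * len(base[0])
    for group_id, group in enumerate(groups):
        for a in group:
            for i in sc[a][1]:
                labels[i] = group_id
    return labels


# Three base clusterings of six samples; the third disagrees with the others.
base = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 2, 2],
]
final_labels = consensus(base, k=2)
```

In this toy input the six samples collapse into four super clusters, so the co-association matrix is 4×4 rather than 6×6. This is the kind of saving the abstract alludes to when it criticises building the matrix with individual samples as the unit of operation.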