Ensemble Clustering Algorithm Based on Weighted Super Cluster

doi:10.3778/j.issn.1673-9418.2007012

Abstract:

Most ensemble clustering algorithms generate their base clusterings with K-means, and the resulting base clusterings are often of poor quality. Most also ignore differences in diversity among base clusterings, treat all base clusterings equally, and build the co-association matrix at the sample level, so the computational burden grows significantly when the number of samples or the ensemble size is large. To address these problems, an ensemble clustering algorithm based on weighted super clusters (ECWSC) is proposed. The algorithm combines random selection with K-means-based selection to obtain landmark points, applies spectral clustering to the landmarks, and then maps each sample to its nearest landmark to produce a base clustering. The uncertainty of each base clustering is then computed on the basis of information entropy, and a corresponding weight is assigned to it. A co-association matrix over weighted super clusters is built from these weights, and hierarchical clustering is applied to this matrix to obtain the ensemble result. Seven real datasets and four artificial datasets are used to evaluate the accuracy, robustness, and time complexity of the method. The experimental results show that the algorithm effectively improves ensemble clustering performance.

Key words: landmark sampling, spectral clustering, ensemble clustering, co-association matrix, weighted strategy
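
The weighting and co-association steps summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact uncertainty measure, the weight function, or the definition of a super cluster, so everything specific below is an assumption made for illustration and should not be read as the authors' formulation: uncertainty is taken as the size-weighted average entropy of each cluster when it is split by the other base clusterings, the weight is `exp(-uncertainty)` normalized to sum to one, and a "super cluster" is taken to be a group of samples that share the same label in every base clustering. The function names are hypothetical.

```python
import numpy as np

def entropy_weights(labels):
    """Assumed weighting: labels is an (M, n) integer array, one row per
    base clustering. Returns M weights that sum to 1; a base clustering
    whose clusters are split more by the others gets a lower weight."""
    M, n = labels.shape
    uncertainty = np.zeros(M)
    for m in range(M):
        for c in np.unique(labels[m]):
            members = labels[m] == c
            size = members.sum()
            h = 0.0
            for other in range(M):
                if other == m:
                    continue
                # How cluster c is distributed over the clusters of `other`.
                _, counts = np.unique(labels[other][members], return_counts=True)
                p = counts / size
                h -= np.sum(p * np.log2(p))
            # Average entropy over the other clusterings, weighted by cluster size.
            uncertainty[m] += (size / n) * h / max(M - 1, 1)
    w = np.exp(-uncertainty)  # assumed mapping from uncertainty to weight
    return w / w.sum()

def weighted_super_cluster_matrix(labels, weights):
    """Assumed super clusters: samples with identical label vectors across
    all base clusterings are collapsed into one unit. Returns the weighted
    co-association matrix over units and the unit index of each sample."""
    patterns, unit_of_sample = np.unique(labels.T, axis=0, return_inverse=True)
    unit_of_sample = np.asarray(unit_of_sample).ravel()
    # same[i, j, m] is True when units i and j share a cluster in clustering m.
    same = patterns[:, None, :] == patterns[None, :, :]
    co = (same * weights).sum(axis=2)
    return co, unit_of_sample

# Three toy base clusterings of four samples; the third disagrees with the others.
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [0, 1, 0, 1]])
weights = entropy_weights(labels)
co, unit_of_sample = weighted_super_cluster_matrix(labels, weights)
```

Because the matrix is built over super clusters rather than samples, its size depends on the number of distinct label patterns, which is the source of the savings when many samples always co-occur. A final consensus could then be obtained by running an agglomerative method on `1 - co` (for instance with `scipy.cluster.hierarchy.linkage`) and propagating each unit's label back through `unit_of_sample`; the linkage criterion used in the paper is not stated in the abstract.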

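Abstract (translated from the Chinese):

Most ensemble clustering algorithms use the K-means algorithm to generate base clusterings, and the resulting base clusterings are not ideal. When a co-association matrix is used to integrate the base clusterings, differences in their diversity are usually ignored, all base clusterings are treated equally, and the matrix is generated with individual samples as the unit of operation. When the number of samples or the ensemble size is large, the computational burden increases significantly. To address these problems, an ensemble clustering algorithm based on weighted super clusters (ECWSC) is proposed. The algorithm combines random point selection with K-means point selection to obtain landmark points, applies spectral clustering to the landmark points to obtain their clustering result, and then maps each sample point to its nearest landmark point to generate a base clustering. On this basis, the uncertainty of each base clustering is computed from its information entropy, each base clustering is assigned a corresponding weight, a co-association matrix of weighted super clusters is obtained by weighting, and hierarchical clustering is applied to the co-association matrix to obtain the ensemble result. Seven real datasets and four artificial datasets are selected as experimental datasets, and the algorithm is verified in terms of accuracy, robustness, and time complexity. The comparative experimental results show that the algorithm can effectively improve the performance of ensemble clustering.

Key words (translated from the Chinese): landmark sampling, spectral clustering, clustering ensemble, co-association matrix, weighting strategy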

XUE Hongyan, QIAN Xuezhong, ZHOU Shibing. Ensemble Clustering Algorithm Based on Weighted Super Cluster[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(12): 2362-2373.

薛红艳, 钱雪忠, 周世兵. 超簇加权的集成聚类算法[J]. 计算机科学与探索, 2021, 15(12): 2362-2373.
