Fast Graph Clustering Algorithm Based on Selection of Key Nodes

doi:10.3778/j.issn.1673-9418.2007004

Abstract

Spectral clustering, a representative graph clustering algorithm, has attracted wide attention because it adapts well to complex data distributions and yields good clustering results. However, its high time complexity makes it difficult to apply to large-scale data. To improve the usability of spectral clustering on large-scale data sets, a fast graph clustering algorithm based on the selection of key nodes is proposed. The algorithm consists of three steps. First, a fast node importance evaluation method is proposed that takes both cohesiveness and separateness into account. Second, key nodes are selected in place of the original data set to construct a bipartite graph, and approximate eigenvectors of the data are obtained by singular value decomposition. Third, the approximate eigenvectors from multiple runs are integrated to improve the robustness of the approximate spectral clustering result. The algorithm reduces the time complexity from the O(n^3) of standard spectral clustering to O(t(n + 2n^2)), which makes spectral clustering more usable on large-scale data sets. In experiments comparing the algorithm with seven other representative spectral clustering algorithms on five benchmark data sets, the results show that it identifies complex cluster structures in data more efficiently than the other algorithms.

Key words: cluster analysis, graph clustering, spectral clustering, cluster ensemble, selection of key nodes
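
The three steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the node importance score, so `select_key_nodes` below uses uniform random sampling as a stand-in, and the multiple runs are combined by simply concatenating their embeddings. The function names, the Gaussian affinity, and the parameters `p` (number of key nodes), `t` (number of runs), and `sigma` (kernel width) are all assumptions made for illustration.

```python
import numpy as np


def select_key_nodes(X, p, rng):
    # ASSUMPTION: uniform sampling stands in for the paper's node importance
    # score (cohesiveness and separateness), which the abstract does not define.
    return rng.choice(X.shape[0], size=p, replace=False)


def bipartite_embedding(X, key_idx, k, sigma):
    """Approximate spectral embedding from an n x p data-to-key-node graph."""
    Z = X[key_idx]                                      # the p key nodes
    sq_dist = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    B = np.exp(-sq_dist / (2.0 * sigma ** 2))           # n x p edge weights
    row_deg = B.sum(axis=1) + 1e-12                     # data-node degrees
    col_deg = B.sum(axis=0) + 1e-12                     # key-node degrees
    B_norm = B / np.sqrt(row_deg)[:, None] / np.sqrt(col_deg)[None, :]
    # Thin SVD of the normalized bipartite matrix; the top-k left singular
    # vectors play the role of approximate eigenvectors for the n data points.
    U, _, _ = np.linalg.svd(B_norm, full_matrices=False)
    E = U[:, :k]
    return E / (np.linalg.norm(E, axis=1, keepdims=True) + 1e-12)


def kmeans(E, k, n_iter=50):
    """Plain Lloyd iterations with farthest-point initialization."""
    centers = [E[0]]
    for _ in range(1, k):
        d = np.min([((E - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(E[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = ((E[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = E[labels == j].mean(axis=0)
    return labels


def fast_graph_clustering(X, k, p=30, t=5, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    embeddings = [
        bipartite_embedding(X, select_key_nodes(X, p, rng), k, sigma)
        for _ in range(t)
    ]
    # ASSUMPTION: the t runs are integrated by concatenating their embeddings;
    # the paper's actual integration scheme may differ.
    return kmeans(np.hstack(embeddings), k)


# Usage on three synthetic, well-separated Gaussian blobs.
rng = np.random.default_rng(1)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((100, 2)) for c in blob_centers])
labels = fast_graph_clustering(X, k=3)
print(np.bincount(labels))  # cluster sizes
```

In this sketch each run only factorizes an n x p matrix rather than an n x n affinity matrix, so the thin SVD costs O(np^2) per run instead of the O(n^3) of a full eigendecomposition.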

YOU Fangzhou, BAI Liang. Fast Graph Clustering Algorithm Based on Selection of Key Nodes[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(10): 1930-1937.
