
Journal of Frontiers of Computer Science and Technology ›› 2025, Vol. 19 ›› Issue (11): 2873-2894. DOI: 10.3778/j.issn.1673-9418.2502024
SHI Dongyan, MA Lerong, DING Cangfeng, NING Qinwei, CAO Jiangjiang
Online: 2025-11-01
Published: 2025-10-30
Abstract: Text clustering is one of the core techniques of unsupervised learning; its goal is to automatically partition massive text collections into clusters of high semantic similarity. In recent years, deep-learning-based text clustering has developed rapidly, with research increasingly focused on using advanced deep learning architectures to extract text features efficiently and thereby further improve clustering accuracy. In particular, clustering strategies built on large pre-trained language models such as RoBERTa and GPT have shown outstanding performance, owing to their powerful pre-trained feature representations. Through examples and data, this paper comprehensively reviews the development history, current progress, and task characteristics of text clustering, aiming to present its latest trends and its influence in the field of data mining in an intuitive way. It proposes a novel taxonomy of text clustering models oriented to the characteristics of deep learning architectures, dividing models according to their core mechanism and feature extraction path in the clustering task. The coverage ranges from traditional clustering algorithms to cutting-edge techniques, including K-means, spectral clustering, autoencoders, generative models, graph convolutional networks, and large language models, with in-depth analysis of their implementation details. Finally, the strengths and limitations of existing methods are analyzed, and possible future research directions are discussed on that basis.
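Among the traditional methods covered by the survey, K-means is the simplest: documents are first represented as vectors, then each vector is assigned to its nearest centroid and centroids are re-estimated from their clusters until convergence (Lloyd's algorithm). A minimal pure-Python sketch, where the 2-D "embeddings" are hypothetical stand-ins for real text features:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's K-means: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct points
    for _ in range(iters):
        # assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # update step: move each centroid to the mean of its cluster
        new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[j]
               for j, cl in enumerate(clusters)]
        if new == centroids:  # converged: assignments can no longer change
            break
        centroids = new
    labels = [min(range(k),
                  key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
              for p in points]
    return labels, centroids

# hypothetical 2-D document embeddings forming two well-separated topics
emb = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
       (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, _ = kmeans(emb, k=2)
```

In deep text clustering, the same assignment/update loop is typically run on learned embeddings (e.g. from an autoencoder or a pre-trained language model) rather than on raw bag-of-words vectors, which is what distinguishes the deep methods in this taxonomy from the classic pipeline.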
SHI Dongyan, MA Lerong, DING Cangfeng, NING Qinwei, CAO Jiangjiang. Advances in Text Clustering Models Based on Deep Learning Approaches[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(11): 2873-2894.