Journal of Frontiers of Computer Science and Technology, 2025, Vol. 19, Issue 11: 2873-2894. DOI: 10.3778/j.issn.1673-9418.2502024

• Frontiers·Surveys •

Advances in Text Clustering Models Based on Deep Learning Approaches

SHI Dongyan, MA Lerong, DING Cangfeng, NING Qinwei, CAO Jiangjiang   

  1. College of Mathematics and Computer Science, Yan'an University, Yan'an, Shaanxi 716000, China
  • Online: 2025-11-01 Published: 2025-10-30

深度学习方法下的文本聚类模型研究进展

史东艳,马乐荣,丁苍峰,宁秦伟,曹江江   

  1. 延安大学 数学与计算机科学学院,陕西 延安 716000

Abstract: Text clustering is a core technique in unsupervised learning that automatically partitions large text collections into clusters of semantically similar documents. In recent years, deep learning-based text clustering has developed rapidly, and research has shifted toward using advanced deep learning architectures to extract text features efficiently and thereby improve clustering accuracy. In particular, clustering strategies built on large pre-trained language models such as RoBERTa and GPT have shown strong performance, owing to the quality of their pre-trained feature representations. Using examples and data, this paper reviews the development, current progress, and task characteristics of text clustering, and presents its latest trends and its impact on data mining. It proposes a taxonomy of text clustering models organized by the features of their deep learning architectures, dividing models according to their core mechanisms and feature extraction paths in the clustering task. The review spans traditional clustering algorithms through recent techniques, including K-means, spectral clustering, autoencoders, generative models, graph convolutional networks, and large language models, and analyzes the implementation details of each. Finally, the paper analyzes the advantages and limitations of existing methods and discusses potential directions for future research.

Key words: feature representation, text clustering, deep learning, large language models
