计算机科学与探索 (Journal of Frontiers of Computer Science and Technology), 2023, Vol. 17, Issue 8: 1928-1937. DOI: 10.3778/j.issn.1673-9418.2203064

• Artificial Intelligence · Pattern Recognition •

用户意图补充的半监督深度文本聚类

李静楠,黄瑞章,任丽娜   

  1. 贵州大学 公共大数据国家重点实验室,贵阳 550025
  2. 贵州大学 计算机科学与技术学院,贵阳 550025
  • 出版日期:2023-08-01 发布日期:2023-08-01

Semi-supervised Deep Document Clustering Model with Supplemented User Intention

LI Jingnan, HUANG Ruizhang, REN Lina   

  1. State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China
  2. College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
  • Online: 2023-08-01 Published: 2023-08-01

摘要: 传统的文本聚类算法通过衡量文本间相似度对数据样本进行类簇划分,但无法根据用户给定的少量监督信息挖掘用户对聚类结果的主观意图。随着应用场景的多样化发展,同一数据集在不同的用户意图指导下聚类结果可能不唯一,如何得到遵循用户意图的聚类结果是当前研究的问题之一;同时,用户给定的监督信息是少量的,如何根据少量的监督信息最大程度地学习到用户的聚类意图,是研究的另一问题。为此,提出一种挖掘和补充用户意图的半监督深度文本聚类模型(SDDCS)。SDDCS根据用户给定的监督信息,构造意图矩阵挖掘用户意图;根据矩阵分解与补充算法对意图矩阵中的未知元素进行补充,进而最大程度地学习到用户意图。利用补充后的意图矩阵指导文本聚类过程,将用户意图作为聚类依据之一,最终得到符合用户意图的聚类结果。在四个公开文本数据集上的实验表明,SDDCS的聚类性能更高,其有效性得到了证明。

关键词: 意图, 矩阵补充, 半监督, 文本聚类

Abstract: Traditional document clustering algorithms partition data samples into clusters by measuring the similarity between documents, but they cannot mine a user's subjective intention for the clustering result from the small amount of supervision information the user provides. As application scenarios diversify, the same dataset may yield different clustering results under different user intentions, so obtaining a clustering result that follows the user's intention is one open problem. In addition, the supervision information given by the user is scarce, and learning the user's clustering intention as fully as possible from it is another. To address both problems, a semi-supervised deep document clustering model with supplemented user intention (SDDCS) is proposed. SDDCS constructs an intention matrix from the user-provided supervision information to mine the user's intention, then fills in the unknown elements of this matrix with a matrix factorization and supplement algorithm so that the intention is learned as fully as possible. The supplemented intention matrix guides the document clustering process, with the user's intention serving as one of the bases for clustering, which yields clustering results consistent with that intention. Experiments on four public document datasets show that SDDCS achieves higher clustering performance, which demonstrates its effectiveness.

Key words: intention, matrix supplement, semi-supervised, document clustering
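
The abstract describes two steps that a small example may make more concrete: building an intention matrix from sparse supervision, and filling in its unknown entries by matrix factorization. The sketch below is not the authors' implementation. It assumes, purely for illustration, that supervision arrives as must-link and cannot-link document pairs, that the intention matrix is a symmetric pairwise matrix (1 for same cluster, 0 for different cluster, NaN for unknown), and that the unknown entries are filled by a low-rank fit `U @ U.T` trained with plain gradient descent. The names `build_intention_matrix` and `complete_matrix` and all hyperparameters are hypothetical.

```python
import numpy as np


def build_intention_matrix(n_docs, must_link, cannot_link):
    """Pairwise matrix: 1.0 = same cluster, 0.0 = different, NaN = unknown."""
    m = np.full((n_docs, n_docs), np.nan)
    np.fill_diagonal(m, 1.0)  # every document shares a cluster with itself
    for i, j in must_link:
        m[i, j] = m[j, i] = 1.0
    for i, j in cannot_link:
        m[i, j] = m[j, i] = 0.0
    return m


def complete_matrix(m, rank=2, steps=5000, lr=0.05, reg=0.01, seed=0):
    """Fill NaN entries with a symmetric low-rank fit U @ U.T (assumed method)."""
    rng = np.random.default_rng(seed)
    known = ~np.isnan(m)
    target = np.where(known, m, 0.0)
    u = rng.normal(scale=0.1, size=(m.shape[0], rank))
    for _ in range(steps):
        # Error on known entries only; unknown entries contribute nothing.
        residual = known * (u @ u.T - target)
        # Descent step on the masked squared error, plus a small weight decay.
        u -= lr * (2.0 * residual @ u + reg * u)
    completed = u @ u.T
    completed[known] = m[known]  # keep the user's labels exactly as given
    return completed


# Toy example: documents 0-2 and 3-5 are chained by must-link pairs,
# and a single cannot-link pair separates the two groups.
intent = build_intention_matrix(
    6, must_link=[(0, 1), (1, 2), (3, 4), (4, 5)], cannot_link=[(0, 3)]
)
completed = complete_matrix(intent)
print(np.round(completed, 2))
```

In this toy case the fit is expected to push unlabeled within-group pairs such as (0, 2) toward 1 and unlabeled cross-group pairs such as (2, 5) toward 0, which is the sense in which a few labels can be propagated to the pairs the user never labeled. In SDDCS the supplemented matrix then guides a deep clustering model, a step this sketch does not attempt to reproduce.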