Semi-supervised Deep Document Clustering Model with Supplemented User Intention

doi:10.3778/j.issn.1673-9418.2203064

Abstract

Traditional document clustering algorithms partition samples into clusters by measuring the similarity between documents, but they cannot use the small amount of supervision information a user provides to infer the user's subjective intention for the clustering result. As application scenarios diversify, the same dataset may admit different clusterings under different user intentions, so obtaining a clustering that follows the user's intention is one open problem. A second problem is that user-provided supervision is scarce, and the model must learn as much of the user's clustering intention as possible from it. To address both problems, a semi-supervised deep document clustering model with supplemented intention (SDDCS) is proposed. SDDCS constructs an intention matrix from the supervision information given by the user in order to mine the user's intention, then fills in the unknown elements of that matrix with a matrix factorization and completion algorithm so that the intention is learned as fully as possible. The completed intention matrix guides the document clustering process, with user intention serving as one of the clustering criteria, and yields clustering results consistent with that intention. Experiments on four public document datasets show that SDDCS achieves higher clustering performance, which demonstrates its effectiveness.

Key words: intention, matrix completion, semi-supervised, document clustering
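
The abstract does not specify how the intention matrix is encoded or which factorization is used, so the following is only a minimal sketch under stated assumptions, not the SDDCS implementation. It assumes the supervision arrives as pairwise must-link and cannot-link constraints, encodes them in a symmetric document-by-document matrix, and fills the unknown entries with a symmetric low-rank factorization fitted by gradient descent on the known entries only. The function names, the 1/0 encoding, and all hyperparameters are hypothetical.

```python
import numpy as np


def build_intention_matrix(n, must_link, cannot_link):
    """Encode pairwise constraints (an assumed form of supervision).

    Returns (M, mask): M[i, j] is 1 for must-link pairs and 0 for
    cannot-link pairs; mask[i, j] is True where the entry is known.
    """
    M = np.zeros((n, n))
    mask = np.zeros((n, n), dtype=bool)
    idx = np.arange(n)
    M[idx, idx] = 1.0          # every document agrees with itself
    mask[idx, idx] = True
    for i, j in must_link:
        M[i, j] = M[j, i] = 1.0
        mask[i, j] = mask[j, i] = True
    for i, j in cannot_link:
        M[i, j] = M[j, i] = 0.0
        mask[i, j] = mask[j, i] = True
    return M, mask


def complete_matrix(M, mask, rank=2, lr=0.02, reg=0.01, steps=5000, seed=0):
    """Fill unknown entries with a symmetric low-rank model U @ U.T.

    Minimizes 0.5 * ||mask * (U U^T - M)||_F^2 + 0.5 * reg * ||U||_F^2
    by plain gradient descent (a generic technique, assumed here).
    """
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((M.shape[0], rank))
    for _ in range(steps):
        residual = (U @ U.T - M) * mask      # error on known entries only
        grad = 2.0 * residual @ U + reg * U  # residual is symmetric
        U -= lr * grad
    return U @ U.T


# Hypothetical example: six documents, two intended groups {0,1,2} and {3,4,5}.
M, mask = build_intention_matrix(
    6, must_link=[(0, 1), (1, 2), (3, 4), (4, 5)], cannot_link=[(0, 3)]
)
completed = complete_matrix(M, mask)
print(np.round(completed, 2))
```

In this sketch the completed matrix assigns a score to pairs the user never labelled, such as documents 0 and 2, and those scores could then act as an additional pairwise signal during clustering. How SDDCS actually couples the matrix to the deep clustering objective is not stated in the abstract.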

LI Jingnan, HUANG Ruizhang, REN Lina. Semi-supervised Deep Document Clustering Model with Supplemented User Intention[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(8): 1928-1937.
