计算机科学与探索 (Journal of Frontiers of Computer Science and Technology), 2023, Vol. 17, Issue 12: 2861-2879. DOI: 10.3778/j.issn.1673-9418.2303083
HE Dongbin, TAO Sha, ZHU Yanhong, REN Yanzhao, CHU Yunxia
Online: 2023-12-01
Published: 2023-12-01
Abstract: Topic models are commonly used to model unstructured corpora and discrete data by extracting latent topic distributions. Because the discovered topics are presented as word lists, their meaning can be hard to grasp. Manual annotation can produce more interpretable and understandable topic labels, but its cost makes it impractical, and research on automatic topic labeling offers methods and ideas for solving this problem. This survey first describes and analyzes latent Dirichlet allocation (LDA), currently the most popular topic model, and divides topic labeling methods into three types according to the form the labels take: phrases, summaries, or images. It then reviews, analyzes, and summarizes recent research aimed at improving topic interpretability, organized by the type of label generated, and discusses the applicable scenarios and usability of each label type. Methods are further classified by their characteristics, with an emphasis on quantitative and qualitative analysis of summary-style labels generated by lexical, submodular-optimization, and graph-ranking approaches, comparing the methods in terms of learning type, techniques used, and data sources. Finally, open problems and trends in automatic topic labeling are discussed: approaches based on deep learning, integration with sentiment analysis, and continued expansion of application scenarios will be the focus and direction of future work.
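To make the graph-ranking family of methods mentioned in the abstract concrete, the following is a minimal, self-contained sketch: candidate sentences are connected by word-overlap similarity (in the style of TextRank) and scored with a topic-biased PageRank, so that the highest-ranked sentence can serve as a summary-style topic label. The function names and toy data are illustrative assumptions, not code from any surveyed system.

```python
import math

def overlap_similarity(tokens_a, tokens_b):
    """Word-overlap similarity as used in TextRank (Mihalcea & Tarau, 2004)."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    shared = len(set_a & set_b)
    if shared == 0:
        return 0.0
    return shared / (math.log(len(set_a) + 1) + math.log(len(set_b) + 1))

def rank_label_sentences(sentences, topic_words, damping=0.85, iters=50):
    """Score candidate sentences with a topic-biased PageRank over the
    sentence-similarity graph; return the sentences sorted best-first."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # Edge weights: pairwise word-overlap similarity (no self-loops).
    sim = [[overlap_similarity(tokenized[i], tokenized[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    # Teleport distribution biased toward sentences containing topic words,
    # so the ranking prefers sentences relevant to the topic's top terms.
    topic = set(topic_words)
    bias = [1.0 + sum(1 for w in toks if w in topic) for toks in tokenized]
    total_bias = sum(bias)
    bias = [b / total_bias for b in bias]
    scores = [1.0 / n] * n
    for _ in range(iters):
        new_scores = []
        for i in range(n):
            incoming = sum(scores[j] * sim[j][i] / max(sum(sim[j]), 1e-12)
                           for j in range(n) if sim[j][i] > 0)
            new_scores.append((1 - damping) * bias[i] + damping * incoming)
        scores = new_scores
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in order]
```

In a real system the candidate sentences would come from the documents most strongly associated with the topic, and the bias term would typically weight words by their probability under the topic rather than counting raw matches.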
HE Dongbin, TAO Sha, ZHU Yanhong, REN Yanzhao, CHU Yunxia. Survey of Automatic Labeling Methods for Topic Models[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(12): 2861-2879.