Survey of Automatic Labeling Methods for Topic Models

doi:10.3778/j.issn.1673-9418.2303083

Abstract

Abstract: Topic models are widely used to model unstructured corpora and discrete data and to extract latent topics. Because topics are generally expressed as word lists, users often find their meaning hard to understand, especially when they lack knowledge of the subject area. Manual labeling can produce more explanatory and easily understood topic labels, but its cost is too high to be feasible, so research on automatically labeling discovered topics offers a way to address the problem. This survey first elaborates and analyzes latent Dirichlet allocation (LDA), currently the most popular topic modeling technique. It then classifies topic labeling methods into three types according to the form of the label: phrases, summaries, and images. Centered on improving topic interpretability and organized by the type of label generated, the relevant research of recent years is sorted out, analyzed, and summarized, and the applicable scenarios and usability of the different label types are discussed. Methods are further categorized by their characteristics, with a focus on quantitative and qualitative analysis of summary-type topic labels generated by lexical-based, submodular optimization, and graph-based ranking methods. The methods are also compared in terms of learning type, technologies used, and data sources. Finally, the open problems and development trends of automatic topic labeling research are discussed: building on deep learning, integrating with sentiment analysis, and continually expanding the application scenarios of topic labeling are expected to be the main directions of future work.

Key words: topic model, latent Dirichlet allocation (LDA), topic labeling, topic label
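
To make the labeling problem concrete, the following is a minimal sketch of choosing a phrase label for a topic that is given as a word list. It is a toy lexical-overlap heuristic and not any specific method covered by the survey: the topic words, their probabilities, the candidate phrases, and the function names `score_label` and `rank_labels` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def score_label(label: str, topic: Dict[str, float]) -> float:
    """Sum the topic probabilities of the distinct words in a candidate label."""
    words = set(label.lower().split())
    return sum(topic.get(word, 0.0) for word in words)


def rank_labels(candidates: List[str], topic: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (label, score) pairs sorted from best to worst score."""
    scored = [(label, score_label(label, topic)) for label in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical topic: top words with made-up probabilities (assumption).
topic = {"stock": 0.09, "market": 0.08, "price": 0.05, "investor": 0.04, "trade": 0.03}

# Hypothetical candidate phrase labels (assumption).
candidates = ["stock market", "market price", "weather forecast"]

ranked = rank_labels(candidates, topic)
best_label = ranked[0][0]
print(best_label)
```

A raw word list such as "stock, market, price, investor, trade" leaves the reader to infer the theme, whereas a ranked phrase states it directly. Practical methods replace this overlap score with richer relevance measures and external candidate sources.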

HE Dongbin, TAO Sha, ZHU Yanhong, REN Yanzhao, CHU Yunxia. Survey of Automatic Labeling Methods for Topic Models[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(12): 2861-2879.

何东彬, 陶莎, 朱艳红, 任延昭, 褚云霞. 主题模型自动标记方法研究综述[J]. 计算机科学与探索, 2023, 17(12): 2861-2879.
