Journal of Frontiers of Computer Science and Technology ›› 2023, Vol. 17 ›› Issue (12): 2861-2879. DOI: 10.3778/j.issn.1673-9418.2303083

• Frontiers·Surveys •

Survey of Automatic Labeling Methods for Topic Models

HE Dongbin, TAO Sha, ZHU Yanhong, REN Yanzhao, CHU Yunxia   

  1. IoT Security and Sensor Test Engineering Research Center of Hebei Province, Shijiazhuang University, Shijiazhuang 050035, China
  2. Key Laboratory of Agricultural Informatization Standardization, Ministry of Agriculture and Rural Affairs, China Agricultural University, Beijing 100083, China
  3. Hebei Province IOT Intelligent Perception and Application Technology Innovation Center, Shijiazhuang Posts and Telecommunications Technical College, Shijiazhuang 050021, China
  4. School of Computer and Information Engineering, Beijing Technology and Business University, Beijing 100048, China
  • Online:2023-12-01 Published:2023-12-01

Abstract: Topic models are widely used to model unstructured corpora and discrete data and to extract latent topic distributions. Because discovered topics are usually presented as word lists, users often find them hard to interpret, especially when they lack knowledge of the subject area. Manual labeling can produce more explanatory and more easily understood topic labels, but its cost is too high for it to be feasible, so research on automatically labeling discovered topics offers methods and ideas for solving this problem. First, latent Dirichlet allocation (LDA), currently the most popular topic model, is described and analyzed, and topic labeling methods are divided into three types according to the form the label takes: phrases, summaries, or images. Then, with improving topic interpretability as the central concern and the type of generated label as the organizing thread, recent research is reviewed, analyzed, and summarized, and the applicable scenarios and usability of the different label types are discussed. Methods are further categorized by their characteristics, with a focus on quantitative and qualitative analysis of summary-style topic labels generated by lexical, submodular optimization, and graph ranking methods; the methods are compared in terms of learning type, techniques used, and data sources. Finally, open problems and development trends in automatic topic labeling are discussed: deep learning-based approaches, integration with sentiment analysis, and the continued expansion of application scenarios for topic labeling are expected to be the main directions of future work.

Key words: topic model, latent Dirichlet allocation (LDA), topic labeling, topic label