Journal of Frontiers of Computer Science and Technology ›› 2018, Vol. 12 ›› Issue (7): 1036-1046. DOI: 10.3778/j.issn.1673-9418.1709034

• Academic Research •


Design and Development of Partitional Topic Model

ZHOU Kaiwen, YANG Zhihui, MA Huixin, HE Zhenying, JING Yinan, WANG X. Sean   

  1. School of Computer Science, Fudan University, Shanghai 201203, China
  • Online:2018-07-01 Published:2018-07-06


Abstract:

Topic models are widely used to analyze documents in data mining, and LDA (latent Dirichlet allocation), as a simple and easy-to-use topic model, has received much attention. However, LDA assumes that each document is generated by an independent process, neglecting the connections between documents. By modeling these connections from a generative perspective, this paper develops a new topic model based on LDA, named DbLDA (LDA over text database). DbLDA models specific partitions of a text database (e.g., by time or location) and exploits the commonalities within each subset, making it more expressive than the original LDA. Because the DbLDA model is complex, this paper uses a partially collapsed variational Bayesian method for model inference, which speeds up training. In experiments, DbLDA and LDA are trained and tested on a news database, and the results show that DbLDA achieves better model performance than LDA.

Key words: topic model, data mining, text database