Journal of Frontiers of Computer Science and Technology ›› 2014, Vol. 8 ›› Issue (5): 630-639. DOI: 10.3778/j.issn.1673-9418.1312055

• Artificial Intelligence and Pattern Recognition •

Topic Model Combining DSTM and USTM Methods

JIANG Yuyan+, LI Ping, WANG Qing, LI Changxun

  1. School of Management Science and Engineering, Anhui University of Technology, Ma’anshan, Anhui 243002, China
  • Online: 2014-05-01 Published: 2014-05-05

Abstract: Most supervised and semi-supervised latent Dirichlet allocation (LDA) models incorporate metadata in either a DSTM (downstream supervised topic model) or a USTM (upstream supervised topic model) fashion, which gives them strong topic extraction and dimensionality reduction capabilities. However, these models cannot handle academic documents that carry more than one kind of metadata. Building on a study of LDA and its extensions, this paper proposes a probabilistic topic model that combines DSTM and USTM, the author & reference topic (ART) model. ART models the generation of a document's authors in the USTM manner and the generation of its references in the DSTM manner, so it can analyze documents that contain both author and reference information. In the experiments, the Stochastic EM Sampling method is used to learn the model parameters, and the results are compared with those of the Labeled LDA and DMR models. The experimental results show that the ART model not only extracts topics and clusters documents efficiently, but also performs well at identifying document authors and ranking references.

Key words: latent Dirichlet allocation (LDA), supervised topic model, document clustering, author prediction
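
The distinction between upstream and downstream supervision that the abstract relies on can be illustrated with a toy generative pass. The sketch below is not the ART model itself: the abstract does not give its parameterization, so the log-linear author prior, the way references are tied to word topics, and every name (`generate_document`, `author_weights`, `topic_ref`, and so on) are assumptions made purely for illustration. It only shows the general pattern in which observed authors shape the topic prior before topics are drawn (upstream), while references are emitted from the topics after they are drawn (downstream).

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw one index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(author_ids, author_weights, topic_word, topic_ref,
                      n_words, n_refs, rng):
    """Toy generative pass (hypothetical, not the paper's specification).

    author_weights[a][k]: assumed log-weight of topic k for author a.
    topic_word[k]: word distribution of topic k.
    topic_ref[k]: reference distribution of topic k.
    """
    n_topics = len(topic_word)
    # Upstream step: the observed authors determine the Dirichlet prior
    # over topic proportions (assumed log-linear form).
    alpha = [math.exp(sum(author_weights[a][k] for a in author_ids))
             for k in range(n_topics)]
    theta = sample_dirichlet(alpha, rng)
    # Words are drawn from topics, as in plain LDA.
    word_topics = [sample_index(theta, rng) for _ in range(n_words)]
    words = [sample_index(topic_word[z], rng) for z in word_topics]
    # Downstream step: each reference is generated from the topic of a
    # randomly chosen word (an assumption made for this sketch).
    refs = [sample_index(topic_ref[rng.choice(word_topics)], rng)
            for _ in range(n_refs)]
    return theta, words, refs


# Two topics, four vocabulary words, two candidate references, two authors.
rng = random.Random(0)
author_weights = [[1.0, -1.0], [-1.0, 1.0]]
topic_word = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
topic_ref = [[0.9, 0.1], [0.1, 0.9]]
theta, words, refs = generate_document(
    [0], author_weights, topic_word, topic_ref, n_words=20, n_refs=3, rng=rng)
```

In this sketch, changing `author_ids` changes the prior `alpha` and therefore the topics a document tends to use, whereas the references have no influence on `theta` during generation and would only inform it through inference.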