Journal of Frontiers of Computer Science and Technology, 2013, Vol. 7, Issue (8): 747-753. DOI: 10.3778/j.issn.1673-9418.1305004

Extracting Overlapping Topics from Micro-Blog Based on Mixture Model

ZHAN Yong, YANG Yan+, WANG Hongjun   

  1. School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031, China
  • Online:2013-08-01 Published:2013-08-06


詹  勇,杨  燕+,王红军   

  1. 西南交通大学 信息科学与技术学院,成都 610031

Abstract: Micro-blogs are a new platform for sharing and disseminating information quickly, and they are characterized by a huge volume of scattered and diverse information. Most traditional topic extraction algorithms are partition-based and do not consider the relationships between topics, which limits them. This paper addresses the task of extracting news topics from large-scale collections of short micro-blog posts. The text is first segmented, with the characteristics of micro-blog text taken into account, using the Chinese word segmentation software developed by the Institute of Noetics and Wisdom, Southwest Jiaotong University, which offers high segmentation accuracy and ambiguity recognition. The paper then proposes an overlapping topic detection algorithm based on a mixture model. The experimental results indicate that the algorithm is feasible and effective.

Key words: micro-blog, overlapping topic detection, mixture model

摘要: 微博具有信息量庞大,信息分散多样等特点,已经成为快速分享和传播信息的新平台。传统话题发现算法大部分都是基于划分的,没有考虑话题之间的关联性,存在一定的局限性,因此研究了大规模微博文本集上的话题发现问题。采用具有分词准确率较高、歧义识别特点的西南交通大学思维与智慧研究所中文分词系统对文本进行分词处理,并提出了基于混合模型的微博交叉话题发现算法。实验结果表明,该算法具有一定可行性和有效性。

关键词: 微博, 交叉话题发现, 混合模型
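
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of overlapping topic assignment with a mixture model, not the authors' method. It assumes a mixture of unigram (multinomial) topics fitted by EM, and it lets a post belong to every topic whose posterior probability reaches a threshold. The function names `fit_mixture` and `overlapping_topics`, the `seed_docs` initialisation, the smoothing value, the threshold, and the English toy tokens (stand-ins for segmented Chinese words) are all assumptions made for illustration.

```python
import math
from collections import Counter


def fit_mixture(docs, seed_docs, iters=30, alpha=1.0):
    """EM for a mixture of unigram (multinomial) topics.

    docs      -- list of token lists (assumed to be word-segmented already)
    seed_docs -- indices of posts used to initialise one topic each
                 (a hypothetical, deterministic initialisation)
    Returns resp, where resp[d][t] is the posterior P(topic t | post d).
    """
    vocab = sorted({w for doc in docs for w in doc})
    counts = [Counter(doc) for doc in docs]
    k = len(seed_docs)

    def normalise(word_counts):
        # Additive (Laplace-style) smoothing keeps every probability positive.
        total = sum(word_counts.get(w, 0.0) + alpha for w in vocab)
        return {w: (word_counts.get(w, 0.0) + alpha) / total for w in vocab}

    weights = [1.0 / k] * k
    word_probs = [normalise(counts[s]) for s in seed_docs]
    resp = []

    for _ in range(iters):
        # E-step: posterior over topics for each post, computed in log space.
        resp = []
        for c in counts:
            log_post = [
                math.log(weights[t])
                + sum(n * math.log(word_probs[t][w]) for w, n in c.items())
                for t in range(k)
            ]
            peak = max(log_post)
            unnorm = [math.exp(lp - peak) for lp in log_post]
            z = sum(unnorm)
            resp.append([u / z for u in unnorm])

        # M-step: re-estimate mixing weights and per-topic word distributions.
        weights = [sum(r[t] for r in resp) / len(docs) for t in range(k)]
        word_probs = []
        for t in range(k):
            soft = Counter()
            for r, c in zip(resp, counts):
                for w, n in c.items():
                    soft[w] += r[t] * n
            word_probs.append(normalise(soft))
    return resp


def overlapping_topics(resp, threshold=0.3):
    """Assign each post to every topic whose posterior reaches the threshold."""
    return [[t for t, p in enumerate(r) if p >= threshold] for r in resp]


# Toy posts: three about sport, three about finance, one mixing both.
posts = [
    ["match", "goal", "team"],
    ["team", "goal", "win"],
    ["match", "team", "win"],
    ["market", "stock", "bank"],
    ["bank", "stock", "price"],
    ["market", "bank", "price"],
    ["team", "goal", "bank", "stock"],
]
resp = fit_mixture(posts, seed_docs=(0, 3))
labels = overlapping_topics(resp)
```

In this toy setting the last post draws equally on both vocabularies, so its posterior is split between the two topics and it receives both labels, whereas a partition-based method would have to pick one. The threshold is a design choice: a lower value produces more overlap, a higher value approaches a hard partition.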