Journal of Frontiers of Computer Science and Technology ›› 2013, Vol. 7 ›› Issue (12): 1135-1145. DOI: 10.3778/j.issn.1673-9418.1305045

• Academic Research •

Research on Topic-Oriented Authoritative Information Search Technology in Microblog Sites

杨  平+,王  丹,赵文兵   

  1. College of Computer Science, Beijing University of Technology, Beijing 100124
  • Publication date: 2013-12-01 Release date: 2013-12-03

Research on Topic-Oriented Authoritative Information Retrieval Model in Microblog Site

YANG Ping+, WANG Dan, ZHAO Wenbing   

  1. College of Computer Science and Technology, Beijing University of Technology, Beijing 100124, China
  • Online: 2013-12-01 Published: 2013-12-03

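Abstract: To address the sparsity and timeliness of microblog information, this paper studies topic-oriented authoritative information search in microblog sites. Extracting latent topics from microblogs alleviates the data sparsity of microblog text. A two-stage clustering algorithm groups the information in a microblog site by topic, which reduces search time. The paper also proposes a ranking model for topic-oriented authoritative information in microblog sites: it combines the pseudo-relevance feedback technique of the KL-divergence language model with a time factor to rank microblog posts, and expands the query using the posts with high retweet counts among the first page of results from the initial retrieval. Experiments on a real Sina Weibo dataset show that the proposed latent topic model handles the sparsity of microblog data well, and that the authoritative information ranking model searches microblog sites more effectively than the other ranking algorithms compared.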

Keywords: microblog site, latent topic, clustering, authoritative information

Abstract: Addressing the inherent sparsity and strong timeliness of microblog data, this paper studies the retrieval of topic-oriented authoritative information in microblog sites. First, it presents a method for extracting the implicit themes of microblogs, which eases the sparsity problem of short microblog texts. Second, it applies a two-stage clustering algorithm to group the information in a microblog site by topic, which shortens search time. Finally, it proposes a ranking model for microblog sites that combines the pseudo-relevance feedback technique of the KL-divergence language model with a time factor, and uses the microblog posts with high retweet counts on the first page of initial results to perform query expansion. Experimental results on real datasets from Sina microblog demonstrate that the proposed implicit theme model considerably alleviates the data sparseness problem, and that the ranking model for authoritative information performs better in real-time information search.

Key words: microblog site, implicit theme, clustering, authoritative information
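
The abstract names the ingredients of the ranking model (KL-divergence language model scoring, pseudo-relevance feedback from highly retweeted first-page posts, and a time factor) but does not give its formulas. The following is only a minimal sketch of how such ingredients might fit together, not the authors' model. Everything specific in it is an assumption: the Dirichlet smoothing parameter `mu`, the exponential half-life form of the time factor, the log-linear weight `beta`, the interpolation weight `alpha`, and the hypothetical post fields `tokens`, `age_hours`, and `retweets`.

```python
import math
from collections import Counter


def collection_model(posts):
    """Maximum-likelihood unigram model over all posts (used for smoothing)."""
    counts = Counter(tok for post in posts for tok in post["tokens"])
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def query_model(tokens):
    """Maximum-likelihood unigram model of the query."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def kl_score(q_model, doc_tokens, coll, mu=100.0):
    """Negative cross-entropy of the query model under a Dirichlet-smoothed
    document model; it orders documents the same way as -KL(query || doc)."""
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for word, p_q in q_model.items():
        p_c = coll.get(word, 1e-9)  # assumed floor for words unseen in the collection
        p_d = (doc_counts[word] + mu * p_c) / (doc_len + mu)
        score += p_q * math.log(p_d)
    return score


def log_time_factor(age_hours, half_life_hours=24.0):
    """Log of an assumed exponential decay: the factor halves every half-life."""
    return -math.log(2.0) * age_hours / half_life_hours


def combined_score(q_model, post, coll, beta=1.0):
    """Assumed log-linear combination of topical relevance and recency."""
    return kl_score(q_model, post["tokens"], coll) + beta * log_time_factor(post["age_hours"])


def expand_query(q_model, feedback_posts, alpha=0.5, top_k=5):
    """Interpolate the query model with the top_k most frequent feedback terms."""
    fb_counts = Counter(tok for post in feedback_posts for tok in post["tokens"])
    top = fb_counts.most_common(top_k)
    total = sum(c for _, c in top)
    if total == 0:
        return dict(q_model)
    fb_model = {w: c / total for w, c in top}
    words = set(q_model) | set(fb_model)
    return {w: (1 - alpha) * q_model.get(w, 0.0) + alpha * fb_model.get(w, 0.0)
            for w in words}


def rank(query_tokens, posts, page_size=10, n_feedback=3):
    """Two-pass ranking: initial retrieval, feedback from the most retweeted
    posts on the first page, then re-ranking with the expanded query."""
    coll = collection_model(posts)
    q = query_model(query_tokens)
    first_pass = sorted(posts, key=lambda p: combined_score(q, p, coll), reverse=True)
    first_page = first_pass[:page_size]
    feedback = sorted(first_page, key=lambda p: p["retweets"], reverse=True)[:n_feedback]
    q_expanded = expand_query(q, feedback)
    return sorted(posts, key=lambda p: combined_score(q_expanded, p, coll), reverse=True)
```

Each post here is a plain dictionary such as `{"id": "a", "tokens": ["earthquake", "relief"], "age_hours": 1.0, "retweets": 50}`, and `rank(["earthquake"], posts)` returns the posts ordered by the combined score. Adding the log of the time factor to the language-model score is one simple way to let recency break ties between equally relevant posts; the paper may weight the two differently.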