Journal of Frontiers of Computer Science and Technology ›› 2017, Vol. 11 ›› Issue (5): 814-821. DOI: 10.3778/j.issn.1673-9418.1603068

• Artificial Intelligence and Pattern Recognition •

Using Topic Content Ranking for Pseudo Relevance Feedback

YAN Rong+, GAO Guanglai   

  1. College of Computer Science, Inner Mongolia University, Hohhot 010021, China
  • Online: 2017-05-01 Published: 2017-05-04

Abstract: Traditional pseudo relevance feedback (PRF) methods treat the whole document as the basic unit from which expansion terms are extracted. Because this extraction unit is coarse, it introduces additional noise into the expansion source. This paper uses topic analysis to mitigate the low quality of the expansion source. The method captures the semantic information hidden in the content of each document in the pseudo-relevant set, then extracts the abstract topic content that is relevant to the user query and uses it as the basic extraction unit for query expansion. Experiments on the NTCIR 8 Chinese corpus, comparing against traditional PRF methods and topic-model-based PRF methods, show that the proposed method extracts expansion terms that better match the user query. The results also show that performing query expansion at the finer granularity of topic content effectively improves retrieval performance.

Key words: topic model, topic content, pseudo relevance feedback (PRF)
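
The abstract describes ranking topic content by its relevance to the user query and drawing expansion terms from the best-ranked topics rather than from whole documents. The following is a minimal sketch of that general idea only, not the authors' algorithm. It assumes that some topic model has already been fitted on the pseudo-relevant set and has produced one term-probability table per topic. The function names `rank_topics` and `expansion_terms`, the smoothed log-likelihood scoring rule, and the toy probabilities are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict


def rank_topics(query_terms, topics, eps=1e-6):
    """Return topic indices ordered by smoothed query log-likelihood.

    `topics` is a list of dicts mapping term -> probability. The scoring
    rule is an assumption for this sketch, not taken from the paper.
    """
    scored = []
    for idx, dist in enumerate(topics):
        # eps keeps the logarithm finite when a query term is absent
        score = sum(math.log(dist.get(t, 0.0) + eps) for t in query_terms)
        scored.append((score, idx))
    # Highest score first; ties are broken by the lower topic index
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scored]


def expansion_terms(query_terms, topics, top_topics=1, top_terms=3):
    """Collect the heaviest non-query terms from the best-ranked topics."""
    selected = rank_topics(query_terms, topics)[:top_topics]
    weights = defaultdict(float)
    for idx in selected:
        for term, prob in topics[idx].items():
            if term not in query_terms:
                weights[term] += prob
    # Sort by descending weight, then alphabetically for determinism
    ordered = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ordered[:top_terms]]


# Toy topic-word tables with made-up numbers (hypothetical data)
topics = [
    {"earthquake": 0.30, "rescue": 0.25, "sichuan": 0.20,
     "relief": 0.15, "stock": 0.10},
    {"stock": 0.40, "market": 0.30, "bank": 0.20, "earthquake": 0.10},
]
query = ["earthquake", "sichuan"]

print(rank_topics(query, topics))
print(expansion_terms(query, topics))
```

Restricting extraction to the top-ranked topic keeps terms from the off-topic second table (for example `market`) out of the expanded query, which is the kind of noise reduction the abstract attributes to a finer extraction unit.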