Journal of Frontiers of Computer Science and Technology, 2020, Vol. 14, Issue (4): 578-589. DOI: 10.3778/j.issn.1673-9418.1912037

• Academic Research •

Research on Text Snippet Mechanism in Document Retrieval

LI Yu (李宇), LIU Bo (刘波)

  1. Department of Computer Science, College of Information Science and Technology, Jinan University, Guangzhou 510632, China
  • Online: 2020-04-01 Published: 2020-04-10

Abstract:

Document retrieval is an active research topic in natural language processing. Compared with short texts, documents are information-rich and lengthy. In long-text retrieval, a query is often relevant to only some of the sentences in a document, and a few highly similar segments can introduce strong interference. Therefore, the relevance score between a query and a document cannot be computed simply from the similarity between words or strings. This paper proposes a text snippet mechanism (TSM) for document retrieval. TSM first divides each candidate document into snippets and then computes the relevance between the query and each snippet, using a matching scheme that takes factors such as semantics and word frequency into account. It then selects the key snippets, derives the relevant snippet ratio, and combines this snippet information into a relevance score between the query and the document, from which the Top-K document set is obtained. Experimental results on the Glasgow IR test collections show that text matching with TSM can improve information retrieval performance.

Key words: text snippet mechanism, document retrieval, correlation calculation, relevant snippet ratio, snippet integration score
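
The pipeline outlined in the abstract (split into snippets, score each snippet against the query, keep the key snippets, derive the relevant snippet ratio, integrate into a document score, take the Top-K) can be illustrated with a minimal sketch. The abstract gives no formulas, so every concrete choice here is an assumption rather than the authors' method: fixed-size non-overlapping token windows, a plain term-frequency cosine standing in for the paper's matching scheme (semantic matching is not modeled), a fixed relevance threshold, and a linear blend controlled by a hypothetical weight `alpha`. All function and parameter names are illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_into_snippets(tokens, size=20):
    """Assumption: snippets are fixed-size, non-overlapping token windows."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def cosine_tf(query_tokens, snippet_tokens):
    """Term-frequency cosine similarity (a stand-in, not the paper's scheme)."""
    q, s = Counter(query_tokens), Counter(snippet_tokens)
    dot = sum(q[t] * s[t] for t in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_s = math.sqrt(sum(v * v for v in s.values()))
    return dot / (norm_q * norm_s) if norm_q and norm_s else 0.0


def score_document(query, document, size=20, threshold=0.2, alpha=0.7):
    """Return (document_score, relevant_snippet_ratio) for one document."""
    q_tokens = tokenize(query)
    snippets = split_into_snippets(tokenize(document), size)
    if not snippets:
        return 0.0, 0.0
    scores = [cosine_tf(q_tokens, s) for s in snippets]
    # Key snippets: those whose score reaches the (assumed) threshold.
    key_scores = [x for x in scores if x >= threshold]
    if not key_scores:
        return 0.0, 0.0
    ratio = len(key_scores) / len(snippets)
    # Hypothetical integration rule: blend mean key-snippet score with the ratio.
    mean_key = sum(key_scores) / len(key_scores)
    return alpha * mean_key + (1 - alpha) * ratio, ratio


def top_k(query, documents, k=10):
    """Rank a {doc_id: text} mapping by document score and keep the best k."""
    ranked = sorted(
        ((doc_id, score_document(query, text)[0]) for doc_id, text in documents.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]
```

Calling `top_k("snippet retrieval", {"d1": "...", "d2": "..."}, k=1)` returns a list of `(doc_id, score)` pairs in descending score order. Scoring per snippet rather than per document means that one highly similar passage in an otherwise unrelated long text contributes only through its share of the snippets, which is the interference problem the abstract describes.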