Journal of Frontiers of Computer Science and Technology ›› 2014, Vol. 8 ›› Issue (8): 919-932. DOI: 10.3778/j.issn.1673-9418.1403053

• Database Technology •

Efficient Top-k Similar Short Texts Extraction Algorithm

GU Yanhui1, ZHAO Bin1, ZHOU Junsheng1, QU Weiguang1,2+

  1. School of Computer Science and Technology, Nanjing Normal University, Nanjing 210023, China
  2. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
  • Online: 2014-08-01 Published: 2014-08-07

Abstract: Extracting similar short texts efficiently is an essential research problem for many applications. Efficiency matters especially in big-data settings, yet most existing strategies focus on effectiveness, and state-of-the-art approaches cannot meet users' performance requirements. This paper addresses the efficiency of similar short text extraction, i.e., how to quickly retrieve the top-k short texts most semantically similar to a query from a given collection of short texts. Building on an effective basic framework, it proposes an efficient strategy to meet these requirements. Experimental results show that the proposed strategy substantially improves extraction efficiency while preserving effectiveness, and that it outperforms existing methods in efficiency.

Key words: semantic similarity, top-k, rank aggregation