计算机科学与探索 ›› 2018, Vol. 12 ›› Issue (1): 120-133.DOI: 10.3778/j.issn.1673-9418.1609018

• 人工智能与模式识别 • 上一篇    下一篇

基于分割的字符串相似性查找算法

刘慧婷+,黄厚柱,刘志中,赵  鹏   

  1. 安徽大学 计算机科学与技术学院,合肥 230601
  • 出版日期:2018-01-01 发布日期:2018-01-09

Partition-Based Algorithms for String Similarity Search

LIU Huiting+, HUANG Houzhu, LIU Zhizhong, ZHAO Peng   

  1. School of Computer Science and Technology, Anhui University, Hefei 230601, China
  • Online:2018-01-01 Published:2018-01-09

摘要: 字符串相似性查找问题主要包括两方面,基于阈值的字符串相似性查找以及top-k字符串相似性查找。目前处理基于阈值的字符串相似性查找问题的算法多是基于过滤-验证框架的。基于该框架提出了PBsearch算法,算法在过滤阶段首次加入One-Off条件过滤掉大量的无效匹配,并在验证阶段提出了一种新的验证算法MultiThreshold算法,大大减少了计算编辑距离的次数。在top-k字符串相似性查找问题方面,提出了两种基于分割思想的算法,Pb-topk算法和PbCount-topk算法。其中,Pb-topk算法采用差值递增的策略,减少了需处理的字符串数目;PbCount-topk算法采用匹配数目划分的策略,进一步缩小了候选集的规模。最后,通过在3个真实数据集上的实验结果,验证了提出算法的高效性。

关键词: 字符串相似性查找, 阈值, top-k, 分割, 编辑距离

Abstract: The problem of string similarity search involves two aspects, threshold-based string similarity search and top-k string similarity search. At present, the methods of dealing with threshold-based string similarity search problem mostly are based on the filter-and-verification framework. Based on this framework, this paper proposes an algorithm called PBsearch. In the filtering step, it adds One-Off condition to filter a large number of invalid matches. And in the verification step, it uses a new method called MultiThreshold to greatly reduce the number of edit distance computation. To deal with top-k string similarity search problem, this paper puts forward two partition-based algorithms called Pb-topk and PbCount-topk. Pb-topk algorithm uses the increment of the difference strategy to reduce the number of strings to be processed. And PbCount-topk algorithm uses the partition of the number of matching strategy to further refine the size of the candidate set. Finally, the experimental results on three real data sets verify the efficiency of these proposed algorithms.

Key words: string similarity search, threshold, top-k, partition, edit distance