计算机科学与探索 ›› 2011, Vol. 5 ›› Issue (9): 769-780.

• 学术研究 • 上一篇    下一篇

支持局部最优化匹配的近似子串查询算法

<SPAN style=

  

  1. 东北大学 信息科学与工程学院, 沈阳 110819

  • 收稿日期:1900-01-01 修回日期:1900-01-01

<SPAN lang=EN-US style=

LIU Honglei, YANG Xiaochun+, WANG Bin, JIN Rong

  

  1. College of Information Science and Engineering, Northeastern University, Shenyang 110819, China
  • Received:1900-01-01 Revised:1900-01-01

摘要: 基于编辑距离的字符串近似查询算法一般是先给定阈值k, 然后计算那些与查询串的编辑距离小于或等于k的结果。但是对于近似子串查询, 结果中有很多是交叠的, 并且是无意义的, 于是提出了一种局部最优化匹配的概念, 只计算那些符合阈值条件, 并且是局部最优的结果, 这样不仅避免了结果的交叠, 而且极大节省了时间开销。给出了支持局部最优化匹配的近似子串查询的定义, 相应提出了一种基于gram索引的局部最优化近似子串查询算法, 分析了子串近似匹配过程中的规律, 研究了基于局部最优化匹配的边界限定和过滤策略, 给出了一种过滤优化的局部最优化近似子串查询算法, 提高了查询效率。

关键词: 子串查询, 模糊查询, gram索引, 编辑距离

Abstract: The algorithms of approximate string query based on edit distance have usually a given threshold k, and report those strings whose edit distances with the query are not bigger than k. However, when talking about approximate substring query, many answers got in this way have overlap, thus meaningless. So this paper proposes a concept of local optimal matching only computing those answers which are both not higher than k and local optimal, thus eliminating the overlapped answers and cutting the time cost. It also presents a definition of approximate substring query supporting local optimal matching along with a query algorithm based on gram index. Moreover, it analyzes the generality of the matching process, and studies the methods of filtering and limiting the boundary based on local optimal matching. Finally, it proposes an algorithm of local optimal approximate substring query with filtering, so that the efficiency of matching is promoted.

Key words: substring query, fuzzy query, gram index, edit distance