计算机科学与探索 ›› 2016, Vol. 10 ›› Issue (3): 326-337.DOI: 10.3778/j.issn.1673-9418.1507070

• 数据库技术 • 上一篇    下一篇

利用关键因子过滤的正则表达式匹配算法

邱  涛+,王  斌,杨晓春   

  1. 东北大学 信息科学与工程学院,沈阳 110819
  • 出版日期:2016-03-01 发布日期:2016-03-11

Regular Expression Matching Algorithm Using Pivotal Factors

QIU Tao+, WANG Bin, YANG Xiaochun   

  1. College of Information Science and Engineering, Northeastern University, Shenyang 110819, China
  • Online:2016-03-01 Published:2016-03-11

摘要: 正则表达式(regular expression,RE)是一种能够提供复杂查询能力的技术,其通过特定的语法结构来描述一类文本的共同特征。正则表达式强大的表达能力和简洁的语法,使得其在各个领域都被广泛地应用。为了提高正则表达式的匹配效率,提出了一种利用关键因子进行过滤的匹配技术,关键因子指的是在文本中具有最小出现频率的有效过滤因子。由于实际文本中字符并不是均匀分布的,子串在文本中出现频率的差异将影响过滤因子的过滤能力。通过考虑有效过滤因子在文本中出现的频率,关键因子能获得更好的过滤能力。提出了利用正则表达式的划分来求取关键因子的算法,进而通过关键因子来过滤候选位置。通过在真实的蛋白序列和英文文本上进行实验,说明了基于关键因子过滤的匹配方法可以有效地提升正则表达式的匹配性能。

关键词: 正则表达式, 过滤策略, 算法性能, 关键因子

Abstract: Regular expression (RE) can describe complicated query, it depicts the common features of a set of text by the specific syntax. Since regular expression has the strong ability of expression and succinct syntax, it has been applied in many applications. In order to improve the matching performance of RE, this paper proposes a novel technique that pruning false negatives by utilizing pivotal factors, which are the valid filters with the least occurrences in text. Since the characters in text are not well-distributed, the difference of occurrences count of substrings will affect the pruning power of filters. Pivotal factors can achieve better pruning power by considering the occurrences count in text, which can also improve the matching performance of RE. This paper proposes the algorithm that computing pivotal factors by the partitions of RE, then further prunes the candidate positions by pivotal factors. Comprehensive experiments are conducted on real protein sequences and English text, the results show the significant performance improvement when utilizing the technique of pivotal factors.

Key words: regular expression, filtering strategy, performance, pivotal factors