计算机科学与探索 ›› 2015, Vol. 9 ›› Issue (12): 1439-1449.DOI: 10.3778/j.issn.1673-9418.1505067

• 数据库技术 • 上一篇    下一篇

并行的XML数据流模式匹配算法

陈  冲+,蒋夏军,张青平   

  1. 南京航空航天大学 计算机科学与技术学院,南京 210016
  • 出版日期:2015-12-01 发布日期:2015-12-04

Parallel Matching Algorithm on XML Stream Data Pattern

CHEN Chong+, JIANG Xiajun, ZHANG Qingping   

  1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
  • Online:2015-12-01 Published:2015-12-04

摘要: 随着大数据时代的到来,大规模XML文件不断地涌现,其信息庞大,结构复杂,而传统的XML查询匹配技术需要大量的存储空间和预解析工作,不能有效解决XML大文件的匹配要求。针对这种现状,分析了现有经典匹配算法核心思想,并结合多线程并行相关知识,提出了一种新的并行的XML数据流模式匹配算法,称为并行路径流算法(parallel path stream,PPS)。该算法在以流模式顺序解析XML文件的过程中,缓存以查询模式根元素为根节点的子树,以顺序链表存储节点的编码信息,在进行有效过滤后加入任务链表中,采用独特的匹配方法并行操作任务池中的各个顺序链表后得到匹配结果。实验表明,该算法能够明显减少存储空间,其过滤过程和并行操作能够有效减少匹配时间,并在查询路径长度方面具有一定优势。

关键词: XML模式匹配, 流数据处理, 路径栈算法, 多线程

Abstract: With the arrival of the big-data era, XML files with large scale, huge information and complicated structure have emerged constantly. But the previous XML query matching algorithms can’t effectively solve the matching problem because they need lots of storage space and preliminary parse work. This paper proposes a new parallel XML stream data pattern matching algorithm named PPS (parallel path stream), which is based on the analysis of the core idea of the existing classic matching algorithm and integrates the knowledge on multi-thread parallel. In the process of sequentially parsing XML file as stream pattern, subtree, whose root is the root element of the query pattern, is stored as ordered list and is added into the task list after an effective filtration. Then the matching result can be gotten through parallel operating on the ordered list. The experimental results demonstrate that the PPS can significantly reduce the storage space, the filtration process and the parallel operation can effectively reduce the matching time, beyond that, the PPS has certain advantage in the query path length.

Key words: XML pattern matching, stream data processing, PathStack algorithm, multi-thread