计算机科学与探索 ›› 2012, Vol. 6 ›› Issue (9): 829-843.DOI: 10.3778/j.issn.1673-9418.2012.09.007

• 学术研究 • 上一篇    下一篇

TDCOL:列式存储的XML关键字查询处理策略

周军锋+,田姗姗,蓝国翔,陈子阳,郭景峰   

  1. 燕山大学 信息科学与工程学院, 河北 秦皇岛 066004
  • 出版日期:2012-09-01 发布日期:2012-09-03

TDCOL: XML Keyword Query Processing Strategy Based on Column Storage

ZHOU Junfeng+, TIAN Shanshan, LAN Guoxiang, CHEN Ziyang, GUO Jingfeng   

  1. School of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei 066004, China
  • Online:2012-09-01 Published:2012-09-03

摘要: 针对已有方法在XML数据上基于SLCA(smallest lowest common ancestor)语义处理查询时存在的冗余计算问题,提出了一种基于列存储的倒排索引CList,用于避免已有方法的倒排表中相同数据重复存储的问题。基于CList,提出了一种自顶向下的查询处理算法TDCOL(top-down SLCA computation based on column storage)来提升系统的处理性能。对于给定查询Q={k1, k2, ..., km}的每个公共祖先结点, TDCOL在保证仅处理一次的情况下即可得到所有满足条件的结果, 因而将时间复杂度降为[O(m×|LID1|×lb|Skmaxch(v)|)],其中[|LID1|]是Q的最短倒排表中包含的不同ID值的数目,[Skmaxch(v)]是所有被处理结点的包含关键字的孩子结点集中的最大集合。最后通过比较各种指标,从不同角度对TDCOL算法的性能优势进行了验证。

关键词: 可扩展标记语言(XML), 关键字查询, 列存储

Abstract: Considering that existing methods suffer from redundant computation when processing XML keyword queries based on SLCA (smallest lowest common ancestor) semantics, this paper proposes a new inverted list based on column storage, namely CList, to avoid the problem of repeatedly storing the same value in inverted lists. Based on CList, the paper proposes an efficient algorithm, i.e., TDCOL (top-down SLCA computation based on column storage), which processes all node IDs in CList in a top-down way to accelerate the overall performance. For a given keyword query Q={k1, k2, ..., km} and each of its common ancestor node, TDCOL processes it just once to get all qualified results, thus can reduce the time complexity to [O(m×|LID1|×lb|Skmaxch(v)|),] where [|LID1|] is the number of distinct IDs in the shortest inverted list of Q, while [Skmaxch(v)] is the child set of v, which has the largest number of child nodes among all processed nodes that contain some keywords of Q. The experimental results demonstrate the performance benefits of the proposed method in adding keyword search on XML data.

Key words: extensible markup language (XML), keyword search, column storage