Journal of Frontiers of Computer Science and Technology ›› 2012, Vol. 6 ›› Issue (9): 829-843.DOI: 10.3778/j.issn.1673-9418.2012.09.007

Previous Articles     Next Articles

TDCOL: XML Keyword Query Processing Strategy Based on Column Storage

ZHOU Junfeng+, TIAN Shanshan, LAN Guoxiang, CHEN Ziyang, GUO Jingfeng   

  1. School of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei 066004, China
  • Online:2012-09-01 Published:2012-09-03



  1. 燕山大学 信息科学与工程学院, 河北 秦皇岛 066004

Abstract: Considering that existing methods suffer from redundant computation when processing XML keyword queries based on SLCA (smallest lowest common ancestor) semantics, this paper proposes a new inverted list based on column storage, namely CList, to avoid the problem of repeatedly storing the same value in inverted lists. Based on CList, the paper proposes an efficient algorithm, i.e., TDCOL (top-down SLCA computation based on column storage), which processes all node IDs in CList in a top-down way to accelerate the overall performance. For a given keyword query Q={k1, k2, ..., km} and each of its common ancestor node, TDCOL processes it just once to get all qualified results, thus can reduce the time complexity to [O(m×|LID1|×lb|Skmaxch(v)|),] where [|LID1|] is the number of distinct IDs in the shortest inverted list of Q, while [Skmaxch(v)] is the child set of v, which has the largest number of child nodes among all processed nodes that contain some keywords of Q. The experimental results demonstrate the performance benefits of the proposed method in adding keyword search on XML data.

Key words: extensible markup language (XML), keyword search, column storage

摘要: 针对已有方法在XML数据上基于SLCA(smallest lowest common ancestor)语义处理查询时存在的冗余计算问题,提出了一种基于列存储的倒排索引CList,用于避免已有方法的倒排表中相同数据重复存储的问题。基于CList,提出了一种自顶向下的查询处理算法TDCOL(top-down SLCA computation based on column storage)来提升系统的处理性能。对于给定查询Q={k1, k2, ..., km}的每个公共祖先结点, TDCOL在保证仅处理一次的情况下即可得到所有满足条件的结果, 因而将时间复杂度降为[O(m×|LID1|×lb|Skmaxch(v)|)],其中[|LID1|]是Q的最短倒排表中包含的不同ID值的数目,[Skmaxch(v)]是所有被处理结点的包含关键字的孩子结点集中的最大集合。最后通过比较各种指标,从不同角度对TDCOL算法的性能优势进行了验证。

关键词: 可扩展标记语言(XML), 关键字查询, 列存储