TDCOL: XML Keyword Query Processing Strategy Based on Column Storage

Journal of Frontiers of Computer Science and Technology, 2012, Vol. 6, Issue 9: 829-843. DOI: 10.3778/j.issn.1673-9418.2012.09.007

ZHOU Junfeng+, TIAN Shanshan, LAN Guoxiang, CHEN Ziyang, GUO Jingfeng

School of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei 066004, China

Online: 2012-09-01 Published: 2012-09-03

Abstract

Abstract: Existing methods suffer from redundant computation when processing XML keyword queries under SLCA (smallest lowest common ancestor) semantics. This paper proposes CList, an inverted index based on column storage, which avoids storing the same value repeatedly in inverted lists. Based on CList, the paper proposes TDCOL (top-down SLCA computation based on column storage), an algorithm that processes the node IDs in CList top-down to improve overall performance. For a given keyword query $Q=\{k_1, k_2, \ldots, k_m\}$, TDCOL processes each common ancestor node only once to obtain all qualified results, which reduces the time complexity to $O(m \times |L^{ID}_1| \times \log_2 |S^{k}_{maxch}(v)|)$, where $|L^{ID}_1|$ is the number of distinct IDs in the shortest inverted list of $Q$ and $S^{k}_{maxch}(v)$ is the largest set among the sets of keyword-containing child nodes of all processed nodes. Experimental results, compared across several metrics, demonstrate the performance advantages of TDCOL for keyword search on XML data.

Key words: extensible markup language (XML), keyword search, column storage
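
To make the SLCA semantics in the abstract concrete, the following is a minimal Python sketch of a top-down SLCA computation over Dewey-labelled nodes. It is not the authors' TDCOL implementation and does not reproduce the CList layout. The function names (`prefix_set`, `children_index`, `slca_top_down`), the tuple encoding of Dewey labels, the root label `(1,)`, and the sample data are all assumptions made for illustration. The sketch only mirrors two ideas stated in the abstract: the distinct IDs of the shortest inverted list drive the traversal, and each common ancestor is visited once.

```python
from typing import Dict, List, Sequence, Set, Tuple

Dewey = Tuple[int, ...]  # assumed encoding: (1, 2, 1) is the first child of the second child of the root


def prefix_set(labels: Sequence[Dewey]) -> Set[Dewey]:
    """Distinct IDs of all nodes whose subtree contains the keyword."""
    out: Set[Dewey] = set()
    for label in labels:
        for i in range(1, len(label) + 1):
            out.add(label[:i])
    return out


def children_index(nodes: Set[Dewey]) -> Dict[Dewey, List[Dewey]]:
    """Map each node to its children that also appear in `nodes`."""
    index: Dict[Dewey, List[Dewey]] = {}
    for node in nodes:
        if len(node) > 1:
            index.setdefault(node[:-1], []).append(node)
    return index


def slca_top_down(inverted: Dict[str, List[Dewey]], query: Sequence[str]) -> List[Dewey]:
    """Return the SLCA nodes for `query`, visiting each common ancestor once."""
    if not query:
        return []
    # Order the per-keyword ID sets so the smallest one drives the traversal.
    sets = sorted((prefix_set(inverted.get(k, [])) for k in query), key=len)
    shortest, rest = sets[0], sets[1:]
    kids = children_index(shortest)

    def is_common_ancestor(v: Dewey) -> bool:
        return all(v in s for s in rest)

    root: Dewey = (1,)  # assumption: the document root is labelled (1,)
    if root not in shortest or not is_common_ancestor(root):
        return []

    results: List[Dewey] = []
    stack = [root]
    while stack:
        v = stack.pop()
        ca_children = [c for c in kids.get(v, []) if is_common_ancestor(c)]
        if ca_children:
            stack.extend(ca_children)  # a smaller common ancestor exists below v
        else:
            results.append(v)  # no child is a common ancestor, so v is an SLCA
    return sorted(results)


# Hypothetical inverted lists: keyword -> Dewey labels of nodes that contain it.
inverted = {
    "xml": [(1, 1, 1), (1, 2, 1), (1, 3)],
    "query": [(1, 1, 2), (1, 2, 1, 1), (1, 2, 2)],
}
print(slca_top_down(inverted, ["xml", "query"]))
```

In this sample, node `(1, 2)` is a common ancestor but not an SLCA, because its child `(1, 2, 1)` already covers both keywords. The sketch tests membership with hash sets for brevity, so it does not exhibit the logarithmic factor in the complexity bound quoted above.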

ZHOU Junfeng, TIAN Shanshan, LAN Guoxiang, CHEN Ziyang, GUO Jingfeng. TDCOL: XML Keyword Query Processing Strategy Based on Column Storage[J]. Journal of Frontiers of Computer Science and Technology, 2012, 6(9): 829-843.
