Journal of Frontiers of Computer Science and Technology, 2016, Vol. 10, Issue (8): 1080-1091. DOI: 10.3778/j.issn.1673-9418.1509011


Structural Join Processing for XML Based on MapReduce

LI Dong+, DENG Zehang, LI Zuli   

  1. School of Software Engineering, South China University of Technology, Guangzhou 510006, China
  • Online: 2016-08-01 Published: 2016-08-09


李  东+,邓泽航,李祖立   

  1. 华南理工大学 软件学院,广州 510006

Abstract: Extensible markup language (XML) has become the de facto standard for data representation and data exchange on the Web. Hadoop is a typical framework for cloud computing and big data processing, so studying XML query processing based on MapReduce is necessary. To implement XML query processing on MapReduce, this paper implements three XML encoding schemes (interval encoding, prefix encoding and hierarchy encoding) and designs the corresponding MapReduce-based structural join algorithms to support XML queries. It also sets up a cost model for query processing and proposes a cost-based approach to determine the optimal execution tree. Experiments on XML query processing show that query processing based on interval encoding performs better than query processing based on the other two encoding schemes, and that the cost-based optimization approach is effective and further improves the performance of XML query processing.

Key words: extensible markup language (XML), structural join, MapReduce
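
As an illustration of the interval encoding and structural join named in the abstract, the following is a minimal single-process sketch in Python. It is not the paper's algorithm: the function names (`interval_encode`, `is_ancestor`, `structural_join`), the (tag, start, end, level) tuple layout and the naive nested-loop "reduce" step are all assumptions made for this example, and the map and reduce phases are only simulated in memory rather than run on Hadoop.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def interval_encode(xml_text):
    """Label every element with (tag, start, end, level) via a depth-first walk.

    Hypothetical helper: start/end come from one shared counter, so an
    ancestor's interval strictly contains the intervals of its descendants.
    """
    root = ET.fromstring(xml_text)
    counter = 0
    nodes = []

    def visit(elem, level):
        nonlocal counter
        counter += 1
        start = counter
        for child in elem:
            visit(child, level + 1)
        counter += 1
        nodes.append((elem.tag, start, counter, level))

    visit(root, 0)
    return nodes


def is_ancestor(a, d):
    """Ancestor-descendant test on interval labels: a's interval contains d's."""
    return a[1] < d[1] and d[2] < a[2]


def structural_join(nodes, anc_tag, desc_tag):
    """Simulated map/reduce join returning (ancestor start, descendant start) pairs."""
    # "Map" phase (simulated): route each node to the ancestor or descendant list.
    groups = defaultdict(list)
    for node in nodes:
        if node[0] == anc_tag:
            groups["A"].append(node)
        if node[0] == desc_tag:
            groups["D"].append(node)
    # "Reduce" phase (simulated): naive nested-loop containment check.
    return sorted(
        (a[1], d[1])
        for a in groups["A"]
        for d in groups["D"]
        if is_ancestor(a, d)
    )


if __name__ == "__main__":
    doc = "<lib><book><title/><author/></book><book><title/></book></lib>"
    print(structural_join(interval_encode(doc), "book", "title"))
```

With interval labels, the ancestor-descendant relationship reduces to two integer comparisons per candidate pair, which is one plausible reason such an encoding can be cheap to join on; the sketch makes no claim about the cost model or execution-tree selection described in the paper.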

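Abstract (Chinese version): Extensible markup language (XML) has become the de facto standard for data representation and data exchange on the Web, and Hadoop has become one of the typical supporting frameworks for cloud computing and big data processing, so implementing XML query processing on Hadoop MapReduce is necessary. To implement MapReduce-based XML query processing, three XML data encoding schemes (interval encoding, prefix encoding and hierarchy encoding) are first implemented, and on this basis MapReduce-based XML structural join processing is studied and implemented. A cost model is established for query processing, and an optimized query plan tree is obtained through cost estimation. Finally, an experimental evaluation of XML query processing is carried out; the results show that, compared with the other two XML encoding schemes, query processing under interval encoding is faster, and that the optimization method based on cost estimation further improves XML query processing performance.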

Key words (Chinese version): extensible markup language (XML), structural join, MapReduce