Journal of Frontiers of Computer Science and Technology, 2016, Vol. 10, Issue (7): 948-958. DOI: 10.3778/j.issn.1673-9418.1509010

• Database Technology •

Research on Semantic Data Query Method Based on Hadoop

HU Zhigang (胡志刚), JING Dongmei (景冬梅), CHEN Bailin (陈柏林), YANG Liu (杨柳)+

  1. College of Software Engineering, Central South University, Changsha 410073, China
  • Online: 2016-07-01 Published: 2016-07-01

Abstract: To query large-scale RDF (Resource Description Framework) data efficiently, this paper studies how RDF triples are stored in the distributed database HBase and designs a two-stage, MapReduce-based query strategy consisting of a SPARQL (SPARQL Protocol and RDF Query Language) preprocessing stage and a distributed query execution stage. The preprocessing stage implements JOVR (join on variable relation), a query partitioning algorithm based on the degree of relation between SPARQL variables: it computes the relation degree of the variables in a SPARQL query to determine the order in which the join variables are processed and then, according to those join variables, assigns the joins between SPARQL clauses to the minimum number of MapReduce jobs. The distributed query execution stage runs the MapReduce jobs produced by the preprocessing stage, querying the large-scale RDF data in parallel. Experiments on the LUBM benchmark dataset indicate that JOVR queries large-scale RDF data efficiently, with good stability and scalability.

Key words: parallel processing, semantic information query strategy, MapReduce, SPARQL Protocol and RDF Query Language (SPARQL), large-scale RDF
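The abstract does not spell out how JOVR scores variables or assigns joins to jobs, so the following Python sketch is only an illustration of the general idea, not the authors' algorithm. It assumes a hypothetical relation degree (the number of relations in which a variable occurs) and a hypothetical greedy rule (a job may join on several variables as long as the relations they touch do not overlap). The function name `plan_jobs` and the tuple encoding of triple patterns are likewise assumptions made for this example.

```python
from collections import defaultdict


def plan_jobs(patterns):
    """Greedily group join variables into rounds (one round = one MapReduce job).

    Illustrative only: the scoring and grouping rules are assumptions,
    not the JOVR definition from the paper.
    """
    # Each relation is the set of variables it binds; start with one per triple pattern.
    relations = [frozenset(t for t in tp if t.startswith("?")) for tp in patterns]
    jobs = []
    while len(relations) > 1:
        # For every variable, collect the indices of the relations that mention it.
        occurrences = defaultdict(set)
        for i, rel in enumerate(relations):
            for var in rel:
                occurrences[var].add(i)
        # A join variable is one shared by at least two relations.
        shared = {v: idx for v, idx in occurrences.items() if len(idx) >= 2}
        if not shared:
            break  # the remaining relations share no variable
        used, chosen = set(), []
        # Assumed relation degree: how many relations a variable occurs in.
        for var in sorted(shared, key=lambda v: (-len(shared[v]), v)):
            if shared[var].isdisjoint(used):
                chosen.append(var)
                used |= shared[var]
        # Relations joined on a chosen variable become one intermediate relation.
        merged = [frozenset().union(*(relations[i] for i in shared[v])) for v in chosen]
        relations = [r for i, r in enumerate(relations) if i not in used] + merged
        jobs.append(chosen)
    return jobs


# A query in the style of LUBM Query 2, written as (subject, predicate, object) tuples.
lubm_q2 = [
    ("?x", "rdf:type", "ub:GraduateStudent"),
    ("?y", "rdf:type", "ub:University"),
    ("?z", "rdf:type", "ub:Department"),
    ("?x", "ub:memberOf", "?z"),
    ("?z", "ub:subOrganizationOf", "?y"),
    ("?x", "ub:undergraduateDegreeFrom", "?y"),
]
print(plan_jobs(lubm_q2))
```

Under these assumptions, two joins on unrelated variables land in the same job, while the cyclic query above needs one job per variable because every variable shares a pattern with the others. A real planner would also have to account for HBase table layout and intermediate result sizes, which this sketch ignores.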