计算机科学与探索 ›› 2018, Vol. 12 ›› Issue (9): 1361-1371.DOI: 10.3778/j.issn.1673-9418.1709040

• 数据库技术 • 上一篇    下一篇

Spark GraphX上的SPARQL查询处理算法

邱    慧+,邹兆年   

  1. 哈尔滨工业大学 计算机科学与技术学院,哈尔滨 150001
  • 出版日期:2018-09-01 发布日期:2018-09-10

SPARQL Query Processing Algorithm on Spark GraphX

QIU Hui+, ZOU Zhaonian   

  1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
  • Online:2018-09-01 Published:2018-09-10

摘要: 资源描述框架(resource description framework,RDF)由于其表示的灵活性和天然的图数据模型而变得越来越流行。与此同时,RDF数据的数据量也在以惊人的速度增长。由于数据量的增长,在单机上存储和查询RDF数据变得越来越不方便,从而激发了分布式存储查询的需求。学术界在分布式存储查询系统,例如Hadoop、Spark上已经做了大量的工作。基于Hadoop的分布式存储查询方式的主要缺点是中间结果需要被写回磁盘,从而产生大量的I/O操作。提出了一种新的在Spark GraphX上进行SPARQL查询评估的方法SQX,将RDF数据视为一个带标签的属性图,提出了一种新的查询计划生成方案并且通过图并行的方式实现SPARQL查询评估。SQX采用了一种“查询树匹配”+“结果过滤”的方法。针对每一个SPARQL查询,产生相应的查询树和约束条件。在每一轮的超级步中,查询树中的多条边可以被并行处理,对迭代执行完毕后的结果进行过滤,满足约束条件的将作为最终的结果。实验结果表明,算法能够有效处理SPARQL查询并且具有良好的可扩展性。

关键词: 属性图, SPARQL查询, Spark GraphX, 查询树

Abstract: RDF (resource description framework) has become more and more popular due to its flexible and universal graph-like data model, and the size of RDF data is increasing in a rapid speed as well. It becomes infeasible and difficult to store and process RDF data on a single machine, which raises the requirement for the distributed parallel approaches. There are many approaches have been put forward on parallel computation framework, e.g., Hadoop and Spark for dealing with RDF storage and query. The major problem of the Hadoop-based methods is that the     intermediate results have to be written back to the disk, thereby incurring a lot of I/O operations. This paper proposes a novel method called SQX (SPARQL query on GraphX) for evaluating SPARQL (simple protocol and RDF query language) queries on Spark GraphX. By treating RDF data as a large property graph, it develops a new method to generate optimal query plans for SPARQL queries and executes them in a graph-parallel way. The SQX method adopts “query tree matching + result filtering” approach. Firstly, it generates the query tree and constraints from the SPARQL query and then in each super step, multiple query edges in the tree can be matched in a parallel fashion, and the final result will be generated from the iteration result which satisfies the constraints. Extensive experiments verify that this method is efficient and scalable in processing SPARQL queries on large-scale RDF data.

Key words: property graph, SPARQL query, Spark GraphX, query tree