计算机科学与探索 ›› 2019, Vol. 13 ›› Issue (5): 775-787.DOI: 10.3778/j.issn.1673-9418.1807039

• 数据库技术 • 上一篇    下一篇

海量数据上有效的top-k Skyline查询算法

韩希先+,宋  翠,戈韵如,高  宏,李建中   

  1. 哈尔滨工业大学 计算机科学与技术学院,哈尔滨 150001
  • 出版日期:2019-05-01 发布日期:2019-05-08

Efficient top-k Skyline Query Algorithm on Massive Data

HAN Xixian+, SONG Cui, GE Yunru, GAO Hong, LI Jianzhong   

  1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
  • Online:2019-05-01 Published:2019-05-08

摘要: 在许多应用中,Skyline查询是一种十分重要的查询类型,它在潜在的巨大的数据空间中返回不被其他元组支配的用户感兴趣的元组,但是Skyline查询无法控制返回结果的数量。处理一个新的top-k Skyline查询问题,该查询返回支配分数最大的k个Skyline元组,从而控制了需要向用户返回的查询结果数量。分析发现,大多数现有算法忽略了利用支配分数作为限制Skyline查询的结果数量的度量。提出一个新的基于表扫描的RSTS(ranked Skyline with table scan)算法来有效计算海量数据上的top-k Skyline结果。RSTS算法首先对表执行预排序操作,保证预排序表的元组按照对有序列表的round-robin扫描的顺序排列。RSTS算法包括两个阶段。阶段1利用对预排序表的顺序扫描来获得候选元组。阶段2计算候选元组的支配分数并返回结果。可以证明,RSTS算法具有早结束特性,并给出其扫描深度的理论分析。提出对于候选元组的剪切操作,理论剪切效果表明,绝大多数的Skyline结果可以直接丢弃。实验结果表明,RSTS算法可以有效计算top-k Skyline结果。

关键词: 海量数据, top-k Skyline, RSTS算法, 表扫描, 剪切操作

Abstract: In many applications, Skyline is an important operation to return a set of interesting tuples which are not dominated by other tuples in a potentially huge data space. The problem of Skyline is that it cannot control the size of Skyline results. This paper deals with a novel top-k Skyline query, which returns k Skyline tuples with the largest domination scores. It is found that the majority of existing algorithms ignore the application of domination scores as a metric of size limitation in Skyline query. This paper proposes a novel table-scan-based algorithm RSTS (ranked Skyline with table scan) to compute top-k Skyline query on massive data efficiently. RSTS first presorts table to arrange the tuples in the order of round-robin retrieval on sorted lists. The execution of RSTS consists of two phases. In phase 1, RSTS scans the presorted table sequentially and maintains the candidate tuples. In phase 2, RSTS calculates the domination scores of the candidates and returns query results. It is proven that RSTS has the characteristic of early termination, along with the theoretical analysis of scan depths. The pruning rule for candidate tuples is devised in this paper. The theoretical pruning effect shows that majority of the Skyline results can be discarded directly. The extensive experimental results show that RSTS can compute top-k Skyline results efficiently.

Key words: massive data, top-k Skyline, ranked Skyline with table scan (RSTS) algorithm, table scan, pruning operation