Journal of Frontiers of Computer Science and Technology ›› 2015, Vol. 9 ›› Issue (8): 897-905.DOI: 10.3778/j.issn.1673-9418.1412008

Previous Articles     Next Articles

Algorithm of Parallel Top-k Skyline Queries for Large Data Set

YANG Linqing1,2,3, LI Zhan1,2,3, MOU Yanchao3,4 , FAN Lilue3,4, LI Hongyan3,4+, WANG Tengjiao2,3, LEI Kai1   

  1. 1. Shenzhen Key Lab for Cloud Computing Technology and Applications, School of Electronics and Computer Engineering, Peking University, Shenzhen, Guangdong 518055, China
    2. Key Laboratory of High Confidence Software Technologies of Ministry of Education, Peking University, Beijing 100871, China
    3. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China
    4. Key Laboratory of Machine Perception of Ministry of Education, Peking University, Beijing 100871, China
  • Online:2015-08-01 Published:2015-08-06

面向大规模数据集的并行化Top-k Skyline查询算法

杨林青1,2,3,李  湛1,2,3,牟雁超3,4,樊里略3,4,李红燕3,4+,王腾蛟2,3,雷  凯1   

  1. 1. 北京大学 信息工程学院 深圳市云计算关键技术与应用重点实验室,广东 深圳 518055
    2. 北京大学 高可信软件技术教育部重点实验室,北京 100871
    3. 北京大学 信息科学技术学院,北京 100871
    4. 北京大学 机器感知与智能教育部重点实验室,北京 100871

Abstract: As data of an unprecedented scale are becoming accessible, it becomes more and more important to help user identify the representative information of a manageable size. Top-k Skyline queries can find the most k representative information, and it can also control the size of the results. So the Top-k Skyline queries meet the above-mentioned requirements. However, the traditional method of Top-k Skyline query has a low efficiency when it meets a large scale of data set. In order to speed the efficiency of the query, parallel processing comes into being. Via combining the Top-k Skyline query with the parallel processing, this paper proposes a novel algorithm of parallel Top-k Skyline queries for large scale of data set, named PTKS (parallel Top-K Skyline). With exploiting the advantage of parallel computing, making the efficient query by using parallel processing, and designing a filter rule based on user preferences to reduce data size, this algorithm satisfies the needs of users on some aspects. Through the experiments on the factual data sets, compared with the existing methods, the PTKS can be applied to the large scale of data set and is superior to the other existing algorithms in large data set.

Key words: large data set, Top-k Skyline, representative information, parallel processing, filter rule

摘要: 随着数据规模的日益庞大,在大规模数据集中帮助用户定位出数据量可控的代表性信息显得越发重要。虽然Top-k Skyline查询能够找到数据集中前k个最具代表性的信息,在获取代表性信息的同时又控制了结果规模,满足了上述要求,但是现有的Top-k Skyline查询在面对大规模数据集时效率较低,并不适用于大规模数据集。为了解决这个问题,将Top-k Skyline查询与并行化处理相结合,提出了一种面向大规模数据集的并行化Top-k Skyline查询算法PTKS(parallel Top-k Skyline),通过充分利用分布式资源,将原有查询进行有效的并行化处理,同时设计了基于用户偏好的用于缩减结果数据量的筛选规则,满足用户需求。在真实数据集上进行了相关实验,并与现有方法进行了对比,结果表明PTKS在大规模数据集上的查询效率更具有优势,能很好地适用于大规模数据集。

关键词: 大规模数据集, Top-k Skyline, 代表性信息, 并行化处理, 筛选规则