Journal of Frontiers of Computer Science and Technology, 2022, Vol. 16, Issue (1): 106-119. DOI: 10.3778/j.issn.1673-9418.2009099

• Database Technology •

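Optimization Method of Projection and Order for Multiple Tables Join

ZONG Fengbo1, ZHAO Yuhai1,+, WANG Guoren2, JI Hangxu1

  1. School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China
  2. School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China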
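  • Received: 2020-08-06 Revised: 2020-10-16 Online: 2022-01-01 Published: 2020-11-06
  • Corresponding author: + E-mail: zhaoyuhai@mail.neu.edu.cn
  • About the authors: ZONG Fengbo (born 1995), male, from Tangshan, Hebei, M.S. candidate. His main research interest is big data.
    ZHAO Yuhai (born 1975), male, from Anshan, Liaoning, Ph.D., professor, Ph.D. supervisor, senior member of CCF. His main research interests are data mining and machine learning.
    WANG Guoren (born 1966), male, from Chongyang, Hubei, Ph.D., professor, Ph.D. supervisor, senior member of CCF. His main research interests include uncertain data management and distributed query processing and optimization technologies.
    JI Hangxu (born 1990), male, from Shenyang, Liaoning, Ph.D. candidate. His main research interests are distributed computing and network representation learning.
  • Supported by:
    National Key Research and Development Program of China, Ministry of Science and Technology (2018YFB1004402); National Natural Science Foundation of China (61772124)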

Optimization Method of Projection and Order for Multiple Tables Join

ZONG Fengbo1, ZHAO Yuhai1,+, WANG Guoren2, JI Hangxu1

  1. School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China
  2. School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
  • Received: 2020-08-06 Revised: 2020-10-16 Online: 2022-01-01 Published: 2020-11-06
  • About the authors: ZONG Fengbo, born in 1995, M.S. candidate. His research interest is big data.
    ZHAO Yuhai, born in 1975, Ph.D., professor, Ph.D. supervisor, senior member of CCF. His research interests include data mining and machine learning.
    WANG Guoren, born in 1966, Ph.D., professor, Ph.D. supervisor, senior member of CCF. His research interests include uncertain data management and distributed query processing and optimization technologies.
    JI Hangxu, born in 1990, Ph.D. candidate. His research interests include distributed computing and network representation learning.
  • Supported by:
    National Key Research and Development Program of China (2018YFB1004402); National Natural Science Foundation of China (61772124)

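Abstract:

Multi-table join is a common operation in big data processing. As with join operations in databases, the order in which multiple tables are joined has a large impact on the consumption of computing and transmission resources. Optimizing the multi-table join order is a classic optimization problem; in addition, the size of the projected result of the tables in each join affects the volume of data transmitted between nodes. Both the overall join order and the projection chosen for each join therefore significantly affect join efficiency. Traditional optimization strategies, however, often ignore the choice of intermediate projections and the effect of those projections on the optimal join strategy. To address this problem, a connection relation index is built that can adjust the projection of each join while the optimized join strategy is being constructed, removing redundant columns promptly and reducing the consumption of transmission resources. The join order is then adjusted on the basis of the optimized projections so that, from a global perspective, the consumption of transmission and computing resources is reduced together as far as possible. The optimization strategy was implemented in the Flink system and evaluated experimentally; the results show a significant optimization effect.

Keywords: big data, join optimization, projection optimization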

Abstract:

Multi-table join is a common operation in big data processing. As with join operations in databases, the order in which multiple tables are joined has a great impact on the consumption of computing and transmission resources. Optimizing the join order of multiple tables is a classical optimization problem, and the size of the projection result of the tables in each join also affects the volume of data transmitted between nodes. Therefore, both the overall join order and the projection relation of each join have a significant impact on join efficiency. Traditional optimization strategies, however, often do not consider the choice of intermediate projection relations or the influence of those relations on the optimal join strategy. To solve this problem, this paper establishes a connection relation index, which can adjust the projection relation of each join while the optimized join strategy is being constructed, delete redundant columns in time, and reduce the consumption of transmission resources. At the same time, the join order is adjusted based on the optimized projection relations, so that the consumption of transmission and computing resources is reduced as much as possible from a global perspective. The optimization strategy is implemented in the Flink system and tested experimentally, and the results show that it has a significant optimization effect.

Key words: big data, join optimization, projection optimization
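
To make the interaction between join order and intermediate projection concrete, the following is a minimal Python sketch. It is not the paper's connection relation index and not its Flink implementation. The tables `A`, `B`, `C`, the cost measure `cells` (rows times columns of each intermediate result) and the helpers `run_plan` and `best_plan` are all hypothetical, chosen only for illustration. The sketch assumes non-empty tables and that each join step shares exactly one column.

```python
from itertools import permutations


def join(left, right, key):
    """Hash join of two lists of dict rows on the shared column `key`."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def project(rows, keep):
    """Keep only the columns named in `keep`, dropping redundant ones."""
    return [{c: v for c, v in row.items() if c in keep} for row in rows]


def cells(rows):
    """Toy transfer cost (an assumption): total number of cells in a relation."""
    return sum(len(row) for row in rows)


def run_plan(order, tables, output_cols, prune):
    """Run a left-deep plan; return (total intermediate cells, result) or None.

    With prune=True, each intermediate result keeps only the output columns
    plus the columns of tables that are still waiting to be joined.
    """
    remaining = list(order)
    current = tables[remaining.pop(0)]
    cost = 0
    while remaining and current:
        nxt = tables[remaining.pop(0)]
        shared = set(current[0]) & set(nxt[0])
        if len(shared) != 1:
            return None  # skip cross products and ambiguous join keys
        current = join(current, nxt, shared.pop())
        if prune:
            needed = set(output_cols)
            for name in remaining:
                needed |= set(tables[name][0])
            current = project(current, needed)
        cost += cells(current)
    return cost, current


def best_plan(tables, output_cols, prune=True):
    """Exhaustively pick the cheapest valid left-deep order (toy sizes only)."""
    scored = []
    for order in permutations(sorted(tables)):
        result = run_plan(order, tables, output_cols, prune)
        if result is not None:
            scored.append((result[0], order))
    return min(scored)


# Hypothetical data: A joins B on b_id, B joins C on c_id.
TABLES = {
    "A": [{"a_id": i, "b_id": i % 2, "a_pad": "x"} for i in range(4)],
    "B": [{"b_id": j, "c_id": j, "b_pad": "y"} for j in range(2)],
    "C": [{"c_id": 0, "c_val": "v"}],
}
OUTPUT = ["a_id", "c_val"]

for order in [("A", "B", "C"), ("B", "C", "A")]:
    for prune in (False, True):
        cost, _ = run_plan(order, TABLES, OUTPUT, prune)
        print(order, "prune" if prune else "no prune", cost)
print("cheapest pruned plan:", best_plan(TABLES, OUTPUT))
```

In this toy setting, both the order and the pruning change the total intermediate size, and pruning changes how much each order costs, which is the coupling the abstract describes. The exhaustive search over permutations is used only because the example has three tables; it is not a claim about how the paper searches for a plan.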

CLC Number: