Journal of Frontiers of Computer Science and Technology ›› 2022, Vol. 16 ›› Issue (2): 337-347. DOI: 10.3778/j.issn.1673-9418.2009100

• Database Technology •

Optimization for Large-Scale Dimension Table Connection Technology in Distributed Environment

ZHAO Hengtai1, ZHAO Yuhai1,+, YUAN Ye2, JI Hangxu1, QIAO Baiyou1, WANG Guoren2

  1. School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China
  2. School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
  • Received: 2020-08-06 Revised: 2020-10-14 Published: 2022-02-01 Online: 2020-11-05
  • Corresponding author: + E-mail: zhaoyuhai@ise.neu.edu.cn
  • About author: ZHAO Hengtai, born in 1996 in Luoyang, Henan, M.S. candidate. His research interests include distributed data management, distributed computing, etc.
    ZHAO Yuhai, born in 1975 in Anshan, Liaoning, Ph.D., professor, Ph.D. supervisor. His research interests include machine learning, social network analysis, etc.
    YUAN Ye, born in 1981 in Shenyang, Liaoning, Ph.D., professor. His research interests include graph databases, probabilistic databases, social network analysis, etc.
    JI Hangxu, born in 1990 in Shenyang, Liaoning, Ph.D. candidate. His research interests include graph embedding, distributed computing, etc.
    QIAO Baiyou, born in 1970 in Li County, Gansu, Ph.D., associate professor, Ph.D. supervisor. His research interests include cloud computing, virtualization technology, big data, spatial data management, etc.
    WANG Guoren, born in 1966 in Chongyang, Hubei, Ph.D., professor, Ph.D. supervisor. His research interests include XML data management, query processing and optimization, high-dimensional indexing, parallel database systems, P2P data management, etc.
  • Supported by:
    National Key Research and Development Program of China (2018YFB1004402); National Key Research and Development Program of China (2016YFC1401900)

Abstract:

Large-scale dimension table connection in distributed environments is one of the key technologies of online big data analysis and is widely used in real-time recommendation, real-time analysis and other fields. Dimension table connection associates stream data with dimension table data stored offline and processes the data according to that association. Firstly, this paper studies existing dimension table connection schemes and surveys the relevant optimization techniques and the design of mainstream distributed engines. These approaches improve performance mainly by optimizing dimension table data queries, but such traditional optimization is limited by the scale of the dimension table and the rate of the data stream. Secondly, to address the fact that existing optimization techniques do not make full use of the cluster as a whole in a distributed environment, this paper puts forward a computing model suitable for hybrid computation over offline batch data and real-time stream data. It then proposes a data caching method for dimension table connection that reads the dimension table data from a single node, partitions it, and then distributes it for computation. The paper also optimizes the computing logic of dimension table connection so that it can handle larger dimension tables and is no longer restricted to joining data. Finally, both the proposed technique and the traditional dimension table connection technique are implemented on the stream computing engine Apache Flink. Experiments comparing throughput and latency on data generated by Alibaba's Double 11 shopping festival demonstrate the effectiveness of the proposed optimization for dimension table connection in distributed stream computing.

Key words: distributed computing, dimension table connection, cache technology, Apache Flink
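
As a rough illustration of the idea summarized in the abstract, the following minimal Python sketch reads a dimension table once, partitions its rows by key, and routes each stream record to the partition that holds the matching row. It is not the paper's Apache Flink implementation: the function names (`route`, `partition_dimension_table`, `join_stream`), the modulo routing rule, the two-column schema, and the sample data are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

DimRow = Tuple[int, str]  # (key, attribute): hypothetical dimension table schema
Event = Tuple[int, str]   # (key, payload): hypothetical stream record schema


def route(key: int, num_workers: int) -> int:
    """Choose the worker responsible for a key (modulo routing is an assumption)."""
    return key % num_workers


def partition_dimension_table(
    rows: Iterable[DimRow], num_workers: int
) -> List[Dict[int, str]]:
    """Read the dimension table once and split it into one in-memory slice per worker."""
    slices: List[Dict[int, str]] = [{} for _ in range(num_workers)]
    for key, attr in rows:
        slices[route(key, num_workers)][key] = attr
    return slices


def join_stream(
    events: Iterable[Event], slices: List[Dict[int, str]]
) -> Iterator[Tuple[int, str, Optional[str]]]:
    """Send each event to the slice that owns its key and look the key up locally."""
    num_workers = len(slices)
    for key, payload in events:
        local_slice = slices[route(key, num_workers)]
        # A missing key yields None, mirroring an unmatched dimension lookup.
        yield key, payload, local_slice.get(key)


# Toy data standing in for a dimension table and a stream of events.
dim_rows = [(1, "books"), (2, "toys"), (3, "food")]
slices = partition_dimension_table(dim_rows, num_workers=2)
events = [(2, "click"), (3, "buy"), (9, "click")]
result = list(join_stream(events, slices))
```

Because the table and the stream are routed with the same rule, each worker only needs its own slice of the dimension table rather than a full copy, which is the property that lets this style of design scale to larger tables.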

CLC Number: