Optimization for Large-Scale Dimension Table Connection Technology in Distributed Environment

Journal of Frontiers of Computer Science and Technology, 2022, Vol. 16, Issue (2): 337-347. DOI: 10.3778/j.issn.1673-9418.2009100

ZHAO Hengtai¹, ZHAO Yuhai¹,⁺, YUAN Ye², JI Hangxu¹, QIAO Baiyou¹, WANG Guoren²

1. School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China
2. School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China

Received: 2020-08-06 Revised: 2020-10-14 Online: 2022-02-01 Published: 2020-11-05
Corresponding author: ⁺ E-mail: zhaoyuhai@ise.neu.edu.cn
About author: ZHAO Hengtai, born in 1996 in Luoyang, Henan, M.S. candidate. His research interests include distributed data management, distributed computing, etc.
ZHAO Yuhai, born in 1975 in Anshan, Liaoning, Ph.D., professor, Ph.D. supervisor. His research interests include machine learning, social network analysis, etc.
YUAN Ye, born in 1981 in Shenyang, Liaoning, Ph.D., professor. His research interests include graph databases, probabilistic databases, social network analysis, etc.
JI Hangxu, born in 1990 in Shenyang, Liaoning, Ph.D. candidate. His research interests include graph embedding, distributed computing, etc.
QIAO Baiyou, born in 1970 in Li County, Gansu, Ph.D., associate professor, Ph.D. supervisor. His research interests include cloud computing, virtualization technology, big data, spatial data management, etc.
WANG Guoren, born in 1966 in Chongyang, Hubei, Ph.D., professor, Ph.D. supervisor. His research interests include XML data management, query processing and optimization, high-dimensional indexing, parallel database systems, P2P data management, etc.
Supported by:
National Key Research and Development Program of China (2018YFB1004402); National Key Research and Development Program of China (2016YFC1401900)

Abstract:

Large-scale dimension table connection in a distributed environment is one of the key technologies of online big data analysis and is widely used in real-time recommendation, real-time analysis, and related fields. Dimension table connection joins stream data with dimension table data stored offline and processes the data according to that association. First, this paper studies existing dimension table connection schemes and surveys the related optimization techniques and the design of mainstream distributed engines. These approaches improve performance mainly by optimizing queries against the dimension table, but such traditional optimization is limited by the scale of the dimension table and the rate of the data stream. Second, because existing optimization techniques make insufficient use of the cluster as a whole in a distributed environment, this paper puts forward a computing model suited to hybrid computation over offline batch data and real-time stream data. It then proposes a dimension table connection method in which the dimension table data is read at a single point, split, and then distributed for computation, and it optimizes the connection logic so that it can handle larger dimension tables and is no longer restricted to joining data. Finally, both the proposed technique and the traditional dimension table connection technique are implemented on the stream computing engine Apache Flink. Experiments on data generated during Alibaba's Double 11 Shopping Carnival compare throughput and latency and demonstrate the effectiveness of the optimization for dimension table connection in distributed stream computing.

Key words: distributed computing, dimension table connection, cache technology, Apache Flink
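The abstract describes reading the dimension table at a single point, splitting it, and distributing the slices so that each worker joins only the stream records whose keys it owns. The following is a minimal single-process sketch of that idea, not the authors' Flink implementation. The function names (`partition_of`, `split_dimension_table`, `join_stream`), the CRC32-based hash, the inner-join semantics, and the sample rows are all assumptions made for illustration.

```python
import zlib
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, str]


def partition_of(key: str, num_partitions: int) -> int:
    """Map a join key to a partition using a stable hash (CRC32, assumed here)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def split_dimension_table(
    rows: Iterable[Row], key_field: str, num_partitions: int
) -> List[Dict[str, Row]]:
    """Read the dimension table once and split it into per-worker slices."""
    slices: List[Dict[str, Row]] = [{} for _ in range(num_partitions)]
    for row in rows:
        key = row[key_field]
        slices[partition_of(key, num_partitions)][key] = row
    return slices


def join_stream(
    stream: Iterable[Row], slices: List[Dict[str, Row]], key_field: str
) -> Iterator[Row]:
    """Route each stream record to the slice that owns its key and join locally.

    Records without a matching dimension row are dropped (inner join, assumed).
    """
    num_partitions = len(slices)
    for record in stream:
        key = record[key_field]
        dim_row = slices[partition_of(key, num_partitions)].get(key)
        if dim_row is not None:
            yield {**dim_row, **record}


if __name__ == "__main__":
    # Hypothetical sample data: a user dimension table and a click stream.
    users = [
        {"user_id": "u1", "city": "Shenyang"},
        {"user_id": "u2", "city": "Beijing"},
    ]
    clicks = [
        {"user_id": "u2", "item_id": "i9"},
        {"user_id": "u3", "item_id": "i1"},
        {"user_id": "u1", "item_id": "i4"},
    ]
    slices = split_dimension_table(users, "user_id", num_partitions=4)
    for joined in join_stream(clicks, slices, "user_id"):
        print(joined)
```

Because the table slices and the stream records are routed with the same hash, each worker would need to hold only its own slice in memory rather than a full copy or a per-worker cache of the whole table. In a real cluster the routing step would be a network shuffle; here it is reduced to a list index.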

CLC number: TP311

ZHAO Hengtai, ZHAO Yuhai, YUAN Ye, JI Hangxu, QIAO Baiyou, WANG Guoren. Optimization for Large-Scale Dimension Table Connection Technology in Distributed Environment[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(2): 337-347.

Figures/Tables 10

Fig.1 Traditional dimension table connection

Fig.2 Optimized dimension table connection

Fig.3 Batch computing architecture

Fig.4 Stream computing architecture

Fig.5 Mixed computing architecture

Table 1 Size of data table

Table name	Records/10⁴	Data size/MB
User information table	1 000	11 969.00
Product information table	1 000	11 115.00
User click data table	10 000	11 118.06

Fig.6 Statistics of experiment 1 mean throughput

Fig.7 Statistics of experiment 2 mean throughput

Fig.8 Experiment 1 mean delay

Fig.9 Experiment 2 mean delay


