Research on Construction Method of Distributed Data Lake Driven by Virtualization Model

Journal of Frontiers of Computer Science and Technology, 2019, Vol. 13, Issue (9): 1493-1503. DOI: 10.3778/j.issn.1673-9418.1906053

TAN Jingxin, LIU Yulong, LI Huijuan

North China Institute of Computing Technology, Beijing 100083, China

Online: 2019-09-01 Published: 2019-09-06

Abstract

Abstract: To address the wide distribution, variety of types, and high uncertainty of the business service objects of the Federation of Industry and Commerce, this paper proposes a distributed data lake construction method driven by a virtualization model. It presents the overall architecture of the distributed data lake and defines a data virtualization model suited to scattered, fragmented data collection scenarios, together with the model-driven collaboration process among databases. By building a virtualized global data index network, the method achieves inter-database routing and consistency among edge database nodes, secondary regional database nodes, and the central database node, forming a radial (spoke-type) distributed data lake that requires neither ETL nor centralization. This alleviates several problems of centralized data lake construction methods, including poor timeliness of data updates, large storage demand, heavy bandwidth consumption from frequently moving large volumes of data, and poor cost-effectiveness. Comparative estimates show that the proposed method meets both the big-data needs of the Federation's analysis services and the fresh-data needs of its real-time processing services, while reducing data movement costs and improving cost-effectiveness.

Key words: data virtualization, model driven, data lake, distributed
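
The abstract does not spell out how the global data index network routes requests, so the following is only a minimal sketch of the general idea under stated assumptions: records stay at the edge tier, while the regional and central tiers hold index entries only, so a query is routed central -> regional -> edge and read in place rather than copied by ETL. All class, method, and key names (`EdgeNode`, `RegionalNode`, `CentralNode`, `publish`, `query`, `member-001`) are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class EdgeNode:
    """Hypothetical edge database: the only tier that stores records."""
    name: str
    records: dict = field(default_factory=dict)

    def get(self, key):
        return self.records.get(key)


@dataclass
class RegionalNode:
    """Hypothetical regional node: indexes which edge holds each key."""
    name: str
    edges: dict = field(default_factory=dict)    # edge name -> EdgeNode
    index: dict = field(default_factory=dict)    # key -> edge name

    def attach(self, edge):
        self.edges[edge.name] = edge

    def resolve(self, key):
        edge_name = self.index.get(key)
        return None if edge_name is None else self.edges[edge_name].get(key)


@dataclass
class CentralNode:
    """Hypothetical central node: indexes which region holds each key."""
    regions: dict = field(default_factory=dict)  # region name -> RegionalNode
    index: dict = field(default_factory=dict)    # key -> region name

    def attach(self, region):
        self.regions[region.name] = region

    def publish(self, region_name, edge_name, key, value):
        """Write the record at the edge; propagate only index entries upward."""
        region = self.regions[region_name]
        region.edges[edge_name].records[key] = value
        region.index[key] = edge_name
        self.index[key] = region_name

    def query(self, key):
        """Route central -> regional -> edge and read the record in place."""
        region_name = self.index.get(key)
        if region_name is None:
            return None
        return self.regions[region_name].resolve(key)


# Usage: one region with one edge node; the record itself never leaves the edge.
central = CentralNode()
north = RegionalNode("north")
central.attach(north)
north.attach(EdgeNode("edge-1"))
central.publish("north", "edge-1", "member-001", {"status": "active"})
print(central.query("member-001"))
```

In this sketch the upper tiers store only key-to-location mappings, which is one simple way to picture why such a design could reduce storage and data movement compared with copying everything into a central store; the paper's actual model and collaboration process may differ.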

TAN Jingxin, LIU Yulong, LI Huijuan. Research on Construction Method of Distributed Data Lake Driven by Virtualization Model[J]. Journal of Frontiers of Computer Science and Technology, 2019, 13(9): 1493-1503.
