Journal of Frontiers of Computer Science and Technology ›› 2019, Vol. 13 ›› Issue (9): 1493-1503. DOI: 10.3778/j.issn.1673-9418.1906053

Research on Construction Method of Distributed Data Lake Driven by Virtualization Model

TAN Jingxin, LIU Yulong, LI Huijuan   

  1. North China Institute of Computing Technology, Beijing 100083, China
  • Online: 2019-09-01 Published: 2019-09-06

虚拟化模型驱动的分布式数据湖构建方法研究

谭景信, 刘玉龙, 李慧娟

  1. 华北计算技术研究所,北京 100083

Abstract: A construction method for a distributed data lake driven by a virtualization model is proposed to suit the business service objects of the Federation of Industry and Commerce, which are widely distributed, varied in type, and highly uncertain. The overall architecture of the distributed data lake is presented, and a data virtualization model adapted to scattered, fragmented data collection scenarios is defined together with the model-driven inter-database collaboration process. By building a virtualized global data indexing network, routing and coordination among edge database nodes, secondary regional database nodes, and central database nodes are realized, forming a radial distributed data lake that is both de-ETL and decentralized. The proposed method alleviates problems of centralized data lake construction, such as poor timeliness of data updates, large storage demand, heavy bandwidth consumption from frequently moving large volumes of data, and poor economy. Comparative estimates show that the proposed method meets both the big data needs of the Federation's analysis services and the fresh-data needs of its real-time processing services, while reducing data movement costs and improving economy.

Key words: data virtualization, model driven, data lake, distributed
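
The routing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Node` class, the `publish` and `lookup` functions, and the sample keys are all hypothetical names chosen for illustration. It assumes a three-tier hierarchy (edge, regional, central) in which records stay on the node that collected them and only their locations are registered in a virtual index, so that no ETL copy into a central store is needed.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A database node in a hypothetical three-tier hierarchy."""
    name: str
    tier: str                                    # "edge", "regional", or "central"
    parent: "Node | None" = None
    records: dict = field(default_factory=dict)  # data stays where it was collected
    index: dict = field(default_factory=dict)    # virtual index: key -> owning node name


def publish(node, key, value):
    """Store a record locally and register only its location up the hierarchy."""
    node.records[key] = value
    current = node
    while current is not None:
        current.index[key] = node.name
        current = current.parent


def lookup(start, key, nodes_by_name):
    """Walk up from `start` until an index knows the owner, then read from the owner."""
    current, hops = start, 0
    while current is not None:
        owner = current.index.get(key)
        if owner is not None:
            return nodes_by_name[owner].records[key], owner, hops
        current, hops = current.parent, hops + 1
    raise KeyError(key)


# Illustrative topology: one central node, two regional nodes, two edge nodes.
central = Node("central", "central")
north = Node("north", "regional", central)
south = Node("south", "regional", central)
edge1 = Node("edge-1", "edge", north)
edge2 = Node("edge-2", "edge", south)
nodes = {n.name: n for n in (central, north, south, edge1, edge2)}

# The record is written once at the edge; only index entries propagate upward.
publish(edge1, "enterprise:42", {"name": "Example Co."})
local_hit = lookup(edge1, "enterprise:42", nodes)
remote_hit = lookup(edge2, "enterprise:42", nodes)
```

In this sketch a query from `edge1` resolves in its own index, while a query from `edge2` climbs to the central index before being routed to the owning edge node. The payload itself is never copied to the regional or central tiers, which is the property the abstract describes as de-ETL.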

摘要: 提出了适应工商联业务服务对象分布广、类型多、不确定性强等特点的虚拟化模型驱动的分布式数据湖构建方法,给出了分布式数据湖的整体架构设计,定义了适应分散、碎片化数据收集场景的数据虚拟化模型和模型驱动下的数据库间协作流程;通过构建虚拟化的全局数据索引网络,实现边缘数据库节点、二级区域数据库节点和中央数据库节点的库间路由和协调一致,形成去ETL化和去中心化的辐射型分布式数据湖,缓解了集中式数据湖构建方法所存在的数据更新时效性差、存储需求量大、频繁搬运大量数据耗费大量带宽、经济性差等诸多问题。对比测算表明,所提方法既满足了工商联分析业务对大数据的需求,又很好满足了实时处理业务对鲜活数据的需要,减少了数据搬运成本,提升了经济性。

关键词: 数据虚拟化, 模型驱动, 数据湖, 分布式