Journal of Frontiers of Computer Science and Technology, 2011, Vol. 5, Issue 2: 161-169.

• Academic Research •

Development Method of MapReduce Oriented Data Flow Processing

YI Xiaohua (1,2), LIU Jie (3), YE Dan (1)

  1. Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
  2. Graduate University, Chinese Academy of Sciences, Beijing 100190, China
  3. Dept. of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China
  • Online: 2011-02-01 Published: 2011-02-01
  • Contact: YI Xiaohua

Abstract: In the age of information explosion, data flow processing is widely used and increasingly involves massive data volumes and parallel execution. MapReduce is well suited to the parallel processing of massive data because of its simplicity and cost-effectiveness, but it does not directly support data flows with multiple processing steps, multiple branches, and multiple data sets. This paper proposes a model-driven development method for data flow processing based on MapReduce. It first defines the logical and physical models of the data flow together with a component model. It then designs model transformation and code generation algorithms that convert the logical model into the physical model and, from there, into MapReduce program code that implements the functions defined by the logical model and runs directly on the Hadoop platform. Based on this method, a development tool, CloudDataFlow, is implemented. Experiments show that, compared with similar systems, it achieves higher performance, extensibility, and usability.

Key words: MapReduce, data flow processing, model-driven development, Hadoop platform