Journal of Frontiers of Computer Science and Technology ›› 2016, Vol. 10 ›› Issue (9): 1211-1220. DOI: 10.3778/j.issn.1673-9418.1509021

• Academic Research •

Automatic Component Selection and Parameter Configuration in Development of Big Data System

ZHONG Yu+ (钟雨), QIU Mingming (邱明明), HUANG Xiangdong (黄向东)

  1. School of Software, Tsinghua University, Beijing 100084, China
  • Online: 2016-09-01 Published: 2016-09-05

Abstract: A big data application system spans several technical stages, including data collection, storage, analysis, mining, and visualization. Each stage has many candidate solutions, several hundred systems are involved in total, and configuring them is complicated, all of which makes building a big data application system very challenging for an enterprise. To address component selection in the development of such systems, this paper establishes standardized requirement indicators and uses a decision tree model to select big data components automatically. Starting from several mainstream distributed storage systems and taking Cassandra as an example, it conducts experiments and fits a performance model over hardware parameters by multiple regression, then takes the user's requirements as input and uses the performance model to configure the system's hardware parameters. Finally, by studying each system's principles, architecture, characteristics, and application scenarios, it builds a knowledge base of software parameter configuration to guide the configuration of software parameters. Together, these steps address automatic component selection and parameter configuration in the development of big data systems.

Key words: big data system, component selection, decision tree model, parameter configuration, performance model
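
The hardware-configuration step summarized in the abstract (fit a performance model by multiple regression, then pick hardware parameters from the user's requirement) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the two hardware parameters (node count and memory per node), the synthetic benchmark values, the linear form of the model, the cost proxy, and the helper names `fit_performance_model`, `predict`, and `cheapest_config`.

```python
import numpy as np

# Hypothetical benchmark table (NOT results from the paper): each row is
# (node_count, memory_gb_per_node). The throughput values in ops/s are
# generated from a made-up linear rule, so the regression recovers it exactly.
hardware = np.array(
    [[3, 8], [3, 16], [6, 8], [6, 16], [9, 8], [9, 32]], dtype=float
)
throughput = 1000.0 + 2000.0 * hardware[:, 0] + 150.0 * hardware[:, 1]


def fit_performance_model(x, y):
    """Ordinary least-squares fit of y ~ b0 + b1*x1 + b2*x2 + ..."""
    design = np.column_stack([np.ones(len(x)), x])  # prepend intercept column
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs


def predict(coeffs, config):
    """Predicted throughput for one hardware configuration."""
    return float(coeffs[0] + np.dot(coeffs[1:], config))


def cheapest_config(coeffs, candidates, required_ops):
    """Return the feasible candidate with the least total memory, or None.

    Total memory (nodes * GB per node) is an assumed stand-in for cost.
    """
    feasible = [c for c in candidates if predict(coeffs, c) >= required_ops]
    if not feasible:
        return None
    return min(feasible, key=lambda c: (c[0] * c[1], c[0]))


coeffs = fit_performance_model(hardware, throughput)
candidates = [(n, m) for n in (3, 6, 9, 12) for m in (8, 16, 32)]
choice = cheapest_config(coeffs, candidates, required_ops=15000.0)
print(choice)
```

The user's requirement enters only as `required_ops`. A real model would be fitted to measured benchmark runs and would likely need more parameters or nonlinear terms than this linear sketch assumes.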