Journal of Frontiers of Computer Science and Technology, 2010, Vol. 4, Issue (2): 180-190. DOI: 10.3778/j.issn.1673-9418.2010.02.010

• Academic Research •

DMGrid: A Data Mining System Based on Grid Computing

WANG Yi+, XU Liutong, YANG Shengqi   

  1. Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Online: 2010-02-15 Published: 2010-02-15
  • Contact: WANG Yi

Abstract: Data mining tasks are often time-consuming because they must process large-scale datasets. Grid computing integrates distributed, heterogeneous, and idle computers on the Internet into a high-performance service system, so it can supply the computing capacity needed to shorten task durations. This paper presents DMGrid, a grid system for data mining applications that has been designed and implemented. DMGrid treats efficient parallel computing as a crucial concern and also supports dynamic configuration of grid resources. Unlike many existing data mining grids, DMGrid provides an engine that executes the algorithm flow (workflow) specified in an application, and it also offers monitoring of application execution. Finally, experiments with two purpose-built applications, customer churn analysis and customer value analysis, validate the feasibility of DMGrid.

Key words: grid computing, data mining, dynamic configuration, data flow, execution monitoring
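
The abstract names three mechanisms: parallel execution, a flow engine, and execution monitoring. The following is a minimal illustrative sketch of how these could fit together, not DMGrid's actual design. All names here (`run_flow`, `drop_missing`, `square`, the `monitor` callback) are hypothetical. Local threads stand in for grid nodes, and the `workers` argument stands in for a configurable resource count.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: a flow step is a name plus a function applied to one partition.
Step = Tuple[str, Callable[[list], list]]


def run_flow(
    steps: Sequence[Step],
    partitions: Sequence[list],
    workers: int = 4,
    monitor: Callable[[str], None] = print,
) -> List[list]:
    """Run the steps in order; within each step, process partitions in parallel."""
    current = [list(p) for p in partitions]
    for name, func in steps:
        monitor(f"{name}: started on {len(current)} partitions")
        # Assumption: a thread pool plays the role of the grid's worker nodes.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # pool.map returns results in the same order as the input partitions.
            current = list(pool.map(func, current))
        monitor(f"{name}: finished")
    return current


def drop_missing(rows: list) -> list:
    """Example cleaning step: discard missing values."""
    return [r for r in rows if r is not None]


def square(rows: list) -> list:
    """Example transform step: square every value."""
    return [r * r for r in rows]


if __name__ == "__main__":
    events: List[str] = []
    flow = [("clean", drop_missing), ("transform", square)]
    result = run_flow(flow, [[1, None, 2], [3, None]], workers=2, monitor=events.append)
    print(result)
    print(events)
```

In this sketch each step finishes on every partition before the next step starts, which keeps the monitoring log simple. A real grid engine would also need scheduling, data transfer, and failure handling.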
