Journal of Frontiers of Computer Science and Technology ›› 2010, Vol. 4 ›› Issue (8): 673-682.DOI: 10.3778/j.issn.1673-9418.2010.08.001

• 学术研究 • Previous Articles     Next Articles

Programming the VFDT Algorithm in Data Stream Manage-ment System*

YUAN Lei1, ZHANG Yang2+, LI Mei1, LI Xue3, WANG Yong4   

  1. 1. College of Mechanical and Electronic Engineering, Northwest A&F University, Yangling, Shaanxi 712100, China
    2. College of Information Engineering, Northwest A&F University, Yangling, Shaanxi 712100, China
    3. School of Information Technology and Electrical Engineering, University of Queensland, Brisbane 4072, Australia
    4. School of Computer, Northwestern Polytechnical University, Xi’an 710072, China
  • Received:1900-01-01 Revised:1900-01-01 Online:2010-08-10 Published:2010-08-10
  • Contact: ZHANG Yang

在数据流管理系统中实现快速决策树算法*

袁 磊1, 张 阳2+, 李 梅1, 李 雪3, 王 勇4   

  1. 1. 西北农林科技大学 机械与电子工程学院, 陕西 杨凌 712100
    2. 西北农林科技大学 信息工程学院, 陕西 杨凌 712100
    3. 昆士兰大学 信息技术与电子工程系, 布里斯班 4072, 澳大利亚
    4. 西北工业大学 计算机学院, 西安 710072
  • 通讯作者: 张 阳

Abstract: Integrating data stream mining algorithm with data stream management system (DSMS) is a novel challenge for data mining and database researchers. But the integration of very fast decision tree (VFDT) with data stream management has not been reported till now. This paper focuses on integrating VFDT algorithm with Esper by exploiting capabilities of data stream management system (DSMS). How to transform the algorithm into efficient Esper query language (EQL) is analyzed, and two implementations for integrating the popular VFDT algorithm with DSMS are proposed: Transforming the VFDT algorithm into EQL straightforwardly (denoted by DVFDT); an opti-mized version of DVFDT based on the inherent batch mode of Esper (denoted by optimal-DVFDT). The proposed implementations with VFDT based on Java (denoted by JVFDT) in terms of classification accuracy and performance are compared. Experiments on a set of large volume of synthetic data show the implementation works efficiently and accurately. In addition, this approach also has better performance for the sub-streams of the original data stream.

Key words: data stream management system (DSMS), very fast decision tree (VFDT) algorithm, integration, classification

摘要: 在数据流管理系统(data stream management system, DSMS)中嵌入数据挖掘算法对数据库研究者是一项新的挑战, 而在数据流管理系统中嵌入快速决策树(very fast decision tree, VFDT), 尚未见报道。利用DSMS原有的机制在Esper中实现了VFDT算法。其主要思想是将VFDT算法转换为Esper的数据查询语言(Esper query language, EQL)。给出了在DSMS中实现VFDT算法的两种方法:普通方法。直接将VFDT算法转化为EQL语言并在DSMS中实现(记作DVFDT); 改进方法。通过Esper中固有的批量处理模式来实现(记作optimal-DVFDT)。通过一系列实验比较分析了两种方法对海量数据流分类的准确率和性能; 将提出的两种方法与用Java实现的VFDT算法(记作JVFDT)在分类精度和时间上进行比较。结果表明, 在DSMS中实现的VFDT算法具有较好的性能, 并且该算法对大规模数据流数据的子集同样具有较高的性能。

关键词: 数据管理系统, VFDT算法, 嵌入, 分类

CLC Number: