Journal of Frontiers of Computer Science and Technology (计算机科学与探索) ›› 2016, Vol. 10 ›› Issue (12): 1683-1692. DOI: 10.3778/j.issn.1673-9418.1608039

• Database Technology •

Research on Classification Algorithm and Concept Drift Based on Big Data

LU Lili1+, ZHANG Yongpan2, TAN Haiyu2, JI Yimu2

  1. Institute of Computer & Software, Nanjing College of Information Technology, Nanjing 210023, China
  2. School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
  • Online: 2016-12-01 Published: 2016-12-07

Abstract: As research on big data applications deepens and stream computing frameworks for distributed machine learning emerge, concept drift in data streams has become one of the active research topics in big data mining. Existing work on concept drift relies mainly on data structure and algorithm optimization, and performs drift detection on a single computer with limited computing resources. To address this, the paper proposes S-CVFDT (Storm-concept very fast decision tree), a Storm-based classification mining algorithm for big data that resists concept drift, together with a system built on it. The system combines a parallel window mechanism with the S-CVFDT algorithm: the parallel windows detect abrupt (sudden) concept drift in the data stream and adapt their size accordingly, while the S-CVFDT algorithm continually updates the model under gradual concept drift. Analysis and experimental results show that the algorithm detects abrupt concept drift quickly and effectively, reduces the system resources wasted because of abrupt concept drift, and improves both model-building efficiency and classification accuracy.

Key words: big data, data mining, classification algorithm, concept drift
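
The abstract does not spell out how the window mechanism decides that an abrupt drift has occurred, so the following is only a minimal single-process sketch of the general idea, not the S-CVFDT algorithm itself. It assumes the classifier's 0/1 prediction errors are fed in one at a time, compares the mean error of consecutive windows against a threshold built from the Hoeffding bound (the bound that VFDT-style trees use), and halves or doubles the window size depending on the outcome. The class name `AdaptiveWindowDetector`, its parameters, and the halve/double policy are all hypothetical choices made for illustration.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound: with probability 1 - delta, the mean of n samples
    lies within this distance of the true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


class AdaptiveWindowDetector:
    """Hypothetical stand-in for one detection window (assumption, not from the paper)."""

    def __init__(self, min_size: int = 50, max_size: int = 400, delta: float = 0.01):
        self.min_size = min_size
        self.max_size = max_size
        self.delta = delta
        self.size = min_size      # current window size
        self.buffer = []          # errors collected for the current window
        self.reference = None     # (mean error, sample count) of the previous window

    def add(self, error: int) -> bool:
        """Feed one 0/1 prediction error; return True when a drift is flagged."""
        self.buffer.append(error)
        if len(self.buffer) < self.size:
            return False

        n = len(self.buffer)
        mean = sum(self.buffer) / n
        self.buffer = []

        drift = False
        if self.reference is not None:
            ref_mean, ref_n = self.reference
            # Each window mean may deviate by its own bound, so sum the two bounds.
            threshold = (hoeffding_bound(1.0, self.delta, ref_n)
                         + hoeffding_bound(1.0, self.delta, n))
            drift = abs(mean - ref_mean) > threshold

        # Assumed policy: shrink the window after a drift, grow it while stable.
        if drift:
            self.size = max(self.min_size, self.size // 2)
        else:
            self.size = min(self.max_size, self.size * 2)
        self.reference = (mean, n)
        return drift


# Usage: 150 correct predictions followed by 200 wrong ones (an abrupt change).
detector = AdaptiveWindowDetector()
flags = [detector.add(e) for e in [0] * 150 + [1] * 200]
```

In a Storm deployment, several such windows would presumably run in parallel across bolts; this sketch leaves that part out entirely.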