Research on Classification Algorithm and Concept Drift Based on Big Data

Journal of Frontiers of Computer Science and Technology, 2016, Vol. 10, Issue (12): 1683-1692. DOI: 10.3778/j.issn.1673-9418.1608039

LU Lili1+, ZHANG Yongpan2, TAN Haiyu2, JI Yimu2

1. Institute of Computer & Software, Nanjing College of Information Technology, Nanjing 210023, China
2. School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China

Online: 2016-12-01 Published: 2016-12-07

Abstract

Abstract: As research on big data applications deepens and stream computing frameworks for distributed machine learning emerge, concept drift in data streams has become one of the active research topics in big data mining. Existing work on concept drift relies mainly on data structure and algorithm optimization, and performs drift detection on a single computer with limited computing resources. This paper therefore proposes S-CVFDT (Storm-concept very fast decision tree), a Storm-based classification mining algorithm for big data that resists concept drift, together with a system built on it. The system combines parallel windows with the S-CVFDT algorithm: the parallel window mechanism detects abrupt concept drift in the data stream and adaptively changes the parallel window size, while the S-CVFDT algorithm continuously updates the model under gradual concept drift. The analysis and experimental results show that the algorithm detects abrupt concept drift quickly and effectively, reduces the system resources wasted because of abrupt concept drift, and improves both model building efficiency and classification accuracy.

Key words: big data, data mining, classification algorithm, concept drift
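The abstract describes window-based detection of abrupt concept drift with an adaptively sized window. The following is a minimal single-process sketch of that general idea in Python, not the authors' S-CVFDT algorithm and not a Storm topology. All names here (`hoeffding_bound`, `AdaptiveWindowDriftDetector`, `probe`, `min_size`, `max_size`, `detect_drifts`) are hypothetical. Using the Hoeffding bound (the bound that VFDT-family trees rely on) as the drift threshold is an assumption made for illustration.

```python
import math
from collections import deque


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Hoeffding epsilon: with probability 1 - delta, the true mean of a
    variable with the given range is within epsilon of the mean of n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


class AdaptiveWindowDriftDetector:
    """Illustrative detector: compares the error rate of the newest `probe`
    outcomes against the older part of an adaptively sized window."""

    def __init__(self, probe=30, min_size=60, max_size=500, delta=0.01):
        self.probe = probe          # number of newest outcomes to test
        self.min_size = min_size    # window capacity right after a drift
        self.max_size = max_size    # window capacity while the stream is stable
        self.delta = delta          # confidence parameter for the bound
        self.size = min_size        # current window capacity
        self.window = deque()       # 1 = misclassified record, 0 = correct

    def add(self, error: int) -> bool:
        """Feed one prediction outcome; return True if drift is flagged."""
        self.window.append(error)
        while len(self.window) > self.size:
            self.window.popleft()
        if len(self.window) < 2 * self.probe:
            return False  # not enough data to compare two parts yet

        items = list(self.window)
        older, newest = items[:-self.probe], items[-self.probe:]
        gap = sum(newest) / len(newest) - sum(older) / len(older)

        if gap > hoeffding_bound(1.0, self.delta, self.probe):
            # Abrupt drift: discard stale outcomes and shrink the window.
            self.window = deque(newest)
            self.size = self.min_size
            return True

        # Stable stream: let the window grow so more history is retained.
        self.size = min(self.max_size, self.size + 1)
        return False


def detect_drifts(errors, detector):
    """Return the stream positions (0-based) at which drift was flagged."""
    return [i for i, e in enumerate(errors) if detector.add(e)]
```

For example, `detect_drifts([0] * 200 + [1] * 100, AdaptiveWindowDriftDetector())` feeds a stream whose error rate jumps from 0 to 1 at position 200. In this sketch the window shrinks after a detection so that a model retrained on it would see mostly post-drift data; a distributed version would have to coordinate the windows across workers, which is outside the scope of this example.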

LU Lili, ZHANG Yongpan, TAN Haiyu, JI Yimu. Research on Classification Algorithm and Concept Drift Based on Big Data[J]. Journal of Frontiers of Computer Science and Technology, 2016, 10(12): 1683-1692.

