Concept Drift Data Stream Classification Algorithm Based on McDiarmid Bound

doi:10.3778/j.issn.1673-9418.2006100

Abstract

Concept drift in data streams can significantly degrade the performance of existing classification models. Most current data stream classification algorithms that handle concept drift target only a single type of drift (such as abrupt, gradual, or recurring drift), which makes them hard to adapt to different scenarios. This paper therefore proposes a new data stream classification algorithm suited to multiple types of concept drift. The proposed algorithm stores the latest classification results in a two-layer window, assigns weights to them with a fuzzy membership function, and calculates the weighted error rate. The McDiarmid bound is then used to analyze the difference δ between the error rates of the current window and the past window, and concept drift is detected according to whether δ is significant. After a drift is detected, a semi-parametric log-likelihood algorithm checks whether the new concept is a recurrence of a past concept, which determines whether an old classifier is reused. Experimental results show that the proposed algorithm outperforms comparable existing algorithms in terms of average detection delay, false positive rate, classification accuracy, and running time.

Key words: concept drift, membership degree, double fuzzy window, McDiarmid bound, recurring concept
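
The detection step summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear membership function, the significance level `alpha`, and all function names are assumptions made for illustration, and the recurring-concept check (semi-parametric log-likelihood) is omitted. The sketch weights the 0/1 error indicators in a past and a current window, computes the difference δ of the weighted error rates, and compares it with a threshold derived from McDiarmid's inequality, P(f - E[f] ≥ ε) ≤ exp(-2ε² / Σ c_i²), where c_i is the normalized weight of the i-th result.

```python
import math


def linear_membership(n, floor=0.5):
    """Hypothetical membership function (an assumption, not from the paper):
    weights rise linearly from `floor` (oldest result) to 1.0 (newest)."""
    if n == 1:
        return [1.0]
    return [floor + (1.0 - floor) * i / (n - 1) for i in range(n)]


def weighted_error_rate(errors, weights):
    """Weighted mean of 0/1 error indicators (1 = misclassified)."""
    total = sum(weights)
    return sum(w * e for w, e in zip(weights, errors)) / total


def mcdiarmid_epsilon(weight_sets, alpha):
    """Deviation threshold from McDiarmid's inequality.

    Changing one indicator moves a normalized weighted mean by at most
    v_i = w_i / sum(w), so the bound exp(-2 eps^2 / sum(v_i^2)) equals
    alpha when eps = sqrt(0.5 * sum(v_i^2) * ln(1 / alpha)).
    """
    sum_sq = 0.0
    for weights in weight_sets:
        total = sum(weights)
        sum_sq += sum((w / total) ** 2 for w in weights)
    return math.sqrt(0.5 * sum_sq * math.log(1.0 / alpha))


def detect_drift(past_errors, current_errors, alpha=0.01):
    """Return (drift_flag, delta, eps) for two windows of 0/1 errors."""
    w_past = linear_membership(len(past_errors))
    w_cur = linear_membership(len(current_errors))
    delta = (weighted_error_rate(current_errors, w_cur)
             - weighted_error_rate(past_errors, w_past))
    # The difference depends on the indicators of both windows, so the
    # squared normalized weights of both windows enter the bound.
    eps = mcdiarmid_epsilon([w_past, w_cur], alpha)
    return delta >= eps, delta, eps


if __name__ == "__main__":
    stable = [0] * 100    # past window: no errors
    degraded = [1] * 100  # current window: every prediction wrong
    print(detect_drift(stable, degraded))
    print(detect_drift(stable, stable))
```

With uniform weights the threshold reduces to the familiar Hoeffding form sqrt(ln(1/alpha) / (2n)); non-uniform membership weights change only the sum of squared normalized weights.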

LIANG Bin, LI Guanghui. Concept Drift Data Stream Classification Algorithm Based on McDiarmid Bound[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(10): 1990-2001.
