EM Clustering Oversampling Algorithm for Class Imbalanced Data

doi:10.3778/j.issn.1673-9418.2104080

Abstract

Abstract: To address the poor classification performance caused by imbalanced datasets in classification tasks, an EM (expectation-maximization) clustering oversampling algorithm for class-imbalanced data is proposed. It increases the number of minority-class samples through oversampling and thereby tackles the data imbalance at its source. First, the algorithm uses clustering, measuring the similarity between samples by Euclidean distance, and selects the center point of each cluster as an oversampling point, which to some extent addresses the problem that generated samples are not sufficiently important. Second, by sampling directly in the minority-class sample space, it mitigates the problem that SMOTE, Cluster-SMOTE and similar methods do not target the clustering space. In addition, by generating new samples equal to 30% of the number of minority-class samples, it avoids two problems: clustering-based undersampling blindly pursues equal sample counts for the two classes, and SMOTE and similar algorithms specify no explicit sampling rate. Experiments on 24 public class-imbalanced datasets verify the effectiveness of the proposed method.

Key words: classification task, imbalanced dataset, class imbalance, oversampling, clustering
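The procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact EM formulation, so the sketch assumes a hard-assignment EM loop (nearest-center assignment by Euclidean distance, then mean update, which coincides with Lloyd-style k-means), and it assumes the number of clusters equals 30% of the minority-class size so that each cluster center becomes one new sample. The names `em_cluster_centers`, `oversample_minority`, `rate`, and the toy data are hypothetical.

```python
import math
import random


def em_cluster_centers(points, k, n_iter=50, seed=0):
    """Hard-assignment EM clustering (assumed stand-in); returns k centers."""
    rng = random.Random(seed)
    # Initialise the centers with k distinct samples drawn from the data.
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        # E-step: assign each sample to its nearest center (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # M-step: move each center to the mean of its assigned samples.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                new_centers.append(centers[j])  # empty cluster: keep old center
        if new_centers == centers:  # stop once the centers no longer move
            break
        centers = new_centers
    return centers


def oversample_minority(minority, rate=0.3, seed=0):
    """Append one synthetic sample per cluster center to the minority class."""
    # Assumption: the number of clusters is rate * |minority| (30% by default).
    k = min(len(minority), max(1, round(rate * len(minority))))
    centers = em_cluster_centers(minority, k, seed=seed)
    return list(minority) + centers


# Toy minority class: 10 two-dimensional samples (illustrative data only).
minority = [
    (1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3),
    (5.0, 5.0), (5.2, 4.8), (4.9, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.8, 0.9),
]
augmented = oversample_minority(minority)
print(len(minority), len(augmented))  # 10 13
```

Only minority-class samples enter the clustering step, which mirrors the abstract's point that sampling happens directly in the minority-class sample space. Because each new point is a cluster mean, it always lies inside the region already covered by minority samples.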

XIE Zipeng, BAO Chongming, ZHOU Lihua, WANG Chongyun, KONG Bing. EM Clustering Oversampling Algorithm for Class Imbalanced Data[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(1): 228-237.
