Ensemble Classification Algorithm for Imbalanced Data Combined with Local Area Density

doi:10.3778/j.issn.1673-9418.1901017

Abstract

Oversampling is an effective way to handle imbalanced classification problems. This paper addresses the degraded sampling quality of SMOTE-based (synthetic minority oversampling technique) oversampling methods when class overlap and small disjuncts occur in a dataset, and proposes MOLAD, an oversampling method based on local area density. Building on MOLAD, the paper further proposes LADBMOTE, which combines MOLAD with Bagging-based ensemble learning in the sampling stage to classify imbalanced datasets. LADBMOTE first computes the K nearest neighbors of each minority class sample according to MOLAD, then samples using all K nearest neighbors, which generates K balanced datasets. Finally, Bagging-based ensemble learning combines the classifiers trained on the K balanced datasets. LADBMOTE is compared with 7 currently popular algorithms for handling imbalanced data on 20 imbalanced datasets published on KEEL. The experimental results show that LADBMOTE achieves better and more robust classification performance across different classifiers.

Key words: imbalanced data, strategy for calculating nearest neighbors, ensemble learning, oversampling
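
The pipeline outlined in the abstract (K nearest neighbors per minority sample, one balanced dataset per neighbor, then an ensemble of the resulting classifiers) can be sketched roughly as follows. This is only an illustration under several assumptions, not the authors' implementation: the abstract does not specify MOLAD's density-based neighbor strategy, so plain Euclidean nearest neighbors stand in for it; the SMOTE-style interpolation toward the r-th neighbor, the 1-NN base learner, and the majority vote (with no bootstrap resampling) are likewise placeholders, and all function names are hypothetical.

```python
import math
import random
from collections import Counter


def nearest_neighbors(points, k):
    """For each point, return the indices of its k nearest other points.

    Assumption: plain Euclidean distance replaces MOLAD's density-based
    neighbor strategy, which the abstract does not describe.
    """
    result = []
    for i, p in enumerate(points):
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(p, points[j]))
        result.append(order[:k])
    return result


def balanced_datasets(minority, majority, k, rng):
    """Build k balanced datasets; dataset r interpolates toward the r-th neighbor."""
    neighbors = nearest_neighbors(minority, k)
    n_new = len(majority) - len(minority)  # synthetic samples needed for balance
    datasets = []
    for rank in range(k):
        synthetic = []
        for _ in range(n_new):
            i = rng.randrange(len(minority))
            a, b = minority[i], minority[neighbors[i][rank]]
            gap = rng.random()  # SMOTE-style point on the segment from a to b
            synthetic.append(tuple(x + gap * (y - x) for x, y in zip(a, b)))
        X = list(majority) + list(minority) + synthetic
        y = [0] * len(majority) + [1] * (len(minority) + len(synthetic))
        datasets.append((X, y))
    return datasets


def fit_1nn(X, y):
    """Placeholder base learner: a 1-nearest-neighbor classifier."""
    def predict(q):
        return y[min(range(len(X)), key=lambda j: math.dist(q, X[j]))]
    return predict


def ensemble_predict(models, q):
    """Combine the base learners by majority vote."""
    votes = Counter(m(q) for m in models)
    return votes.most_common(1)[0][0]


# Toy data: 40 majority samples (label 0) and 8 minority samples (label 1).
rng = random.Random(0)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
minority = [(rng.gauss(4, 0.5), rng.gauss(4, 0.5)) for _ in range(8)]

datasets = balanced_datasets(minority, majority, k=3, rng=rng)
models = [fit_1nn(X, y) for X, y in datasets]
print(ensemble_predict(models, (4.0, 4.0)))
```

The sketch requires more than k minority samples, and an odd k avoids tied votes in the two-class case. Swapping in a different base learner only means replacing `fit_1nn` with any function that returns a callable predictor.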

摘要：

传统的过采样方法是解决非平衡数据分类问题的有效方法之一。基于SMOTE的过采样方法在数据集出现类别重叠（class-overlapping）和小析取项（small-disjuncts）问题时将降低采样的效果，针对该问题提出了一种基于样本局部密度的过采样算法MOLAD。在此基础上，为了解决非平衡数据的分类问题，提出了一种在采样阶段将MOLAD算法和基于Bagging的集成学习结合的算法LADBMOTE。LADBMOTE首先根据MOLAD计算每个少数类样本的K近邻，然后选择所有的K近邻进行采样，生成K个平衡数据集，最后利用基于Bagging的集成学习方法将K个平衡数据集训练得到的分类器集成。在KEEL公开的20个非平衡数据集上，将提出的LADBMOTE算法与当前流行的7个处理非平衡数据的算法对比，实验结果表明LADBMOTE在不同的分类器上的分类性能更好，鲁棒性更强。

关键词: 非平衡数据, 近邻计算策略, 集成学习, 过采样

YANG Hao, CHEN Hongmei. Ensemble Classification Algorithm for Imbalanced Data Combined with Local Area Density[J]. Journal of Frontiers of Computer Science and Technology, 2020, 14(2): 274-284.

杨浩，陈红梅. 结合样本局部密度的非平衡数据集成分类算法[J]. 计算机科学与探索, 2020, 14(2): 274-284.

