Journal of Frontiers of Computer Science and Technology ›› 2014, Vol. 8 ›› Issue (6): 727-734. DOI: 10.3778/j.issn.1673-9418.1403003

• Artificial Intelligence and Pattern Recognition •

Improved SMOTE Algorithm for Imbalanced Datasets

WANG Chaoxue1, ZHANG Tao1+, MA Chunsen2

  1. School of Information and Control Engineering, Xi’an University of Architecture and Technology, Xi’an 710055, China
  2. Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing 100193, China
  • Online: 2014-06-01 Published: 2014-05-30

Abstract: To address the shortcomings of SMOTE (synthetic minority over-sampling technique) in synthesizing new minority class samples, this paper proposes an improved SMOTE algorithm, GA-SMOTE. The key idea is to introduce the three basic operators of the genetic algorithm (GA) into SMOTE: the selection operator chooses minority class samples in a differentiated way, and the crossover and mutation operators control the quality of the synthesized samples. GA-SMOTE is combined with SVM (support vector machine) to handle classification on imbalanced datasets. Extensive experiments on UCI datasets show that GA-SMOTE performs well in the overall quality of the synthesized samples and effectively improves the classification performance of SVM on imbalanced datasets.

Key words: imbalanced dataset, classification, genetic operator, synthetic minority over-sampling technique (SMOTE)
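
For orientation, the baseline that GA-SMOTE modifies can be sketched as follows. This is a minimal illustration of standard SMOTE interpolation only, not the paper's GA-SMOTE: the abstract does not specify the fitness function or the details of the selection, crossover, and mutation operators, so none of them are reproduced here. The function names `k_nearest` and `smote`, the parameter defaults, and the uniform random choice of seed samples are assumptions made for this example.

```python
import math
import random


def k_nearest(idx, points, k):
    """Indices of the k points closest (Euclidean) to points[idx], excluding idx itself."""
    others = [j for j in range(len(points)) if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]


def smote(minority, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by baseline SMOTE interpolation.

    Assumption for this sketch: seed samples are drawn uniformly at random.
    Per the abstract, GA-SMOTE instead selects seeds with a selection operator.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = rng or random.Random()
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))            # seed sample
        j = rng.choice(k_nearest(i, minority, k))   # one of its k nearest minority neighbours
        gap = rng.random()                          # interpolation factor in [0, 1)
        seed, neighbour = minority[i], minority[j]
        # New point on the line segment between the seed and its neighbour.
        synthetic.append([a + gap * (b - a) for a, b in zip(seed, neighbour)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(minority, n_new=10, k=2, rng=random.Random(0))
```

Because every synthetic point lies on a segment between two existing minority samples, baseline SMOTE treats all seed samples alike and has no direct control over where new points fall; these are the two aspects that the abstract says GA-SMOTE addresses through its genetic operators.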