Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (5): 1135-1159. DOI: 10.3778/j.issn.1673-9418.2309079

• Frontiers · Survey •

Survey of Research on SMOTE Type Algorithms

WANG Xiaoxia, LI Leixiao, LIN Hao   

  1. College of Data Science and Application, Inner Mongolia University of Technology, Hohhot 010080, China
  2. Inner Mongolia Autonomous Region Software Service Engineering Technology Research Center Based on Big Data, Hohhot 010080, China
  3. College of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China
  • Online:2024-05-01 Published:2024-04-29

Abstract: The synthetic minority oversampling technique (SMOTE) has become one of the mainstream methods for handling unbalanced data because it deals effectively with minority-class samples, and many improved SMOTE variants have been proposed. Existing surveys, however, rarely consider the popular algorithm-level improvements. This paper therefore provides a more comprehensive analysis and summary of existing SMOTE-type algorithms. It first explains the basic principle of SMOTE in detail, then systematically reviews SMOTE-type algorithms at two levels, the data level and the algorithm level, and introduces new ideas that combine the two. Data-level improvements balance the data distribution by deleting or adding data during preprocessing. Algorithm-level improvements leave the data distribution unchanged and instead modify or create algorithms so that they pay more attention to minority-class samples. Comparing the two, data-level methods are less restricted in their application, while algorithm-level improvements generally yield more robust algorithms. To provide more complete foundational material for research on SMOTE-type algorithms, the paper finally lists commonly used datasets and evaluation metrics and outlines possible directions for future research, so as to better address the unbalanced data problem.

Key words: unbalanced data, synthetic minority oversampling technique (SMOTE), oversampling, supervised learning
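
As an illustration of the data-level idea summarized in the abstract, the following is a minimal sketch of SMOTE-style interpolation. It is not the authors' code: the function name `smote`, its parameters, and the toy data are hypothetical, and the sketch assumes numeric features, Euclidean distance, and base samples drawn at random rather than a fixed oversampling rate per sample.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic new points interpolated between minority-class neighbours.

    Hypothetical helper for illustration only; `minority` is a list of
    equal-length numeric feature vectors from the minority class.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a base minority sample (simplification: uniformly at random).
        i = rng.randrange(len(minority))
        x = minority[i]
        # Find its k nearest minority neighbours, excluding the sample itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        # Place the new point at a random position on the segment x -> neighbour.
        nn = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nn)])
    return synthetic


# Toy minority class with two features (made-up values).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_synthetic=6, k=2)
```

Because each synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already occupied by the minority class. This sketch only adds data; it does not touch the classifier, which is where the algorithm-level improvements described in the abstract would apply.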