Journal of Frontiers of Computer Science and Technology, 2019, Vol. 13, Issue (2): 342-349. DOI: 10.3778/j.issn.1673-9418.1804041

• Theory and Algorithm •

Classification Algorithm Based on Hybrid Sampling for Unbalanced Data

WU Yifan(1,2), LIANG Jiye(1,2,+), WANG Junhong(1,2)

  1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China
  2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006, China
  • Online: 2019-02-01 Published: 2019-01-25

Abstract: Over-sampling and under-sampling are common methods for classifying unbalanced datasets. However, using a single sampling algorithm may cause over-fitting on the minority class samples or discard samples that carry important information. This paper proposes SVM_HS (hybrid sampling algorithm based on support vector machine), a hybrid sampling algorithm based on the classification hyperplane, which aims to overcome the tendency of the SVM classification hyperplane to shift toward the minority class samples when the data are unbalanced. The algorithm first uses SVM to obtain a classification hyperplane and then performs hybrid sampling iteratively: (1) it deletes some majority class samples that are far from the classification hyperplane; (2) it over-samples the minority class samples near the real class boundary with SMOTE (synthetic minority over-sampling technique), so that the classification hyperplane gradually moves toward the real class boundary. Experimental results show that, compared with other related algorithms, the proposed algorithm considerably improves both the F-value and the G-mean.

Key words: imbalance, support vector machine (SVM), synthetic minority over-sampling technique (SMOTE), classification hyperplane, hybrid sampling
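
The abstract only outlines the procedure, so the following is a minimal Python sketch of one way the iterative hybrid sampling could look, not the authors' implementation. Several things are assumptions made for illustration: the function names (`hybrid_sampling_round`, `svm_hs_sketch`, `f_value_and_g_mean`), the parameters `n_remove`, `n_new`, `seed_fraction` and `rounds`, the use of distance to the current hyperplane as a stand-in for "near the real class boundary", and the fixed number of rounds as a stopping rule. SVM training is left to a caller-supplied function `fit_decision`, which is assumed to return a signed score for a sample (for example, a wrapper around a trained SVM's decision function). The F-value and G-mean follow their usual definitions (harmonic mean of precision and recall, and the geometric mean of sensitivity and specificity), which are assumed to match the paper's.

```python
import math
import random


def k_nearest(x, pool, k):
    """Return the k points of pool closest to x (Euclidean), excluding x itself."""
    others = [p for p in pool if p is not x]
    return sorted(others, key=lambda p: math.dist(x, p))[:k]


def smote_sample(x, neighbors, rng):
    """SMOTE interpolation: a random point on the segment from x to a random neighbor."""
    if not neighbors:  # a lone minority sample can only be duplicated
        return list(x)
    nb = rng.choice(neighbors)
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(x, nb)]


def hybrid_sampling_round(majority, minority, decision, n_remove, n_new,
                          k=5, seed_fraction=0.5, rng=None):
    """One hybrid sampling round against the current hyperplane.

    decision(x) is assumed to return a signed score whose magnitude grows
    with the distance of x from the classification hyperplane.
    """
    rng = rng or random.Random(0)
    # (1) Under-sampling: drop the n_remove majority samples farthest from the hyperplane.
    n_keep = max(len(majority) - n_remove, 0)
    kept = sorted(majority, key=lambda x: abs(decision(x)))[:n_keep]
    # (2) Over-sampling: use the minority samples closest to the hyperplane as
    #     SMOTE seeds (assumption: a proxy for "near the real class boundary").
    n_seeds = max(1, int(len(minority) * seed_fraction))
    seeds = sorted(minority, key=lambda x: abs(decision(x)))[:n_seeds]
    synthetic = []
    for i in range(n_new):
        seed = seeds[i % len(seeds)]
        synthetic.append(smote_sample(seed, k_nearest(seed, minority, k), rng))
    return kept, minority + synthetic


def svm_hs_sketch(majority, minority, fit_decision, rounds=3, n_remove=1, n_new=1):
    """Retrain, then resample, for a fixed number of rounds (assumed stopping rule)."""
    rng = random.Random(0)
    for _ in range(rounds):
        decision = fit_decision(majority, minority)  # caller-supplied SVM training
        majority, minority = hybrid_sampling_round(
            majority, minority, decision, n_remove, n_new, rng=rng)
    return majority, minority


def f_value_and_g_mean(y_true, y_pred, positive=1):
    """F-value (beta = 1) and G-mean, treating `positive` as the minority class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f_value = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return f_value, math.sqrt(recall * specificity)
```

Samples are plain lists of floats, so the sketch needs only the standard library. In practice `fit_decision` would train an SVM on the current majority and minority sets in each round, and the amounts removed and generated per round would need tuning, since the abstract does not state them.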