计算机科学与探索 (Journal of Frontiers of Computer Science and Technology) ›› 2023, Vol. 17 ›› Issue (1): 228-237. DOI: 10.3778/j.issn.1673-9418.2104080

• Artificial Intelligence · Pattern Recognition •

类不平衡数据的EM聚类过采样算法

谢子鹏,包崇明,周丽华,王崇云,孔兵   

  1. 云南大学 信息学院,昆明 650504
  2. 云南大学 软件学院,昆明 650504
  3. 云南大学 生态学与环境学院,昆明 650504
  • 出版日期:2023-01-01 发布日期:2023-01-01

EM Clustering Oversampling Algorithm for Class Imbalanced Data

XIE Zipeng, BAO Chongming, ZHOU Lihua, WANG Chongyun, KONG Bing   

  1. School of Information, Yunnan University, Kunming 650504, China
  2. School of Software, Yunnan University, Kunming 650504, China
  3. School of Ecology and Environmental Science, Yunnan University, Kunming 650504, China
  • Online:2023-01-01 Published:2023-01-01

摘要: 针对分类任务中的不平衡数据集造成的分类性能低下的问题,提出了类不平衡数据的EM聚类过采样算法,通过过采样提高少数类样本数量,从根本上解决数据不平衡问题。首先,算法采用聚类技术,通过欧式距离衡量样本间的相似度,选取每个聚类簇的中心点作为过采样点,一定程度解决了样本的重要程度不够的问题;其次,通过直接在少数类样本空间上进行采样,可较好解决SMOTE、Cluster-SMOTE等方法对聚类空间没有针对性的问题;同时,通过对少数类样本数量的30%进行过采样,有效解决基于Cluster聚类的欠采样盲目追求两类样本数量平衡和SMOTE等算法没有明确采样率的问题。在公开的24个类不平衡数据集上进行了实验,验证了方法的有效性。

关键词: 分类任务, 不平衡数据集, 类不平衡, 过采样, 聚类

Abstract: To address the poor classification performance caused by imbalanced datasets in classification tasks, an EM (expectation-maximization) clustering oversampling algorithm for class-imbalanced data is proposed. It increases the number of minority-class samples through oversampling, addressing the data imbalance at its root. First, the algorithm applies clustering, measuring the similarity between samples by Euclidean distance, and selects the center point of each cluster as an oversampling point, which to some extent alleviates the problem that sampled points are not sufficiently important. Second, by sampling directly in the minority-class sample space, it mitigates the problem that SMOTE, Cluster-SMOTE, and similar methods do not target the clustering space. In addition, by oversampling 30% of the number of minority-class samples, it avoids two problems: cluster-based undersampling blindly pursues equal sample counts in the two classes, and SMOTE and similar algorithms do not specify a sampling rate. Experiments on 24 public class-imbalanced datasets verify the effectiveness of the proposed method.

Key words: classification task, imbalanced dataset, class imbalance, oversampling, clustering
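
The procedure outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the mixture model, initialization, or stopping rule, so the code below assumes an isotropic Gaussian mixture fitted by EM, random initialization from the minority samples, a fixed number of iterations, and a cluster count equal to 30% of the minority-class size. The function names `em_cluster_centers` and `oversample_minority` are hypothetical.

```python
import math
import random


def em_cluster_centers(points, k, n_iter=50, seed=0):
    """Fit an isotropic Gaussian mixture with EM; return the k component means.

    Assumption: isotropic components, so membership depends on the
    squared Euclidean distance between a sample and each mean.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    means = [list(p) for p in rng.sample(points, k)]
    weights = [1.0 / k] * k
    variances = [1.0] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in points:
            logp = []
            for j in range(k):
                d2 = sum((a - b) ** 2 for a, b in zip(x, means[j]))
                logp.append(math.log(weights[j])
                            - 0.5 * dim * math.log(variances[j])
                            - d2 / (2.0 * variances[j]))
            top = max(logp)
            p = [math.exp(v - top) for v in logp]  # shift for stability
            total = sum(p)
            resp.append([v / total for v in p])
        # M-step: re-estimate mean, variance, and weight per component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # empty component: keep its previous parameters
            means[j] = [sum(r[j] * x[d] for r, x in zip(resp, points)) / nj
                        for d in range(dim)]
            d2sum = sum(r[j] * sum((a - b) ** 2 for a, b in zip(x, means[j]))
                        for r, x in zip(resp, points))
            variances[j] = max(d2sum / (nj * dim), 1e-6)  # avoid collapse
            weights[j] = max(nj / len(points), 1e-12)
    return means


def oversample_minority(X, y, minority_label, rate=0.3, seed=0):
    """Append EM cluster centers of the minority class as new samples.

    Assumption: the number of clusters equals rate * (minority count),
    so each cluster center contributes exactly one synthetic sample.
    """
    if not 0.0 < rate <= 1.0:
        raise ValueError("rate must be in (0, 1]")
    minority = [x for x, label in zip(X, y) if label == minority_label]
    k = max(1, round(rate * len(minority)))
    centers = em_cluster_centers(minority, k, seed=seed)
    return X + centers, y + [minority_label] * k
```

With 10 minority samples and `rate=0.3`, the call `oversample_minority(X, y, minority_label=1)` appends three cluster centers labeled as the minority class and leaves the original samples unchanged. Because each center is a weighted average of minority samples, the new points stay inside the region spanned by the minority class, which reflects the abstract's point about sampling directly in the minority-class sample space.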