Journal of Frontiers of Computer Science and Technology (计算机科学与探索)

• Academic Research •

A synthetic oversampling method based on fast clustering algorithm for density peaks

LENG Qiangkui, LI Zihan

  1. School of Electronic and Information Engineering, Liaoning Technical University, Huludao, Liaoning 125105

Abstract: Class imbalance, a major challenge in classification, arises when majority-class samples greatly outnumber minority-class samples in the training set. The imbalance weakens the generalization ability of the classifier and can sharply reduce recognition accuracy on the minority class. Oversampling techniques, especially SMOTE (synthetic minority oversampling technique) and its variants, mitigate the problem by generating additional minority samples to balance the dataset. However, these methods may introduce noise, produce samples with limited diversity, and pay insufficient attention to boundary regions. Because boundary samples play a key role in classification decisions and are easily misclassified, this paper proposes a new oversampling strategy that identifies boundary samples and generates high-quality new samples around them. First, the CFSFDP (clustering by fast search and find of density peaks) algorithm is used to compute the local density of each minority sample, and the samples located at the classification boundary are screened out on that basis. Next, the Euclidean distance between each boundary sample and its nearest majority-class sample is used to define a spherical region around the boundary sample; the region covers the sample's potential distribution range while avoiding excessive overlap with the majority class. New synthetic samples are then generated at random within each region. This step increases the diversity of the minority samples and brings the generated samples closer to the true boundary distribution, which helps the classifier learn the complex characteristics of the minority class. To verify its effectiveness, the method is compared with nine existing oversampling methods on 32 real-world imbalanced datasets. Experimental results show that the proposed method performs well on multiple evaluation metrics.

Key words: imbalanced data, CFSFDP clustering algorithm, synthetic oversampling, boundary samples
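
The three steps described in the abstract (local density, boundary selection, sampling inside a sphere) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not give the exact boundary criterion or radius rule, so the sketch makes assumptions: local density uses the Gaussian-kernel variant of the density-peak estimate with a cutoff `dc`, minority samples whose density falls at or below a quantile (`density_quantile`) are treated as boundary samples, and each sphere's radius is `radius_scale` times the distance to the nearest majority sample. All function and parameter names are hypothetical.

```python
import numpy as np


def local_density(X_min, dc):
    """Gaussian-kernel local density of each minority sample (assumed variant)."""
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Subtract 1 to drop each point's own contribution (exp(0) = 1).
    return np.exp(-(dist / dc) ** 2).sum(axis=1) - 1.0


def oversample_boundary(X_min, X_maj, n_new, dc=1.0, density_quantile=0.5,
                        radius_scale=0.5, seed=0):
    """Generate n_new synthetic minority samples around assumed boundary samples."""
    rng = np.random.default_rng(seed)
    rho = local_density(X_min, dc)

    # Assumption: low-density minority samples stand in for boundary samples.
    boundary = X_min[rho <= np.quantile(rho, density_quantile)]

    # Euclidean distance from each boundary sample to its nearest majority sample.
    diff = boundary[:, None, :] - X_maj[None, :, :]
    d_maj = np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)
    radii = radius_scale * d_maj  # assumed radius rule

    # Pick a boundary sample (sphere centre) for every new point.
    idx = rng.integers(0, len(boundary), size=n_new)
    dim = X_min.shape[1]

    # Uniform sampling inside a ball: random direction, radius scaled by u^(1/dim).
    direction = rng.normal(size=(n_new, dim))
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    r = radii[idx] * rng.random(n_new) ** (1.0 / dim)
    return boundary[idx] + direction * r[:, None]
```

With `radius_scale` below 1, every synthetic point lies closer to its centre (a minority sample) than to any majority sample, which is one simple way to limit overlap with the majority class. The values of `dc` and `density_quantile` would need tuning per dataset.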