A synthetic oversampling method based on fast clustering algorithm for density peaks

doi:10.3778/j.issn.1673-9418.2411043

Abstract

Abstract: Class imbalance, a major challenge in classification, arises when majority-class samples substantially outnumber minority-class samples in the training set. The imbalance weakens the generalization ability of a classifier and can sharply reduce its recognition accuracy on minority samples. Oversampling techniques, especially SMOTE (Synthetic Minority Over-sampling Technique) and its variants, mitigate the problem by generating additional minority samples to balance the dataset. However, these methods may introduce noise, produce samples with limited diversity, and pay insufficient attention to boundary regions. Because boundary samples play a key role in classification decisions and are easily misclassified, this paper proposes an oversampling strategy that identifies boundary samples and generates high-quality new samples around them. The method first applies the CFSFDP (clustering by fast search and find of density peaks) algorithm, which identifies local density peaks, to compute the local density of each minority sample and thereby select the samples that lie on the classification boundary. It then computes the Euclidean distance between each boundary sample and its nearest majority-class sample and uses that distance to define a spherical region around the boundary sample; the region covers the potential distribution range of the boundary sample while avoiding excessive overlap with majority-class samples. New synthetic samples are then generated at random within each region. This step increases the diversity of minority samples and brings the generated samples closer to the true boundary distribution, which helps the classifier learn the complex characteristics of the minority class. To verify its effectiveness, the method is compared with nine existing oversampling methods on 32 real-world imbalanced datasets. The experimental results show that the proposed method performs well on multiple evaluation metrics.

Key words: imbalanced data, CFSFDP clustering algorithm, synthetic oversampling, boundary samples
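The pipeline described in the abstract (local density, boundary selection, spherical region, random generation) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, since the abstract does not give the exact rules. The following are assumptions made only for illustration: the Gaussian-kernel form of the CFSFDP local density, a cutoff distance taken as a low percentile of pairwise distances (`dc_percentile`), treating the lowest-density minority samples as boundary samples (`boundary_frac`), and a radius equal to a fraction (`shrink`) of the distance to the nearest majority sample. The function names `local_density` and `oversample` are hypothetical.

```python
import numpy as np


def local_density(X_min, dc_percentile=2.0):
    """Gaussian-kernel local density in the style of CFSFDP, among minority samples."""
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Assumption: cutoff distance d_c is a low percentile of the pairwise distances.
    upper = np.triu_indices(len(X_min), k=1)
    dc = max(np.percentile(dist[upper], dc_percentile), 1e-12)
    # Sum of Gaussian kernels; subtract 1 to drop each sample's own contribution.
    return np.exp(-(dist / dc) ** 2).sum(axis=1) - 1.0


def oversample(X_min, X_maj, n_new, boundary_frac=0.3, shrink=0.5, seed=0):
    """Generate n_new synthetic minority samples inside balls around boundary samples."""
    rng = np.random.default_rng(seed)
    rho = local_density(X_min)

    # Assumption: the lowest-density minority samples are taken as boundary samples.
    n_boundary = max(1, int(np.ceil(boundary_frac * len(X_min))))
    boundary_idx = np.argsort(rho)[:n_boundary]
    B = X_min[boundary_idx]

    # Euclidean distance from each boundary sample to its nearest majority sample.
    d_maj = np.linalg.norm(B[:, None, :] - X_maj[None, :, :], axis=-1).min(axis=1)
    # Assumption: the ball radius is a fixed fraction of that distance.
    radius = shrink * d_maj

    # Choose a boundary sample per synthetic point, then sample uniformly in its ball:
    # a random unit direction scaled by radius * u**(1/dim).
    pick = rng.integers(0, n_boundary, size=n_new)
    dim = X_min.shape[1]
    direction = rng.normal(size=(n_new, dim))
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    r = radius[pick] * rng.random(n_new) ** (1.0 / dim)

    X_new = B[pick] + direction * r[:, None]
    return X_new, boundary_idx[pick]
```

Calling `oversample(X_min, X_maj, n_new=20)` returns the synthetic samples together with the index (into `X_min`) of the boundary sample each one was generated around. With `shrink` below 1, every synthetic sample stays closer to its boundary sample than the nearest majority sample is, which is one simple way to limit overlap with the majority class.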

LENG Qiangkui, LI Zihan. A synthetic oversampling method based on fast clustering algorithm for density peaks[J]. Journal of Frontiers of Computer Science and Technology, DOI: 10.3778/j.issn.1673-9418.2411043.
