计算机科学与探索

• 学术研究 •

信用风险不平衡数据的表格生成对抗网络优化与分类

王轶群,王笑,高燕程   

  1. 甘肃政法大学 人工智能学院,兰州 730000

Table Generation Adversarial Network Optimization and Classification of Credit Risk Imbalance Data

WANG Yiqun,  WANG Xiao,  GAO Yancheng   

  1. School of Artificial Intelligence, Gansu University of Political Science and Law, Lanzhou 730000, China

摘要: 人工智能在信用风险评估中能有效识别风险并提升决策效率,然而,现有信用风险数据普遍存在类别不平衡问题,导致模型在预测时偏向多数类,影响评估的准确性和可靠性。针对数据不平衡问题,提出一种融合变分自编码器(VAE)和条件表格生成对抗网络(CTGAN)的混合生成模型(VCTGAN),用于合成高质量平衡数据集。首先,通过VAE中的隐变量学习真实数据的关键特征和潜在分布,生成结构化隐变量作为原始CTGAN的输入;然后,在数据生成器中引入自注意力机制用于更好地捕捉不平衡数据的突出特征;最后,在判别器中加入对比损失模块来增强生成数据的类别间差异,达到提高生成数据质量的目的。通过在Taiwan Credit和Give Me Some Credit两个基准数据集上的系统实验验证,分别取得了89.91%和96.89%的最佳分类准确率,结果表明这种改进方法在处理信用数据不平衡方面明显优于传统方法。消融实验进一步验证了各组件对性能的贡献,证实了所提方法的合理性和有效性。它不仅生成高质量的平衡数据集,而且提高模型识别少数类别的能力,为解决金融领域的数据不平衡问题提供了新的技术方案。

关键词: CTGAN, 生成模型, 不平衡数据集, 机器学习, 信用风险评估

Abstract: Artificial intelligence can effectively identify risks and improve decision-making efficiency in credit risk assessment. However, existing credit risk data commonly suffer from class imbalance, which biases models toward the majority class at prediction time and undermines the accuracy and reliability of the assessment. To address this imbalance, a hybrid generative model (VCTGAN) that combines a variational autoencoder (VAE) with a conditional tabular generative adversarial network (CTGAN) is proposed for synthesizing high-quality balanced datasets. First, the latent variables of the VAE learn the key features and latent distribution of the real data, producing structured latent variables that serve as the input to the original CTGAN. Second, a self-attention mechanism is introduced into the generator to better capture the salient features of the imbalanced data. Finally, a contrastive loss module is added to the discriminator to increase the inter-class separation of the generated data, thereby improving its quality. Systematic experiments on two benchmark datasets, Taiwan Credit and Give Me Some Credit, yield best classification accuracies of 89.91% and 96.89%, respectively, showing that the improved method clearly outperforms traditional methods in handling imbalanced credit data. Ablation experiments further verify the contribution of each component and confirm the soundness and effectiveness of the proposed method. The method not only generates high-quality balanced datasets but also improves the model's ability to identify the minority class, offering a new technical approach to data imbalance in the financial domain.

Keywords: CTGAN, generative model, imbalanced datasets, machine learning, credit risk assessment
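
As a rough illustration of the contrastive loss idea named in the abstract, the sketch below computes a margin-based pairwise contrastive loss in plain Python. The abstract does not give the exact loss used in VCTGAN, so the margin-based form, the function names (`euclidean`, `pairwise_contrastive_loss`), the `margin` parameter, and the toy feature vectors are all assumptions made for illustration, not the authors' implementation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_contrastive_loss(features, labels, margin=1.0):
    """Mean margin-based contrastive loss over all distinct sample pairs.

    Same-class pairs are penalised by their squared distance (pulled together).
    Different-class pairs are penalised only when closer than `margin`
    (pushed apart). This form is an assumption, not taken from the paper.
    """
    total, pairs = 0.0, 0
    n = len(features)
    for i in range(n):
        for j in range(i + 1, n):
            d = euclidean(features[i], features[j])
            if labels[i] == labels[j]:
                total += d ** 2
            else:
                total += max(0.0, margin - d) ** 2
            pairs += 1
    return total / pairs if pairs else 0.0


# Hypothetical 2-D features for two classes (0 = majority, 1 = minority).
labels = [0, 0, 1, 1]
separated = [[0.0, 0.0], [0.1, 0.0], [2.0, 2.0], [2.1, 2.0]]    # classes form tight clusters
overlapping = [[0.0, 0.0], [2.0, 2.0], [0.1, 0.0], [2.1, 2.0]]  # classes are interleaved

loss_separated = pairwise_contrastive_loss(separated, labels)
loss_overlapping = pairwise_contrastive_loss(overlapping, labels)
print(loss_separated < loss_overlapping)
```

In this toy setup, features that cluster by class give a much lower loss than features where the two classes overlap, which is the kind of inter-class separation such a term is meant to encourage.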