Journal of Frontiers of Computer Science and Technology (计算机科学与探索)

• Academic Research •

基于图潜向量分布学习的图过采样方法

任博, 董明刚, 于扬, 卢贤睿   

  1. 桂林理工大学 物理与电子信息工程学院,广西 桂林 541006
  2. 广西嵌入式技术与智能系统重点实验室,广西 桂林 541006
  3. 桂林理工大学 计算机科学与工程学院,广西 桂林 541006

Graph Oversampling Method Based on Graph Latent Representation Distribution Learning

REN Bo, DONG Minggang, YU Yang, LU Xianrui

  1. College of Physics and Electronic Information Engineering, Guilin University of Technology, Guilin, Guangxi 541006, China
  2. Guangxi Key Laboratory of Embedded Technology and Intelligent Systems, Guilin, Guangxi 541006, China
  3. College of Computer Science and Engineering, Guilin University of Technology, Guilin, Guangxi 541006, China

摘要: 现实世界中许多图数据存在类别分布不平衡的问题,其通常表现在节点、边和图三个级别。常用的基于过采样的图级不平衡处理方法,因样本缺乏多样性,会导致模型过拟合。针对该问题,本文提出一种图潜向量分布学习的图过采样方法GLRD-GAN。首先,提出一种图潜向量分布学习方法,利用预训练的图变分自编码器(VGAE)和全连接神经网络学习少数类图样本在低维空间内的潜向量分布,在该分布上随机采样潜向量信息并与原少数类潜向量融合,保证了少数类潜向量的多样性。其次,设计了一种基于双解码器的图样本生成器,经预训练的内积解码器和图卷积解码器充分利用采样的潜向量来分别生成图数据的拓扑结构和节点特征。最后,通过GAN判别器检测生成样本的真伪和类别,监督生成样本的有效性,实现多样性的少数类图样本生成。在5个具有代表性的长尾图数据集上进行了对比实验和可视化观察,结果表明本文提出的基于图潜向量分布学习的图过采样方法在Acc和F1值上较其他方法平均高出1%-4%,且能够生成有效的少数类图样本。

关键词: 长尾问题, 图变分自编码器, 图潜向量, 生成对抗网络

Abstract: Many real-world graph datasets suffer from class imbalance, which typically manifests at the node, edge, and graph levels. Common oversampling-based methods for graph-level imbalance tend to cause model overfitting because the samples they produce lack diversity. To address this issue, this paper proposes GLRD-GAN, a graph oversampling method based on graph latent representation distribution learning. First, a graph latent representation distribution learning method is introduced: a pre-trained variational graph auto-encoder (VGAE) and a fully connected neural network learn the distribution of the latent representations of minority-class graph samples in a low-dimensional space. Latent representations are then randomly sampled from this distribution and fused with the original minority-class latent representations, which ensures the diversity of the minority-class latent representations. Second, a dual-decoder graph generator is designed, in which pre-trained inner product and graph convolution decoders use the sampled latent representations to generate the topological structure and the node features of the graph data, respectively. Finally, a GAN discriminator judges both the authenticity and the class of the generated graphs, supervising the validity of the generated samples and thereby enabling the generation of diverse minority-class graph samples. Comparative experiments and visualization analyses were conducted on five representative long-tail graph datasets. The results show that the proposed method outperforms the other methods by 1%-4% on average in terms of Acc and F1 score, and that it can generate valid minority-class graph samples.

Key words: long-tail recognition, variational graph auto-encoder, graph latent representation, generative adversarial network
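
The latent sampling, fusion, and inner product decoding steps described in the abstract can be illustrated with a minimal sketch. This is not the GLRD-GAN implementation: the abstract does not give the fusion rule, so the convex combination in `fuse` and its weight `alpha` are assumptions, as are the use of per-node Gaussian latents, all function names, and the toy values. The graph convolution decoder and the GAN discriminator are omitted.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterized draw z = mu + sigma * eps, with eps ~ N(0, 1) per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def fuse(z_orig, z_sampled, alpha=0.5):
    """Assumed fusion rule (not from the paper): convex combination of two latents."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(z_orig, z_sampled)]


def inner_product_decode(Z):
    """Edge probabilities A_hat[i][j] = sigmoid(z_i . z_j), the usual VGAE decoder."""
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))
    return [[sigmoid(sum(a * b for a, b in zip(zi, zj))) for zj in Z] for zi in Z]


rng = random.Random(0)

# Made-up latent statistics for a 3-node minority-class graph (2-D latent space).
mu = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.9]]
log_var = [[-2.0, -2.0], [-2.0, -2.0], [-2.0, -2.0]]

# Sample around each node's latent mean, then fuse with the original latent.
Z_new = [fuse(m, sample_latent(m, lv, rng), alpha=0.5) for m, lv in zip(mu, log_var)]

# Decode the fused latents into a matrix of edge probabilities.
A_hat = inner_product_decode(Z_new)
```

In this sketch, a larger `alpha` keeps synthetic latents closer to the original minority-class latents, while a smaller one favors diversity; a full pipeline would still need a discriminator to filter out implausible samples, as the abstract describes.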