Journal of Frontiers of Computer Science and Technology

Generative Diffusion Model for Incomplete Multimodal Emotion Recognition

MA Fei,  WANG Yuting,  YANG Feixia,  XU Guangxian   

  1. School of Electronic and Information Engineering, Liaoning Technical University, Huludao, Liaoning 125105, China
  2. Liaoning Key Laboratory of Radio Frequency and Big Data for Intelligent Applications, Liaoning Technical University, Huludao, Liaoning 125105, China
  3. School of Electrical and Control Engineering, Liaoning Technical University, Huludao, Liaoning 125105, China

Abstract: Human multimodal emotion recognition perceives and understands human emotions from heterogeneous modalities such as text, vision, and audio. Compared with any single modality, the complementary information in multimodal data supports a more robust understanding of emotion. In practical multimodal scenarios, however, modality information is often incomplete or missing, which severely hinders the understanding of multimodal features and degrades recognition accuracy. Because previous multimodal emotion recognition methods fail to handle this degradation effectively, this paper proposes an incomplete multimodal emotion recognition method that incorporates a generative diffusion model, improving recognition accuracy by reconstructing the information of the incomplete modalities. First, a generative diffusion model based on cross-modal conditional stochastic differential equations is constructed: during the reverse diffusion process, the available modality information is converted, via a learnable projection, into dynamic constraints on the drift term, generating the features of the incomplete modalities. Second, a bidirectional collaborative optimization framework couples the incomplete-modality generation network with the fusion reconstruction module: a joint objective function lets the gradients of generation quality and feature fusion interact through backpropagation, and a hierarchical attention mechanism enforces emotional-semantic consistency between the completed modality features and the real features. Experimental results on several datasets show that the proposed method achieves superior emotion recognition performance across a variety of incomplete-modality scenarios.
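
To make the two mechanisms above concrete, the following is a hedged reading against the standard score-based diffusion (score-SDE) framework; the symbols φ, c, λ1, and λ2 are our own notation for illustration, not necessarily the paper's. Given forward dynamics dx = f(x,t)dt + g(t)dw, conditioning the reverse-time drift on the available modalities gives

\[
\mathrm{d}\mathbf{x}=\Big[f(\mathbf{x},t)-g(t)^{2}\,s_{\theta}\big(\mathbf{x},t,\phi(\mathbf{c})\big)\Big]\,\mathrm{d}t+g(t)\,\mathrm{d}\bar{\mathbf{w}},
\]

where c concatenates the features of the available modalities, φ is the learnable projection that injects them into the drift, and s_θ approximates the conditional score ∇_x log p_t(x | c). The bidirectional co-optimization can likewise be summarized as a weighted joint objective,

\[
\mathcal{L}=\mathcal{L}_{\mathrm{score}}+\lambda_{1}\,\mathcal{L}_{\mathrm{emo}}+\lambda_{2}\,\mathcal{L}_{\mathrm{sem}},
\]

where L_score is the denoising score-matching loss of the generation network, L_emo the emotion recognition loss on the fused features, and L_sem the hierarchical-attention consistency term between completed and real features; sharing the gradients of L is what realizes the backpropagation interaction between generation quality and feature fusion described above.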

Key words: multimodal emotion recognition, score network completion, fusion reconstruction
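
For a runnable reference point, the sketch below (Python/PyTorch) illustrates the same idea under stated assumptions: a score network conditioned on a learnable projection of the available modalities, and Euler-Maruyama integration of the reverse-time SDE to complete a missing modality's features. It is a minimal sketch, not the authors' implementation; all class names, shapes, and the noise schedule g(t) are assumptions.

```python
# Minimal, illustrative sketch (not the authors' released code) of:
# (1) a score network whose reverse-SDE drift is conditioned on a learnable
#     projection of the available modalities, and
# (2) sampling the missing modality's features by Euler-Maruyama integration.
# All names, shapes, and the noise schedule g(t) are assumptions.
import torch
import torch.nn as nn


class ConditionalScoreNet(nn.Module):
    """s_theta(x, t, c): estimates the score of p_t(x | available modalities)."""

    def __init__(self, feat_dim: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        # Learnable projection turning available-modality features into the
        # dynamic drift condition described in the abstract.
        self.proj = nn.Linear(cond_dim, hidden)
        self.net = nn.Sequential(
            nn.Linear(feat_dim + hidden + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, x, t, cond):
        c = self.proj(cond)                        # condition on what is observed
        tt = torch.full((x.size(0), 1), float(t))  # broadcast scalar time
        return self.net(torch.cat([x, c, tt], dim=-1))


@torch.no_grad()
def sample_missing_modality(score_net, cond, feat_dim, g, n_steps=100):
    """Integrate the reverse-time SDE dx = -g(t)^2 s_theta dt + g(t) dw_bar
    (drift f = 0, VE-style) from t = 1 down to 0 with Euler-Maruyama."""
    x = torch.randn(cond.size(0), feat_dim)        # start from pure noise
    dt = 1.0 / n_steps
    for i in range(n_steps, 0, -1):
        t = i * dt
        score = score_net(x, t, cond)
        x = x + (g(t) ** 2) * score * dt           # reverse drift step
        if i > 1:                                  # no noise on the final step
            x = x + g(t) * (dt ** 0.5) * torch.randn_like(x)
    return x


# Hypothetical usage: recover audio features from observed text + vision features.
net = ConditionalScoreNet(feat_dim=128, cond_dim=256)
available = torch.randn(8, 256)                    # fused text/vision features
g = lambda t: 0.5 + 4.5 * t                        # assumed noise schedule
recovered = sample_missing_modality(net, available, feat_dim=128, g=g)
print(recovered.shape)                             # torch.Size([8, 128])
```

In a full pipeline, the conditioning vector would come from encoders over the observed text, visual, and acoustic streams, and the completed features would be passed to the fusion reconstruction module (with its hierarchical-attention consistency constraint) rather than used directly for classification.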