Journal of Frontiers of Computer Science and Technology ›› 2025, Vol. 19 ›› Issue (12): 3340-3352. DOI: 10.3778/j.issn.1673-9418.2411078

• Artificial Intelligence · Pattern Recognition •

Multimodal Sentiment Analysis Based on Modality Alignment and Audio-Visual Polarity Vector Auxiliary

LI Zelong, LIU Chengkai, SHENG Chunlei, LU Shuhua   

  1. College of Information and Cyber Security, People's Public Security University of China, Beijing 100038, China
    2. Key Laboratory of Security Technology and Risk Assessment, Ministry of Public Security, Beijing 100026, China
  • Online: 2025-12-01    Published: 2025-12-01

Abstract: To address the insufficient fusion of features across the three modalities and the weak expression of sentiment polarity in audio and video, a multimodal sentiment analysis method based on modality alignment and an audio-video polarity vector auxiliary task (MA-PVA) is proposed. A modality alignment layer is designed, which applies a cross-modal attention mechanism to filter out sentiment information in the audio and video features that is irrelevant to the text, reducing differences in feature expression between modalities; the filtered results are then used to enhance the text modality so that the text, audio, and video features are fully fused. An audio-video polarity vector auxiliary task is introduced to strengthen the sentiment polarity of the audio and video modalities. These components interact with a pretrained language model, yielding richer text modality features and improving the final sentiment prediction. Extensive experiments are conducted on the publicly available benchmark datasets CMU-MOSI and CMU-MOSEI. Compared with the best baseline method, the proposed approach achieves binary classification accuracies of 88.1% and 89.9% on CMU-MOSI, improvements of 0.6 and 0.3 percentage points, and a seven-class classification accuracy of 52.2%, an improvement of 4.8 percentage points; on CMU-MOSEI, the binary classification accuracies are 85.9% and 87.5%, improvements of 1.2 and 0.4 percentage points, and the seven-class classification accuracy is 54.7%, an improvement of 0.2 percentage points. These results show that the proposed method outperforms many current state-of-the-art methods and effectively improves the accuracy of sentiment classification.
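
The abstract describes a modality alignment layer in which cross-modal attention lets the text modality query the audio and video features, and the attended result is used to enhance the text representation. The following is a minimal illustrative sketch of that idea in PyTorch, not the authors' released implementation; the module name ModalityAlignment, the feature dimension d_model, the head count, and the residual fusion scheme are assumptions made purely for illustration.

# Illustrative sketch only (assumptions, not the paper's code): text features
# act as queries over audio and video features via cross-modal attention, and
# the attended context is added back to enhance the text modality.
import torch
import torch.nn as nn

class ModalityAlignment(nn.Module):
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        # Text attends to audio and to video separately.
        self.text_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_to_video = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text, audio, video):
        # text/audio/video: (batch, seq_len, d_model) after unimodal encoders
        a_ctx, _ = self.text_to_audio(query=text, key=audio, value=audio)
        v_ctx, _ = self.text_to_video(query=text, key=video, value=video)
        # Attended audio/video context enhances the text representation.
        return self.norm(text + a_ctx + v_ctx)

if __name__ == "__main__":
    layer = ModalityAlignment()
    t = torch.randn(2, 50, 128)   # text token features
    a = torch.randn(2, 400, 128)  # audio frame features (projected)
    v = torch.randn(2, 60, 128)   # video frame features (projected)
    print(layer(t, a, v).shape)   # torch.Size([2, 50, 128])

In this sketch the output keeps the text sequence length, so it can be fed back to a pretrained language model or a downstream prediction head; how the filtering, polarity vector auxiliary task, and language-model interaction are actually realized is specified in the paper itself.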

Key words: multimodal sentiment analysis, pre-trained language model, Transformer model, cross-modal attention
