[1] HONG M, JUNG J J. Multi-sided recommendation based on social tensor factorization[J]. Information Sciences, 2018, 447: 140-156.
[2] PORIA S, HAZARIKA D, MAJUMDER N, et al. Beneath the tip of the iceberg: current challenges and new directions in sentiment analysis research[J]. IEEE Transactions on Affective Computing, 2023, 14(1): 108-132.
[3] BAGHER ZADEH A, LIANG P P, PORIA S, et al. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2018: 2236-2246.
[4] PANG B, LEE L. A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts[C]//Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics. Stroudsburg: ACL, 2004: 271-278.
[5] VINODHINI G, CHANDRASEKARAN R M. Sentiment analysis and opinion mining: a survey[J]. International Journal of Advanced Research in Computer Science and Software Engineering, 2012, 2(6): 282-292.
[6] ZADEH A, CHEN M H, PORIA S, et al. Tensor fusion network for multimodal sentiment analysis[C]//Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2017: 1103-1114.
[7] ZADEH A, LIANG P P, MAZUMDER N, et al. Memory fusion network for multi-view sequential learning[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2018, 32(1): 5634-5641.
[8] TSAI Y H, BAI S J, LIANG P P, et al. Multimodal transformer for unaligned multimodal language sequences[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2019: 6558-6569.
[9] WANG D, LIU S, WANG Q, et al. Cross-modal enhancement network for multimodal sentiment analysis[J]. IEEE Transactions on Multimedia, 2023, 25: 4909-4921.
[10] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: ACL, 2019: 4171-4186.
[11] YANG Z L, DAI Z H, YANG Y M, et al. XLNet: generalized autoregressive pretraining for language understanding[EB/OL]. [2024-09-13]. https://arxiv.org/abs/1906.08237.
[12] LIU Y H, OTT M, GOYAL N, et al. RoBERTa: a robustly optimized BERT pretraining approach[EB/OL]. [2024-09-13]. https://arxiv.org/abs/1907.11692.
[13] KE P, JI H Z, LIU S Y, et al. SentiLARE: sentiment-aware language representation learning with linguistic knowledge[C]//Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2020: 6975-6988.
[14] DEGOTTEX G, KANE J, DRUGMAN T, et al. COVAREP: a collaborative voice analysis repository for speech technologies[C]//Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2014: 960-964.
[15] ZHU Q, YEH M C, CHENG K T, et al. Fast human detection using a cascade of histograms of oriented gradients[C]//Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2006: 1491-1498.
[16] PENNINGTON J, SOCHER R, MANNING C. GloVe: global vectors for word representation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2014: 1532-1543.
[17] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[18] LIU Z, SHEN Y, LAKSHMINARASIMHAN V B, et al. Efficient low-rank multimodal fusion with modality-specific factors[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2018: 2247-2256.
[19] GHOSAL D, AKHTAR M S, CHAUHAN D, et al. Contextual inter-modal attention for multi-modal sentiment analysis[C]//Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2018: 3454-3466.
[20] SUN L C, LIAN Z, LIU B, et al. Efficient multimodal transformer with dual-level feature restoration for robust multimodal sentiment analysis[J]. IEEE Transactions on Affective Computing, 2024, 15(1): 309-325.
[21] HUAN R H, ZHONG G W, CHEN P, et al. UniMF: a unified multimodal framework for multimodal sentiment analysis in missing modalities and unaligned multimodal sequences[J]. IEEE Transactions on Multimedia, 2024, 26: 5753-5768.
[22] HAZARIKA D, ZIMMERMANN R, PORIA S. MISA: modality-invariant and -specific representations for multimodal sentiment analysis[C]//Proceedings of the 28th ACM International Conference on Multimedia. New York: ACM, 2020: 1122-1131.
[23] YU W M, XU H, YUAN Z Q, et al. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(12): 10790-10797.
[24] CARUANA R. Multitask learning[J]. Machine Learning, 1997, 28(1): 41-75.
[25] BAXTER J. A model of inductive bias learning[J]. Journal of Artificial Intelligence Research, 2000, 12: 149-198.
[26] THRUN S. Is learning the n-th thing any easier than learning the first?[C]//Proceedings of the 9th International Conference on Neural Information Processing Systems, 1995: 640-646.
[27] CARUANA R A. Multitask learning: a knowledge-based source of inductive bias[C]//Proceedings of the 10th International Conference on Machine Learning, 1993: 41-48.
[28] DUONG L, COHN T, BIRD S, et al. Low resource dependency parsing: cross-lingual parameter sharing in a neural network parser[C]//Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. Stroudsburg: ACL, 2015: 845-850.
[29] AKHTAR M S, CHAUHAN D S, GHOSAL D, et al. Multi-task learning for multi-modal emotion recognition and sentiment analysis[EB/OL]. [2024-09-14]. https://arxiv.org/abs/1905.05812.
[30] CHAUHAN D S, DHANUSH S R, EKBAL A, et al. Sentiment and emotion help sarcasm? A multi-task learning framework for multi-modal sarcasm, sentiment and emotion analysis[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2020: 4351-4360.
[31] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems 30, 2017: 5998-6008.
[32] MISRA D. Mish: a self regularized non-monotonic activation function[EB/OL]. [2024-09-14]. https://arxiv.org/abs/1908.08681.
[33] ZADEH A, ZELLERS R, PINCUS E, et al. MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos[EB/OL]. [2024-09-14]. https://arxiv.org/abs/1606.06259.
[34] ZADEH A, LIANG P P, PORIA S, et al. Multi-attention recurrent network for human communication comprehension[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2018, 32(1): 5642-5649.
[35] LIN H, ZHANG P L, LING J D, et al. PS-Mixer: a polar-vector and strength-vector mixer model for multimodal sentiment analysis[J]. Information Processing and Management, 2023, 60(2): 103229.
[36] TOLSTIKHIN I O, HOULSBY N, KOLESNIKOV A, et al. MLP-mixer: an all-MLP architecture for vision[C]//Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021: 24261-24272.
[37] RAHMAN W, HASAN M K, LEE S W, et al. Integrating multimodal information in large pretrained transformers[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2020: 2359-2369.
[38] LI Y Q, WENG W X, LIU C, et al. CSMF-SPC: multimodal sentiment analysis model with effective context semantic modality fusion and sentiment polarity correction[J]. Pattern Analysis and Applications, 2024, 27(3): 104.
[39] LUO Y Y, WU R, LIU J F, et al. Balanced sentimental information via multimodal interaction model[J]. Multimedia Systems, 2024, 30(1): 10.
[40] LIU Z J, CAI L, YANG W J, et al. Sentiment analysis based on text information enhancement and multimodal feature fusion[J]. Pattern Recognition, 2024, 156: 110847.
[41] PENG H, GU X, LI J, et al. Text-centric multimodal contrastive learning for sentiment analysis[J]. Electronics, 2024, 13(6): 1149.
[42] GAN C Q, TANG Y, FU X, et al. Video multimodal sentiment analysis using cross-modal feature translation and dynamical propagation[J]. Knowledge-Based Systems, 2024, 299: 111982.
[43] WANG Z J, JIANG N C, CHAO X Y, et al. Multi-task disagreement-reducing multimodal sentiment fusion network[J]. Image and Vision Computing, 2024, 149: 105158.
[44] ZHU L N, ZHAO H Y, ZHU Z C, et al. Multimodal sentiment analysis with unimodal label generation and modality decomposition[J]. Information Fusion, 2025, 116: 102787.
[45] FU Y, HUANG B, WEN Y J, et al. FDR-MSA: enhancing multimodal sentiment analysis through feature disentanglement and reconstruction[J]. Knowledge-Based Systems, 2024, 297: 111965.