Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (9): 2476-2486. DOI: 10.3778/j.issn.1673-9418.2307045

• Artificial Intelligence·Pattern Recognition •

Multimodal Sentiment Analysis Based on Cross-Modal Semantic Information Enhancement

LI Mengyun, ZHANG Jing, ZHANG Huanxiang, ZHANG Xiaolin, LIU Luyao   

  1. School of Information Engineering, Inner Mongolia University of Science and Technology, Baotou, Inner Mongolia 014010, China
    2. School of Science, Inner Mongolia University of Science and Technology, Baotou, Inner Mongolia 014010, China
    3. School of Innovation and Entrepreneurship Education, Inner Mongolia University of Science and Technology, Baotou, Inner Mongolia 014010, China
    4. School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
  • Online: 2024-09-01    Published: 2024-09-01

Abstract: With the growth of social networks, people express their emotions through multiple channels, including text, vision, and speech, i.e., multimodally. Previous multimodal sentiment analysis methods often fail to obtain effective multimodal sentiment feature representations and do not fully account for the impact of redundant information during multimodal feature fusion. To address these issues, a multimodal sentiment analysis model based on cross-modal semantic information enhancement is proposed. Firstly, the model uses a BiLSTM network to mine the contextual information within each unimodal sequence. Secondly, the interactions among modalities are modeled through a cross-modal information interaction mechanism, yielding six kinds of interaction features: text to speech, text to vision, speech to text, speech to vision, vision to text, and vision to speech. The interaction features sharing the same target modality are then concatenated to obtain information-enhanced unimodal feature vectors, which efficiently capture the shared and complementary deep semantic features between modalities. In addition, a multi-head self-attention mechanism separately computes the semantic correlations between the original unimodal feature vectors and the information-enhanced unimodal feature vectors, which strengthens the identification of key sentiment features and reduces the negative interference of redundant information on sentiment analysis. Experimental results on the public datasets CMU-MOSI (CMU multimodal opinion level sentiment intensity) and CMU-MOSEI (CMU multimodal opinion sentiment and emotion intensity) show that the proposed model both enhances sentiment feature representation and effectively reduces the interference of redundant information, outperforming related work in multimodal sentiment classification accuracy and generalization ability.
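
The following is a minimal PyTorch sketch, not the authors' released code, of the pipeline the abstract describes: per-modality BiLSTM encoders, six directed cross-modal interaction blocks grouped by target modality, and a multi-head self-attention step relating the original and enhanced unimodal vectors. The class and variable names, the feature dimensions, and the use of nn.MultiheadAttention for both the interaction and correlation steps are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossModalEnhancer(nn.Module):
    """Hypothetical sketch of cross-modal semantic information enhancement."""

    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        mods = ("text", "audio", "vision")
        # One BiLSTM per modality to mine intra-modal contextual information
        self.encoders = nn.ModuleDict({
            m: nn.LSTM(d_model, d_model // 2, batch_first=True, bidirectional=True)
            for m in mods
        })
        # One cross-modal attention block per ordered (source, target) pair: 6 in total
        self.cross_attn = nn.ModuleDict({
            f"{s}->{t}": nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for s in mods for t in mods if s != t
        })
        # Project the concatenated pair of interaction features back to d_model
        self.fuse = nn.ModuleDict({t: nn.Linear(2 * d_model, d_model) for t in mods})
        # Multi-head self-attention relating original and enhanced unimodal vectors
        self.corr_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, feats):
        # feats: dict mapping modality name -> (batch, seq_len, d_model) tensor
        ctx = {m: self.encoders[m](x)[0] for m, x in feats.items()}
        enhanced = {}
        for t in ctx:
            # Target modality t queries each source modality s (information flows s -> t)
            inter = [self.cross_attn[f"{s}->{t}"](ctx[t], ctx[s], ctx[s])[0]
                     for s in ctx if s != t]
            # Concatenate the two interaction features that share target modality t
            enhanced[t] = self.fuse[t](torch.cat(inter, dim=-1))
        out = {}
        for m in ctx:
            # Original features attend over the enhanced ones, so the attention
            # weights reflect their semantic correlation and damp redundant parts
            out[m], _ = self.corr_attn(ctx[m], enhanced[m], enhanced[m])
        return out
```

In such a design, each output in out would typically be pooled over the sequence dimension and fed, together with the other modalities, to a fusion layer and sentiment classifier; those downstream components are omitted here.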

Key words: multimodal sentiment analysis, information enhancement, information interaction, multi-head attention mechanism, feature fusion
