Review on Key Techniques of Video Multimodal Sentiment Analysis

doi:10.3778/j.issn.1673-9418.2404072

Abstract

Abstract: Sentiment analysis is the process of automatically determining an opinion holder's attitude or emotional tendency. It is widely used in business, social media analysis, and public opinion monitoring. In unimodal sentiment analysis, most researchers work with text, facial expressions, or audio information. With the development of deep learning, sentiment analysis has expanded from unimodal to multimodal settings. Combining multiple modalities can overcome the limitations of any single modality and capture the emotions people express more accurately and comprehensively. Building on three kinds of unimodal sentiment analysis, this paper reviews the key techniques of multimodal sentiment analysis. First, the background and research status of multimodal sentiment analysis are briefly introduced. Second, the commonly used datasets are summarized. Third, unimodal sentiment analysis based on text, facial expressions, and audio information is described. Fourth, the key techniques of video multimodal sentiment analysis, including multimodal fusion, alignment, and modal noise processing, are examined, with a detailed analysis of how these techniques relate to one another and how they are applied. Next, the performance metrics of different models on three commonly used datasets are analyzed, further validating the effectiveness of these key techniques. Finally, the existing challenges in multimodal sentiment analysis and future development trends are discussed.

Key words: sentiment analysis, multimodal, modal fusion, modal alignment, modal noise

DUAN Zongtao, HUANG Junchen, ZHU Xiaole. Review on Key Techniques of Video Multimodal Sentiment Analysis[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(3): 539-558.

段宗涛, 黄俊臣, 朱晓乐. 视频多模态情感分析关键技术研究综述[J]. 计算机科学与探索, 2025, 19(3): 539-558.
