Journal of Frontiers of Computer Science and Technology

• Science Researches •

Review on key techniques of video multimodal sentiment analysis

DUAN Zongtao,  HUANG Junchen,  ZHU Xiaole   

  1. School of Information Engineering, Chang’an University, Xi’an 710018, China

视频多模态情感分析关键技术研究综述

段宗涛, 黄俊臣, 朱晓乐   

  1. 长安大学 信息工程学院,西安 710018

Abstract: Sentiment analysis is the process of automatically determining an opinion holder's attitude or emotional tendency. It is widely used in business, social media analysis, and public opinion monitoring. In unimodal sentiment analysis, most researchers rely on text, facial expressions, and audio information. With the rapid development of deep learning, sentiment analysis has expanded from unimodal to multimodal settings. Combining multiple modalities can overcome the limitations of any single modality and capture the emotions people express more accurately and comprehensively. This paper reviews the key techniques of multimodal sentiment analysis, building on these three kinds of unimodal sentiment analysis. First, the background and current research status of multimodal sentiment analysis are briefly introduced. Second, the commonly used datasets are summarized. Third, unimodal sentiment analysis based on text, facial expressions, and audio information is briefly described. Fourth, the key techniques of video multimodal sentiment analysis, namely multimodal fusion, alignment, and modal noise processing, are reviewed, and the relationships among these techniques and their applications are analyzed in detail. In addition, the performance metrics of different models on three commonly used datasets are compared, further validating the effectiveness of these key techniques. Finally, the existing challenges and future development trends of multimodal sentiment analysis are discussed.

Key words: sentiment analysis, multimodal, modal fusion, modal alignment, modal noise

摘要: 情感分析是自动判定观点持有者所表现的态度或情绪倾向性的过程,其在商业、社交媒体分析和舆情监测等领域得到了广泛应用。在单一模态情感分析中,多数研究者使用文本、面部表情和音频信息来进行分析。然而,随着深度学习技术的快速发展,情感分析逐渐从单一模态扩展至多模态领域,综合多种模态,能够克服单一模态存在的局限性并更加准确和全面地理解人们所表达的情感。以三种单模态情感分析为基础对多模态情感分析中的关键技术进行了综述:首先简要介绍了多模态情感分析的背景和目前的研究现状;其次总结了常用的相关数据集;然后分别对基于文本、面部表情和音频信息的单模态情感分析进行了简要叙述;此外重点梳理了视频多模态情感分析中的关键技术包括多模态融合、对齐和模态噪声处理的技术,并对这些技术的关系与应用进行了详细分析;同时,对不同模型在三种常用数据集上的性能指标进行了分析,进一步验证了关键技术的有效性。最后,讨论了多模态情感分析现存问题和未来的发展趋势。

关键词: 情感分析, 多模态, 模态融合, 模态对齐, 模态噪声