Journal of Frontiers of Computer Science and Technology ›› 2023, Vol. 17 ›› Issue (1): 198-208. DOI: 10.3778/j.issn.1673-9418.2111004

• Artificial Intelligence · Pattern Recognition •

Multimodal Sentiment Analysis with Composite Hierarchical Fusion

WANG Xuyang, DONG Shuai, SHI Jie   

  1. School of Computer and Communication, Lanzhou University of Technology, Lanzhou 730050, China
  • Online: 2023-01-01   Published: 2023-01-01

Abstract: Traditional sentiment analysis methods cannot handle the sentiment expressed in short videos, and existing multimodal sentiment analysis methods suffer from low accuracy and poor interaction between different modalities. To address these problems, this paper studies multimodal sentiment analysis and builds a multimodal sentiment analysis model with composite hierarchical fusion, combining a temporal convolutional network (TCN) with a soft attention mechanism. The model first equalizes the dimensions of the text features, facial features and audio features extracted from the video, and then fuses the features of the different modalities in a composite manner: the unimodal features are first fused pairwise to obtain bimodal features, the three bimodal features are then fused to obtain a trimodal feature, and the trimodal feature is finally fused with each unimodal feature to obtain the final multimodal sentiment feature. The feature produced by each fusion step is passed through a TCN layer to extract sequence features, and the final multimodal feature is filtered by the soft attention mechanism before being used for sentiment classification to obtain the prediction result. Experiments on the CMU-MOSI (CMU multimodal opinion level sentiment intensity) and CMU-MOSEI (CMU multimodal opinion sentiment and emotion intensity) datasets show that the model makes full use of the interaction information between different modalities and effectively improves the accuracy of multimodal sentiment analysis.
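
The composite hierarchical fusion procedure described above can be illustrated with a minimal PyTorch sketch. The module names, the shared dimension d_model, and the concatenation-plus-linear fusion operator are assumptions made for illustration only and may differ from the authors' implementation; the per-step TCN layer and the soft attention filter are sketched separately after the key words below.

# Minimal sketch of the composite hierarchical fusion (assumed fusion operators).
import torch
import torch.nn as nn

class PairFusion(nn.Module):
    """Fuse two equal-dimension feature sequences by concatenation + projection."""
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, x, y):                       # x, y: (batch, seq, d_model)
        return torch.relu(self.proj(torch.cat([x, y], dim=-1)))

class CompositeHierarchicalFusion(nn.Module):
    def __init__(self, d_text, d_audio, d_video, d_model=128):
        super().__init__()
        # Step 1: dimension equalization of the three unimodal features.
        self.eq_t = nn.Linear(d_text, d_model)
        self.eq_a = nn.Linear(d_audio, d_model)
        self.eq_v = nn.Linear(d_video, d_model)
        # Step 2: pairwise (bimodal) fusion.
        self.f_ta = PairFusion(d_model)
        self.f_tv = PairFusion(d_model)
        self.f_av = PairFusion(d_model)
        # Step 3: fuse the three bimodal features into one trimodal feature.
        self.f_tri = nn.Linear(3 * d_model, d_model)
        # Step 4: fuse the trimodal feature with each unimodal feature.
        self.f_final = nn.Linear(4 * d_model, d_model)

    def forward(self, text, audio, video):         # inputs: (batch, seq, d_*)
        t, a, v = self.eq_t(text), self.eq_a(audio), self.eq_v(video)
        ta, tv, av = self.f_ta(t, a), self.f_tv(t, v), self.f_av(a, v)
        tri = torch.relu(self.f_tri(torch.cat([ta, tv, av], dim=-1)))
        # In the paper, each fusion result additionally passes through a TCN
        # layer (omitted here; see the TCN sketch below).
        return torch.relu(self.f_final(torch.cat([tri, t, a, v], dim=-1)))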

Key words: temporal convolutional networks, feature fusion, multimodal sentiment analysis, attention mechanism
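
The TCN layer and the soft attention filtering mentioned in the abstract can likewise be sketched as follows. The kernel size, dilation, dropout rate and number of sentiment classes are assumed values for illustration, not the authors' reported configuration.

# Minimal sketch of a TCN block and soft attention filtering (assumed hyperparameters).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TCNBlock(nn.Module):
    """One dilated causal convolution with a residual connection."""
    def __init__(self, channels, kernel_size=3, dilation=1, dropout=0.1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation    # left-only padding keeps causality
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                          # x: (batch, seq, channels)
        h = x.transpose(1, 2)                      # -> (batch, channels, seq)
        h = F.pad(h, (self.pad, 0))                # causal padding on the left
        h = self.drop(torch.relu(self.conv(h)))
        return x + h.transpose(1, 2)               # residual connection

class SoftAttentionClassifier(nn.Module):
    """Soft attention over time steps followed by a sentiment classifier."""
    def __init__(self, d_model, num_classes=2):
        super().__init__()
        self.score = nn.Linear(d_model, 1)
        self.cls = nn.Linear(d_model, num_classes)

    def forward(self, x):                          # x: (batch, seq, d_model)
        alpha = torch.softmax(self.score(x), dim=1)  # attention weights over time
        pooled = (alpha * x).sum(dim=1)              # weighted sum -> (batch, d_model)
        return self.cls(pooled)

# Usage sketch: fused multimodal features -> TCN -> soft attention -> prediction.
fused = torch.randn(8, 20, 128)                    # (batch, seq, d_model)
logits = SoftAttentionClassifier(128)(TCNBlock(128)(fused))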