Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (5): 1318-1327. DOI: 10.3778/j.issn.1673-9418.2311004

• Artificial Intelligence · Pattern Recognition •

Temporal Multimodal Sentiment Analysis with Composite Cross-Modal Interaction Network

YANG Li, ZHONG Junhong, ZHANG Yun, SONG Xinyu   

  1. School of Computer Science and Software Engineering, Southwest Petroleum University, Chengdu 610500, China
  • Online: 2024-05-01   Published: 2024-04-29


Abstract: To address the insufficient modal fusion and weak interactivity caused by semantic feature differences between modalities in multimodal sentiment analysis, this paper studies the latent correlations between modalities and builds a temporal multimodal sentiment analysis model based on a composite cross-modal interaction network (CCIN-SA). The model first uses a bidirectional gated recurrent unit and a multi-head attention mechanism to extract temporal features of the text, visual, and audio modalities that carry contextual semantic information. A cross-modal attention interaction layer is then designed to continuously reinforce the target modality with low-order signals from the auxiliary modalities, so that the target modality learns information from the auxiliary modalities and captures the latent adaptability between them. The enhanced features are next fed into a composite feature fusion layer, where condition vectors further capture the similarity between modalities, strengthen the correlation of important features, and mine deeper inter-modal interactions. Finally, a multi-head attention mechanism concatenates and fuses the composite cross-modal enhanced features with the low-order signals, increasing the weight of important intra-modal features and preserving the feature information unique to each initial modality; the resulting multimodal fused features are used for the final sentiment classification task. The model is evaluated on the CMU-MOSI and CMU-MOSEI datasets, and the results show that CCIN-SA improves on existing models in both accuracy and F1 score, indicating that it can effectively exploit the correlations between modalities and make more accurate sentiment judgments.

Key words: cross-modal interaction, attention mechanism, feature fusion, composite fusion layer, multimodal sentiment analysis
