计算机科学与探索 (Journal of Frontiers of Computer Science and Technology) ›› 2024, Vol. 18 ›› Issue (9): 2476-2486. DOI: 10.3778/j.issn.1673-9418.2307045

• Artificial Intelligence · Pattern Recognition •

Multimodal Sentiment Analysis Based on Cross-Modal Semantic Information Enhancement

LI Mengyun, ZHANG Jing, ZHANG Huanxiang, ZHANG Xiaolin, LIU Luyao   

1. School of Information Engineering, Inner Mongolia University of Science and Technology, Baotou, Inner Mongolia 014010, China
    2. School of Science, Inner Mongolia University of Science and Technology, Baotou, Inner Mongolia 014010, China
    3. School of Innovation and Entrepreneurship Education, Inner Mongolia University of Science and Technology, Baotou, Inner Mongolia 014010, China
    4. School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
• Online: 2024-09-01   Published: 2024-09-01

Abstract: With the development of social networks, people express their emotions in different ways, including text, vision and speech, i.e., through multiple modalities. Previous multimodal sentiment analysis methods often fail to obtain effective multimodal sentiment feature representations and do not fully consider the impact of redundant information during multimodal feature fusion. To address these issues, a multimodal sentiment analysis model based on cross-modal semantic information enhancement is proposed. Firstly, the model adopts a BiLSTM network to mine the contextual information within each modality. Secondly, the interactions among modalities are modeled through a cross-modal information interaction mechanism, yielding six directed interaction features: text-to-speech, text-to-vision, speech-to-text, speech-to-vision, vision-to-text and vision-to-speech. The interaction features sharing the same target modality are then concatenated to obtain information-enhanced unimodal feature vectors, which effectively capture the deep semantic features shared and complemented across modalities. In addition, a multi-head self-attention mechanism computes the semantic correlation between the original and the information-enhanced unimodal feature vectors of each modality, which improves the ability to identify key sentiment features and reduces the negative interference of redundant information on sentiment analysis. Experimental results on the public datasets CMU-MOSI (CMU multimodal opinion-level sentiment intensity) and CMU-MOSEI (CMU multimodal opinion sentiment and emotion intensity) show that the proposed model both enhances sentiment feature representations and effectively reduces the interference of redundant information, outperforming related work in multimodal sentiment classification accuracy and generalization ability.
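
A minimal PyTorch sketch of the pipeline described above is given below. It is not the authors' implementation: the module names (CrossModalEnhancer, CMSIE), the shared hidden dimension, the choice of nn.MultiheadAttention for both the directed cross-modal interaction and the original-versus-enhanced correlation step, the temporal average pooling, and the single regression output are all illustrative assumptions.

import torch
import torch.nn as nn


class CrossModalEnhancer(nn.Module):
    """One directed interaction: the target modality attends to a source modality."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # Queries come from the target modality, keys/values from the source,
        # producing one of the six directed interaction features.
        out, _ = self.attn(target, source, source)
        return out


class CMSIE(nn.Module):
    """Sketch of the described pipeline: BiLSTM context encoding -> six directed
    cross-modal interaction features -> concatenation per target modality ->
    multi-head attention between original and enhanced unimodal features ->
    fusion and sentiment prediction."""

    def __init__(self, d_text: int, d_audio: int, d_vision: int,
                 dim: int = 128, heads: int = 4):
        super().__init__()
        mods = ("t", "a", "v")
        # BiLSTM encoders capture intra-modal context; forward and backward
        # hidden states concatenate to the shared dimension `dim`.
        self.enc = nn.ModuleDict({
            m: nn.LSTM(d, dim // 2, batch_first=True, bidirectional=True)
            for m, d in (("t", d_text), ("a", d_audio), ("v", d_vision))
        })
        # One directed interaction module per (source, target) pair: 6 in total.
        self.cross = nn.ModuleDict({
            f"{s}2{t}": CrossModalEnhancer(dim, heads)
            for s in mods for t in mods if s != t
        })
        # Two interaction features share each target modality; project their
        # concatenation back to `dim` to obtain the enhanced unimodal vector.
        self.proj = nn.ModuleDict({m: nn.Linear(2 * dim, dim) for m in mods})
        # Multi-head attention relating original and enhanced features of the
        # same modality, intended to down-weight redundant information.
        self.corr = nn.ModuleDict({
            m: nn.MultiheadAttention(dim, heads, batch_first=True) for m in mods
        })
        self.head = nn.Linear(3 * dim, 1)  # sentiment intensity score

    def forward(self, text, audio, vision):
        feats = {}
        for m, x in (("t", text), ("a", audio), ("v", vision)):
            feats[m], _ = self.enc[m](x)                    # (B, L_m, dim)

        pooled = []
        for tgt in ("t", "a", "v"):
            # Concatenate the two interaction features whose target is `tgt`.
            inter = [self.cross[f"{src}2{tgt}"](feats[tgt], feats[src])
                     for src in ("t", "a", "v") if src != tgt]
            enhanced = self.proj[tgt](torch.cat(inter, dim=-1))
            # Original features attend over the information-enhanced ones.
            attended, _ = self.corr[tgt](feats[tgt], enhanced, enhanced)
            pooled.append(attended.mean(dim=1))             # temporal pooling
        return self.head(torch.cat(pooled, dim=-1))


# Illustrative feature sizes only; actual CMU-MOSI/MOSEI dimensions depend on
# the text/audio/vision feature extractors used.
model = CMSIE(d_text=768, d_audio=74, d_vision=47)
score = model(torch.randn(2, 50, 768), torch.randn(2, 50, 74), torch.randn(2, 50, 47))
print(score.shape)  # torch.Size([2, 1])

In this sketch the directed interaction source-to-target is realized as attention with the target modality as query and the source modality as keys and values, so the two interaction features that share a target modality have the same length and can be concatenated along the feature dimension.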

Key words: multimodal sentiment analysis, information enhancement, information interaction, multi-head attention mechanism, feature fusion