计算机科学与探索 (Journal of Frontiers of Computer Science and Technology) ›› 2024, Vol. 18 ›› Issue (11): 3041-3050. DOI: 10.3778/j.issn.1673-9418.2309071

• Artificial Intelligence · Pattern Recognition •

面向多模态情感分析的多通道时序卷积融合

孙杰,车文刚,高盛祥   

  1. 昆明理工大学 信息工程与自动化学院,昆明 650500
  • 出版日期:2024-11-01 发布日期:2024-10-31

Multi-channel Temporal Convolution Fusion for Multimodal Sentiment Analysis

SUN Jie, CHE Wengang, GAO Shengxiang   

  1. School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
  • Online: 2024-11-01 Published: 2024-10-31

摘要: 多模态情感分析已成为情感计算领域中的热门研究方向,它将基于单模态的情感分析扩展到基于多模态信息交流的环境。词级表示融合是建模跨模态信息交互的关键技术之一,旨在建模不同模态元素之间的相互作用。该任务面临两大挑战:模态元素之间的局部交互和时间维度上的全局交互。现有方法在建模局部交互时,常采用注意力机制刻画模态元素整体特征间的相关性,但忽视了相邻元素及局部特征间的交互作用,计算成本也较高。为解决上述问题,提出一种多通道时序卷积融合(MCTCF)模型,该方法运用二维卷积网络获取多模态元素之间的局部交互。其中,局部连接可捕获相邻元素的关联,多通道卷积可学习多模态元素局部特征之间的融合,权重共享大幅降低了计算量。在得到局部交互后的序列上,时序LSTM网络可进一步建模时间维度上的全局关联。在MOSI和MOSEI数据集上的大量实验证明了MCTCF的有效性与高效性。仅用一个卷积核(三通道,28个权重参数),在许多指标上取得了最先进或具有竞争力的结果。消融研究表明,局部卷积融合和全局时序建模都是提高性能的关键。该研究强化了词级表示融合,降低了计算复杂度。

关键词: 多模态, 情感分析, 词级表示融合, 二维卷积网络

Abstract: Multimodal sentiment analysis has become an active research direction in affective computing; it extends unimodal sentiment analysis to settings where information is conveyed through multiple modalities. Word-level representation fusion is a key technique for modeling cross-modal information interaction, aiming to capture the interplay between elements of different modalities. It faces two main challenges: local interactions between modal elements and global interactions along the temporal dimension. When modeling local interactions, existing methods often use attention mechanisms to capture correlations between the overall features of modal elements, but they ignore interactions between adjacent elements and between local features, and they are computationally expensive. To address these issues, a multi-channel temporal convolution fusion (MCTCF) model is proposed, which uses a 2D convolutional network to capture local interactions between multimodal elements. Specifically, local connectivity captures associations between neighboring elements, multi-channel convolution learns to fuse the local features of multimodal elements, and weight sharing greatly reduces computation. On the locally fused sequence, a temporal LSTM network further models global correlations along the temporal dimension. Extensive experiments on the MOSI and MOSEI datasets demonstrate the effectiveness and efficiency of MCTCF. Using only one convolution kernel (three channels, 28 weight parameters), it achieves state-of-the-art or competitive results on many metrics. Ablation studies show that both local convolution fusion and global temporal modeling are key to the performance gains. In summary, this work strengthens word-level representation fusion through feature interactions and reduces computational complexity.

Key words: multimodal, sentiment analysis, word-level representation fusion, 2D convolutional neural network
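
The local fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the three modalities (for example text, audio, and visual) are stacked as three input channels over a time × feature grid, and that the single kernel is 3×3. That kernel size is a guess: it is consistent with the stated 28 parameters (3 × 3 × 3 weights plus one bias), but the abstract does not give it. The function names, the toy sizes `T` and `D`, and the averaging kernel values are hypothetical, and the LSTM stage that would consume the fused sequence is omitted.

```python
import random


def conv2d_single_kernel(x, kernel, bias):
    """Valid 2D cross-correlation of a multi-channel input with one kernel.

    x:      nested list indexed as [channel][time][feature]
    kernel: nested list indexed as [channel][kh][kw]
    Returns a nested list of shape [time - kh + 1][feature - kw + 1].
    """
    channels = len(x)
    steps, feats = len(x[0]), len(x[0][0])
    kh, kw = len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(steps - kh + 1):
        row = []
        for f in range(feats - kw + 1):
            acc = bias
            # The same weights are reused at every (t, f) position (weight sharing),
            # and each output mixes neighbouring steps and features of all channels.
            for c in range(channels):
                for i in range(kh):
                    for j in range(kw):
                        acc += kernel[c][i][j] * x[c][t + i][f + j]
            row.append(acc)
        out.append(row)
    return out


def count_parameters(kernel):
    """Number of kernel weights plus one bias term."""
    return sum(len(row) for channel in kernel for row in channel) + 1


# Toy input: 3 modality channels, T time steps, D feature dimensions (assumed sizes).
random.seed(0)
T, D = 6, 5
x = [[[random.random() for _ in range(D)] for _ in range(T)] for _ in range(3)]

# One 3-channel 3x3 kernel (assumed size), initialised here as a plain average.
kernel = [[[1.0 / 27 for _ in range(3)] for _ in range(3)] for _ in range(3)]

fused = conv2d_single_kernel(x, kernel, bias=0.0)
n_params = count_parameters(kernel)  # 3 * 3 * 3 weights + 1 bias = 28
```

Under these assumptions, `fused` is a shorter sequence of locally fused rows (one per valid time position) that a recurrent layer could then process along the temporal dimension. The parameter count does not grow with `T` or `D`, which is the saving that weight sharing provides.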