Journal of Frontiers of Computer Science and Technology ›› 2022, Vol. 16 ›› Issue (4): 909-916.DOI: 10.3778/j.issn.1673-9418.2105071
• Artificial Intelligence •
BAO Guangbin, LI Gangle+, WANG Guoxiong
Received: 2021-05-19
Revised: 2021-08-03
Online: 2022-04-01
Published: 2021-08-05
About author: BAO Guangbin, born in 1975 in Lanzhou, Gansu, Ph.D., associate professor. His research interests include big data analysis and natural language processing.
Corresponding author: + E-mail: 1450316716@qq.com
BAO Guangbin, LI Gangle, WANG Guoxiong. Bimodal Interactive Attention for Multimodal Sentiment Analysis[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(4): 909-916.
URL: http://fcst.ceaj.org/EN/10.3778/j.issn.1673-9418.2105071
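The bimodal interactive attention named in the title lets each pair of modalities attend to one another before fusion. The following NumPy sketch illustrates the general cross-modal attention mechanism only; the dot-product scoring, the feature dimensions, and the function names are illustrative assumptions, not the authors' exact Con-BIAM formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bimodal_interactive_attention(X, Y):
    """Cross-modal attention between two modality sequences.

    X: (n, d) utterance features from modality 1 (e.g. text)
    Y: (m, d) utterance features from modality 2 (e.g. audio)
    Returns each modality's representation re-weighted by the other.
    """
    M = X @ Y.T                    # (n, m) cross-modal affinity matrix
    A_xy = softmax(M, axis=1)      # attention of X over Y (rows sum to 1)
    A_yx = softmax(M.T, axis=1)    # attention of Y over X
    X_att = A_xy @ Y               # Y-informed view of X, shape (n, d)
    Y_att = A_yx @ X               # X-informed view of Y, shape (m, d)
    return X_att, Y_att

text = np.random.randn(5, 50)      # 5 utterances, 50-dim word vectors
audio = np.random.randn(5, 50)
t_att, a_att = bimodal_interactive_attention(text, audio)
```

In a full model, the attended and original features would be combined (e.g. concatenated) and passed to the classification layers.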
| Parameter | Value |
|---|---|
| Word embedding dimension | 50 |
| BiGRU hidden units | 300 |
| Fully connected layer neurons | 100 |
| Dropout | 0.5 |
| Learning rate | 0.001 |
| Batch size | 32 |
| Epochs | 30 |
| Optimizer | Adam |
| Loss function | Categorical cross-entropy |

Table 1 Experimental parameter settings
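The settings in Table 1 can be gathered into a single training configuration, e.g. as a plain Python dict (the key names here are illustrative, not taken from the authors' code):

```python
# Hyperparameters from Table 1, collected as one config object.
CONFIG = {
    "embedding_dim": 50,          # word embedding dimension
    "bigru_units": 300,           # BiGRU hidden units
    "dense_units": 100,           # fully connected layer neurons
    "dropout": 0.5,
    "learning_rate": 0.001,
    "batch_size": 32,
    "epochs": 30,
    "optimizer": "adam",
    "loss": "categorical_crossentropy",
}
```

Keeping all hyperparameters in one place makes the ablation runs reported later reproducible from a single point of change.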
| Model | Accuracy | F1 |
|---|---|---|
| GME-LSTM | 76.50 | 73.40 |
| MARN | 77.10 | 77.00 |
| TFN | 77.10 | 77.90 |
| Dialogue-RNN | 79.80 | 79.48 |
| BC-LSTM | 80.30 | — |
| Multilogue-Net | 81.19 | 80.10 |
| Con-BIAM | 81.91 | 85.40 |

Table 2 Experimental results on MOSI dataset (%)
| Modalities | Dialogue-RNN | Multilogue-Net | Con-BIAM |
|---|---|---|---|
| T+A | 79.80 | 80.18 | 80.45 |
| V+T | 78.90 | 80.06 | 80.98 |
| A+V | 73.90 | 75.16 | 63.96 |
| A+V+T | 79.80 | 81.19 | 81.91 |

Table 3 Accuracy of different models in bimodal and trimodal feature fusion (%)
| Modalities | Dialogue-RNN | Multilogue-Net | Con-BIAM |
|---|---|---|---|
| T+A | 78.32 | 79.88 | 84.14 |
| V+T | 78.12 | 79.84 | 84.43 |
| A+V | 73.92 | 74.04 | 75.20 |
| A+V+T | 79.48 | 80.10 | 85.40 |

Table 4 F1 scores of different models in bimodal and trimodal feature fusion (%)
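The bimodal (T+A, V+T, A+V) and trimodal (A+V+T) settings compared in Tables 3 and 4 can be mimicked with a simple concatenation-based late fusion; this is only a generic sketch of the ablation structure, not the paper's actual fusion network.

```python
import numpy as np

def trimodal_fusion(t, a, v):
    """Fuse the three pairwise modality combinations by concatenation,
    mirroring the T+A, V+T, and A+V settings of Tables 3-4.

    t, a, v: (batch, d) text, audio, and visual utterance features.
    Returns a (batch, 6*d) trimodal representation.
    """
    ta = np.concatenate([t, a], axis=-1)   # T+A pair
    vt = np.concatenate([v, t], axis=-1)   # V+T pair
    av = np.concatenate([a, v], axis=-1)   # A+V pair
    return np.concatenate([ta, vt, av], axis=-1)

t = np.ones((4, 8))
a = np.ones((4, 8))
v = np.ones((4, 8))
fused = trimodal_fusion(t, a, v)           # shape (4, 48)
```

Dropping one pair from the final concatenation reproduces the corresponding bimodal ablation row.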
[1] | GHOSAL D, AKHTAR M S, CHAUHAN D S, et al. Contextual inter-modal attention for multi-modal sentiment analysis[C]// Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Oct 31-Nov 4, 2018. Stroudsburg: ACL, 2018: 3454-3466. |
[2] | LIN M H, MENG Z Q. Multimodal sentiment analysis based on attention neural network[J]. Computer Science, 2020, 47(S2): 508-514. |
[3] | LIU J M, ZHANG P X, LIU Y, et al. Summary of multi-modal sentiment analysis technology[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(7): 1165-1182. |
[4] | HE J, ZHANG C Q, LI X Z, et al. Survey of research on multimodal fusion technology for deep learning[J]. Computer Engineering, 2020, 46(5): 1-11. |
[5] | PORIA S, CAMBRIA E, HAZARIKA D, et al. Multi-level multiple attentions for contextual multimodal sentiment analysis[C]// Proceedings of the 2017 IEEE International Conference on Data Mining, New Orleans, Nov 18-21, 2017. Washington: IEEE Computer Society, 2017: 1033-1038. |
[6] | KUMAR A, VEPA J. Gated mechanism for attention based multi modal sentiment analysis[C]// Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, May 4-8, 2020. Piscataway:IEEE, 2020: 4477-4481. |
[7] | LIN Z, FENG M, SANTOS C N, et al. A structured self-attentive sentence embedding[J]. arXiv:1703.03130, 2017. |
[8] | ZADEH A, ZELLERS R, PINCUS E, et al. Multimodal sentiment intensity analysis in videos: facial gestures and verbal messages[J]. IEEE Intelligent Systems, 2016, 31(6): 82-88. |
[9] | CHEN M H, WANG S, LIANG P P, et al. Multimodal sentiment analysis with word-level fusion and reinforcement learning[C]// Proceedings of the 19th ACM International Conference on Multimodal Interaction, Glasgow, Nov 13-17, 2017. New York: ACM, 2017: 163-171. |
[10] | ZHANG Y Z, RONG L, SONG D W, et al. A survey on multimodal sentiment analysis[J]. Pattern Recognition and Artificial Intelligence, 2020, 33(5): 426-438. |
[11] | ZADEH A, CHEN M H, PORIA S, et al. Tensor fusion network for multimodal sentiment analysis[C]// Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Sep 9-11, 2017. Stroudsburg: ACL, 2017: 1103-1114. |
[12] | ZADEH A, LIANG P P, PORIA S, et al. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph[C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Jul 15-20, 2018. Stroudsburg: ACL, 2018: 2236-2246. |
[13] | PORIA S, CAMBRIA E, HAZARIKA D, et al. Context-dependent sentiment analysis in user-generated videos[C]// Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Jul 30-Aug 4, 2017. Stroudsburg: ACL, 2017: 873-883. |
[14] | MAJUMDER N, PORIA S, HAZARIKA D, et al. DialogueRNN: an attentive RNN for emotion detection in conversations[C]// Proceedings of the 2019 AAAI Conference on Artificial Intelligence, Honolulu, Jan 27-Feb 1, 2019. Palo Alto: AAAI, 2019: 6818-6825. |
[15] | SHENOY A, SARDANA A. Multilogue-Net: a context aware RNN for multi-modal emotion detection and sentiment analysis in conversation[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Seattle, Jul 5-10, 2020. Stroudsburg: ACL, 2020: 19-28. |
[16] | KIM T, LEE B. Multi-attention multimodal sentiment analysis[C]// Proceedings of the 2020 International Conference on Multimedia Retrieval, Dublin, Jun 8-11, 2020. New York: ACM, 2020: 436-441. |
[17] | ZADEH A, LIANG P P, PORIA S, et al. Multi-attention recurrent network for human communication comprehension[C]// Proceedings of the 2018 AAAI Conference on Artificial Intelligence, New Orleans, Feb 2-7, 2018. Palo Alto: AAAI, 2018: 5642-5649. |
[18] | XI C, LU G M, YAN J J. Multimodal sentiment analysis based on multi-head attention mechanism[C]// Proceedings of the 4th International Conference on Machine Learning and Soft Computing, Haiphong City, Jan 17-19, 2020. New York: ACM, 2020: 34-39. |
[19] | VERMA S, WANG J W, GE Z F, et al. Deep-HOSeq: deep higher order sequence fusion for multimodal sentiment analysis[J]. arXiv:2010.08218, 2020. |
[20] | TACHIBANA H, UENOYAMA K, AIHARA S. Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention[C]// Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Apr 15-20, 2018. Piscataway: IEEE, 2018: 4784-4788. |
[21] | EYBEN F, WÖLLMER M, SCHULLER B W. openSMILE: the Munich versatile and fast open-source audio feature extractor[C]// Proceedings of the 18th ACM International Conference on Multimedia, Firenze, Oct 25-29, 2010. New York: ACM, 2010: 1459-1462. |