Journal of Frontiers of Computer Science and Technology, 2024, Vol. 18, Issue (5): 1328-1338. DOI: 10.3778/j.issn.1673-9418.2301042

• Artificial Intelligence · Pattern Recognition •

Sentiment Analysis Combining Dynamic Gradient and Multi-view Co-attention

WANG Xiang, MAO Li, CHEN Qidong, SUN Jun   

  1. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu 214122, China
    2. College of Internet of Things Engineering, Wuxi University, Wuxi, Jiangsu 210044, China
  • Online: 2024-05-01  Published: 2024-04-29

Abstract: To address the problems of unbalanced inter-modality optimization and inadequate fusion of multimodal features in multimodal sentiment analysis, a multimodal sentiment analysis model combining a dynamic gradient mechanism and a multi-view co-attention mechanism (DG-MCM) is proposed, which can effectively mine unimodal representations and fully integrate multimodal information. Firstly, the model uses the pre-trained model BERT (bidirectional encoder representations from transformers) and stacked long short-term memory (SLSTM) networks to learn features of text, audio and video, and a dynamic gradient mechanism is proposed that assists the feature learning of each modality by monitoring the differences in the modalities' contributions to the learning objective and their learning speeds. Secondly, the features of the different modalities are fused with the multi-view co-attention mechanism: each pair of modalities is projected into multiple spaces for interaction, yielding more adequate fusion features. Finally, the fusion features and the unimodal features are concatenated for sentiment prediction. Experimental results on the CMU-MOSI and CMU-MOSEI datasets show that the model fully learns both unimodal information and cross-modal interactions, effectively improving the accuracy of multimodal sentiment analysis.

Key words: sentiment analysis, multimodal, attention mechanism, feature fusion
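
The abstract describes the dynamic gradient mechanism only at a high level: each modality's contribution difference and learning speed are monitored, and that signal assists the modality's feature learning. The sketch below illustrates one plausible reading of such a mechanism, per-modality gradient scaling driven by unimodal losses; the contribution proxy and the function names (contribution_ratios, scale_gradients) are illustrative assumptions, not the paper's formulation.

```python
import torch

def contribution_ratios(unimodal_losses):
    """Estimate each modality's contribution from its unimodal loss:
    a lower loss is read as a larger contribution (a faster learner)."""
    losses = torch.stack(unimodal_losses).detach()
    inv = 1.0 / (losses + 1e-8)          # contribution proxy
    return inv / inv.mean()              # ratio to the average modality

def scale_gradients(modality_params, ratios):
    """Damp the gradients of modalities that are learning faster than
    average, so the slower modalities are not under-optimized."""
    for params, r in zip(modality_params, ratios):
        coeff = torch.clamp(1.0 / r, max=1.0)  # only slow down, never boost
        for p in params:
            if p.grad is not None:
                p.grad.mul_(coeff)
```

In a training loop, scale_gradients would be called after loss.backward() and before optimizer.step(), with modality_params holding the parameter lists of the text, audio and video encoders.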
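Likewise, "projecting each pair of modalities into multiple spaces for interaction" suggests bidirectional cross-attention over several projection spaces. Below is a minimal sketch, assuming the multiple spaces are realized as attention heads; the class name, dimensions and pooling are assumptions rather than the published architecture.

```python
import torch
import torch.nn as nn

class MultiViewCoAttention(nn.Module):
    """Bidirectional cross-attention between two modalities; the `views`
    projection spaces are implemented here as attention heads."""
    def __init__(self, dim=128, views=4):
        super().__init__()
        self.a_to_b = nn.MultiheadAttention(dim, views, batch_first=True)
        self.b_to_a = nn.MultiheadAttention(dim, views, batch_first=True)

    def forward(self, feat_a, feat_b):
        # feat_a: (batch, len_a, dim); feat_b: (batch, len_b, dim)
        fused_a, _ = self.a_to_b(feat_a, feat_b, feat_b)  # a attends to b
        fused_b, _ = self.b_to_a(feat_b, feat_a, feat_a)  # b attends to a
        # Pool over time and concatenate both interaction directions.
        return torch.cat([fused_a.mean(dim=1), fused_b.mean(dim=1)], dim=-1)

# Example: fuse text and audio features.
co_attn = MultiViewCoAttention()
text = torch.randn(8, 50, 128)    # (batch, seq_len, dim)
audio = torch.randn(8, 400, 128)
fused = co_attn(text, audio)      # (8, 256)
```

Applied to each modality pair (text-audio, text-video, audio-video), the pairwise fusion vectors would then be concatenated with the unimodal features for the final sentiment prediction, as the abstract describes.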
