Multimodal Sentiment Analysis with Composite Hierarchical Fusion

doi:10.3778/j.issn.1673-9418.2111004

Abstract

Traditional sentiment analysis methods cannot handle the sentiment expressed in short videos, and existing multimodal sentiment analysis methods suffer from low accuracy and poor interaction between the information of different modalities. To address these problems, a multimodal sentiment analysis model with composite hierarchical fusion is established, combining a temporal convolutional network (TCN) with a soft-attention mechanism. The model first equalizes the dimensions of the text features, facial (video) features, and audio features extracted from the video, and then fuses the features of the different modalities in a composite manner: the unimodal features are fused pairwise to obtain bimodal features, the three bimodal features are fused to obtain the trimodal feature, and the trimodal feature is finally fused with each unimodal feature to obtain the final multimodal sentiment feature. After each fusion, the fused features pass through a TCN layer for sequence feature extraction. The final multimodal feature is filtered and reduced in dimensionality by the attention mechanism and then used for sentiment classification to obtain the prediction results. Experiments on the CMU-MOSI (CMU multimodal opinion level sentiment intensity) and CMU-MOSEI (CMU multimodal opinion sentiment and emotion intensity) datasets show that the model makes full use of the interaction information between modalities and effectively improves the accuracy of multimodal sentiment analysis.

Key words: temporal convolutional networks, feature fusion, multimodal sentiment analysis, attention mechanism
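
The fusion order described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the fusion operator (concatenation followed by a linear projection), the depthwise causal dilated convolution standing in for a full TCN layer, the random weights, the feature sizes, and all function names (`linear`, `fuse`, `causal_conv`, `soft_attention`, `composite_fusion`) are assumptions made only to show the data flow.

```python
import math
import random


def make_weights(d_in, d_out, rng):
    """Random projection matrix (hypothetical stand-in for trained weights)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(d_out)] for _ in range(d_in)]


def linear(x, w):
    """Project every time step of x (T x d_in) with w (d_in x d_out)."""
    return [[sum(xi * w[i][j] for i, xi in enumerate(row))
             for j in range(len(w[0]))] for row in x]


def concat(*seqs):
    """Concatenate the feature vectors of several sequences step by step."""
    return [sum((list(s[t]) for s in seqs), []) for t in range(len(seqs[0]))]


def causal_conv(x, kernel, dilation=1):
    """Depthwise causal dilated convolution followed by ReLU.

    The output at step t only sees x[t], x[t - dilation], x[t - 2*dilation], ...
    This is a simplified stand-in for a TCN layer (no residual blocks).
    """
    out = []
    for t in range(len(x)):
        row = []
        for c in range(len(x[0])):
            acc = 0.0
            for k, wk in enumerate(kernel):
                s = t - k * dilation
                if s >= 0:  # positions before the sequence start act as zero padding
                    acc += wk * x[s][c]
            row.append(max(acc, 0.0))
        out.append(row)
    return out


def fuse(seqs, d, rng):
    """Assumed fusion operator: concatenate, project to d dims, apply the TCN stand-in."""
    z = concat(*seqs)
    z = linear(z, make_weights(len(z[0]), d, rng))
    return causal_conv(z, kernel=[0.5, 0.3, 0.2], dilation=2)


def soft_attention(x, query):
    """Softmax-weighted sum over time steps; returns (pooled vector, weights)."""
    scores = [sum(a * b for a, b in zip(row, query)) for row in x]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * row[c] for w, row in zip(weights, x))
              for c in range(len(x[0]))]
    return pooled, weights


def composite_fusion(text, audio, video, d=4, seed=0):
    rng = random.Random(seed)
    # 1. Dimension equalization: map every modality to d features per step.
    t = linear(text, make_weights(len(text[0]), d, rng))
    a = linear(audio, make_weights(len(audio[0]), d, rng))
    v = linear(video, make_weights(len(video[0]), d, rng))
    # 2. Unimodal -> three bimodal features.
    ta, tv, av = fuse([t, a], d, rng), fuse([t, v], d, rng), fuse([a, v], d, rng)
    # 3. Three bimodal -> one trimodal feature.
    tav = fuse([ta, tv, av], d, rng)
    # 4. Trimodal + each unimodal -> final multimodal feature.
    final = fuse([tav, t, a, v], d, rng)
    # 5. Attention filters the sequence into one vector for the classifier.
    query = [rng.uniform(-0.5, 0.5) for _ in range(d)]
    return soft_attention(final, query)


# Toy input: 5 time steps; 6 text, 3 audio, and 5 video features per step.
rng = random.Random(1)
text = [[rng.random() for _ in range(6)] for _ in range(5)]
audio = [[rng.random() for _ in range(3)] for _ in range(5)]
video = [[rng.random() for _ in range(5)] for _ in range(5)]
pooled, weights = composite_fusion(text, audio, video)
```

Here `pooled` has `d` entries and would feed a sentiment classifier, while `weights` holds one attention weight per time step. A trained model would learn the projection, convolution, and attention parameters instead of drawing them at random.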

WANG Xuyang, DONG Shuai, SHI Jie. Multimodal Sentiment Analysis with Composite Hierarchical Fusion[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(1): 198-208.

