Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (11): 3041-3050. DOI: 10.3778/j.issn.1673-9418.2309071

• Artificial Intelligence·Pattern Recognition •

Multi-channel Temporal Convolution Fusion for Multimodal Sentiment Analysis

SUN Jie, CHE Wengang, GAO Shengxiang   

  1. School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
  • Online: 2024-11-01  Published: 2024-10-31


Abstract: Multimodal sentiment analysis, which extends unimodal sentiment analysis to multimodal settings through information fusion, has become a hot research direction in affective computing. Word-level representation fusion is a key technique for modeling cross-modal interactions: it captures the interplay between elements of different modalities. It faces two main challenges: modeling local interactions between modal elements and modeling global interactions along the temporal dimension. When modeling local interactions, existing methods often adopt attention mechanisms to capture correlations between the overall features of different modalities, but they ignore interactions between adjacent elements and between local features, and they are computationally expensive. To address these issues, a multi-channel temporal convolution fusion (MCTCF) model is proposed, which uses 2D convolutions to obtain local interactions between modal elements. Specifically, local connections capture associations between neighboring elements, multi-channel convolutions learn to fuse local features across modalities, and weight sharing greatly reduces computation. On the locally fused sequences, a temporal LSTM network further models global correlations along the temporal dimension. Extensive experiments on the MOSI and MOSEI datasets demonstrate the efficacy and efficiency of MCTCF: with just one convolution kernel (three channels, 28 weight parameters), it achieves state-of-the-art or competitive results on many metrics. Ablation studies confirm that both local convolution fusion and global temporal modeling are crucial to this performance. In summary, this work enhances word-level representation fusion through feature interactions while reducing computational complexity.
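The core fusion idea from the abstract can be illustrated with a minimal numpy sketch: three modality feature sequences are stacked as channels of a single "image", and one shared 3×3 kernel slides over all three channels at once, fusing local neighborhoods across modalities. The sequence length, feature dimension, and kernel size below are illustrative assumptions, not the paper's exact configuration; note that one three-channel 3×3 kernel plus a bias gives 3×3×3 + 1 = 28 weight parameters, matching the count quoted in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical word-level feature sequences for three modalities,
# each of shape (T, D): T time steps (words), D feature dimensions.
T, D = 20, 32
text = rng.standard_normal((T, D))
audio = rng.standard_normal((T, D))
vision = rng.standard_normal((T, D))

# Stack the modalities as channels: shape (3, T, D).
x = np.stack([text, audio, vision])

# A single 3x3 kernel spanning all three channels, with one bias:
# 3 * 3 * 3 = 27 weights + 1 bias = 28 parameters in total.
kernel = rng.standard_normal((3, 3, 3)) * 0.1
bias = 0.0
n_params = kernel.size + 1
print(n_params)  # 28

# Valid (no-padding) 2D convolution: each output value fuses a local
# 3x3 neighborhood of adjacent words and features across all modalities,
# with the same shared weights at every position.
out = np.zeros((T - 2, D - 2))
for i in range(T - 2):
    for j in range(D - 2):
        out[i, j] = np.sum(x[:, i:i + 3, j:j + 3] * kernel) + bias
print(out.shape)  # (18, 30)
```

In the full model, the rows of `out` form the locally fused sequence that an LSTM would then process along the time axis to capture global temporal correlations; the explicit Python loops here stand in for what a framework convolution would do in one vectorized call.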

Key words: multimodal, sentiment analysis, word-level representation fusion, 2D convolutional neural network
