Journal of Frontiers of Computer Science and Technology

• Academic Research •

Combining Two Granularity Image Information for Multi-modal Aspect-Based Sentiment Analysis

XU Wei, ZHANG Xiaolin, ZHANG Huanxiang, ZHANG Jing   

  1. School of Digital and Intelligence Industry, Inner Mongolia University of Science and Technology, Baotou, Inner Mongolia 014010, China
  2. School of Innovation and Entrepreneurship Education, Inner Mongolia University of Science and Technology, Baotou, Inner Mongolia 014010, China
  3. School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
  4. School of Science, Inner Mongolia University of Science and Technology, Baotou, Inner Mongolia 014010, China

Abstract: Multimodal aspect-based sentiment analysis (MABSA) is a fine-grained sentiment analysis technique that aims to improve accuracy and effectiveness by integrating feature data from multiple modalities. Most existing research on MABSA focuses on cross-modal alignment between the text and image modalities, overlooking the potential contributions of coarse- and fine-grained image feature information to the MABSA subtasks. To address this, this paper proposes a multimodal aspect-based sentiment analysis method that combines two granularities of image information (Combining Two Granularity Image Information for Multi-Modal Aspect-Based Sentiment Analysis, CTGI). Specifically, in the multimodal aspect term extraction task, to enhance the interaction between the image and text modalities, ClipCap is used to generate a coarse-grained textual description of the image, which serves as an image prompt to assist the model in predicting the aspect terms in the text and their attributes. In multimodal aspect sentiment classification, to capture rich fine-grained image sentiment features, a cross-modal attention mechanism puts the low-level image features, which carry the original sentiment semantics, through multiple layers of deep interaction with the masked text, strengthening the fusion of image features into text features. Experimental results on two public Twitter datasets and the Restaurant+ dataset show that CTGI outperforms current baseline models, validating the rationality of assigning different contributions of coarse- and fine-grained image information to the MABSA subtasks.
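
To make the coarse-grained branch concrete, the sketch below captions the image and splices the caption into the tweet as a textual prompt for aspect term extraction. The abstract does not specify the captioner weights, prompt template, or downstream extractor, so a public vision-encoder-decoder captioning model stands in for ClipCap and the template string is a hypothetical example.

```python
# Sketch of the coarse-grained branch: caption the image, then prepend the
# caption to the tweet as an image prompt for aspect term extraction.
# "nlpconnect/vit-gpt2-image-captioning" is a stand-in for ClipCap; the
# prompt template below is an illustrative assumption.
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

captioner = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
processor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
cap_tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

def coarse_grained_prompt(image_path: str, tweet: str) -> str:
    """Generate a coarse-grained image caption and fuse it with the tweet text."""
    pixel_values = processor(Image.open(image_path).convert("RGB"),
                             return_tensors="pt").pixel_values
    ids = captioner.generate(pixel_values, max_new_tokens=32, num_beams=4)
    caption = cap_tokenizer.decode(ids[0], skip_special_tokens=True).strip()
    # The caption acts as an image prompt: the aspect term extractor sees it
    # alongside the original text and can align aspect terms against it.
    return f"image description: {caption} </s> text: {tweet}"
```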
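
The fine-grained branch can likewise be sketched as stacked cross-modal attention in which the masked-text hidden states act as queries over low-level image region features. The hidden size, head count, and layer depth below are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal sketch of the fine-grained branch: multi-layer cross-modal attention
# fusing low-level image features (key/value) into masked-text states (query).
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 8, n_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True)
             for _ in range(n_layers)]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_layers)])

    def forward(self, text_states: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_states: (batch, text_len, d_model)  masked-text hidden states
        # image_feats: (batch, regions, d_model)   low-level image features
        h = text_states
        for attn, norm in zip(self.layers, self.norms):
            # Text queries attend to image keys/values; the residual connection
            # keeps the textual signal while injecting image sentiment cues.
            fused, _ = attn(query=h, key=image_feats, value=image_feats)
            h = norm(h + fused)
        return h

# Usage: fuse 49 region features (e.g., a 7x7 CNN grid) into 32 text tokens.
fusion = CrossModalFusion()
out = fusion(torch.randn(2, 32, 768), torch.randn(2, 49, 768))
print(out.shape)  # torch.Size([2, 32, 768])
```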

Key words: multimodal aspect-based sentiment analysis, two granularity image information, multimodal interaction, multimodal fusion, cross-modal attention