Journal of Frontiers of Computer Science and Technology ›› 2025, Vol. 19 ›› Issue (9): 2479-2492. DOI: 10.3778/j.issn.1673-9418.2407116

• Artificial Intelligence · Pattern Recognition •

Combining Dual-Granularity Image Information for Multimodal Aspect-Based Sentiment Analysis

XU Wei, ZHANG Xiaolin, ZHANG Huanxiang, ZHANG Jing   

1. School of Digital and Intelligence Industry, Inner Mongolia University of Science and Technology, Baotou, Inner Mongolia 014010, China
    2. School of Innovation and Entrepreneurship Education, Inner Mongolia University of Science and Technology, Baotou, Inner Mongolia 014010, China
    3. School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
    4. School of Science, Inner Mongolia University of Science and Technology, Baotou, Inner Mongolia 014010, China
  • Online: 2025-09-01  Published: 2025-09-01

Abstract: Multimodal aspect-based sentiment analysis (MABSA) is a fine-grained sentiment analysis technique that aims to improve accuracy and effectiveness by integrating features from multiple modalities. Most existing MABSA research focuses on cross-modal alignment between the text and image modalities, overlooking the potential contribution of coarse- and fine-grained image features to the MABSA subtasks. Therefore, this paper proposes a multimodal aspect-based sentiment analysis method combining dual-granularity image information (CDGI). Specifically, in the multimodal aspect term extraction task, to enhance the interaction between the image and text modalities, ClipCap is used to generate a coarse-grained textual description of the image, which serves as an image prompt that helps the model predict the aspect terms and their attributes in the text. In the multimodal aspect sentiment classification task, to capture rich fine-grained sentiment features from images, a cross-modal attention mechanism lets the low-level image features, which carry the original sentiment semantics, interact deeply with the masked text over multiple layers, strengthening the fusion of image features into the text features. Experimental results on two public Twitter datasets and the Restaurant+ dataset show that CDGI outperforms current baseline models, validating that coarse- and fine-grained image features contribute differently to the MABSA subtasks.
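
The coarse-grained branch described above can be pictured with a short sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: an off-the-shelf captioning pipeline stands in for ClipCap, and the model name, the separator token, and the build_prompted_input helper are all hypothetical; the resulting prompt would then be passed to the aspect term extraction model.

    # Minimal sketch of the coarse-grained branch (assumptions, not the paper's code):
    # a generic image-captioning pipeline stands in for ClipCap.
    from transformers import pipeline

    captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

    def build_prompted_input(image_path: str, text: str) -> str:
        # Generate a coarse-grained textual description of the image.
        caption = captioner(image_path)[0]["generated_text"]
        # Prepend the caption as an image prompt ahead of the sentence text;
        # the "</s>" separator is an illustrative choice.
        return caption + " </s> " + text

    prompted = build_prompted_input("tweet_image.jpg", "The pasta here is amazing!")
    # `prompted` would then be fed to the aspect term extraction model.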
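
The fine-grained branch centers on stacked cross-modal attention. The sketch below assumes masked-text token features as queries and low-level image patch features as keys and values; the hidden size, head count, layer count, and the CrossModalFusion name are illustrative assumptions, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class CrossModalFusion(nn.Module):
        # Masked-text features act as queries; low-level image features act as
        # keys/values, so visual sentiment cues are fused into the text stream.
        def __init__(self, dim=768, heads=8, layers=3):  # sizes are assumptions
            super().__init__()
            self.attns = nn.ModuleList(
                nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(layers)
            )
            self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(layers))

        def forward(self, text_feats, image_feats):
            # text_feats: (B, L_t, dim); image_feats: (B, L_v, dim)
            h = text_feats
            for attn, norm in zip(self.attns, self.norms):
                fused, _ = attn(query=h, key=image_feats, value=image_feats)
                h = norm(h + fused)  # residual keeps textual semantics intact
            return h

    text = torch.randn(2, 32, 768)   # stand-in for masked-text encoder output
    image = torch.randn(2, 49, 768)  # stand-in for low-level visual features
    print(CrossModalFusion()(text, image).shape)  # torch.Size([2, 32, 768])

The residual connection around each attention layer preserves the textual semantics while visual sentiment cues are progressively fused in over the stacked layers.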

Key words: multimodal aspect-based sentiment analysis, dual-granularity image information, multimodal interaction, multimodal fusion, cross-modal attention