Journal of Frontiers of Computer Science and Technology

• Academic Research •

Combining Two Granularity Image Information for Multi-modal Aspect-Based Sentiment Analysis

XU Wei, ZHANG Xiaolin, ZHANG Huanxiang, ZHANG Jing   

  1. School of Digital and Intelligence Industry, Inner Mongolia University of Science and Technology, Baotou, Inner Mongolia 014010, China
  2. School of Innovation and Entrepreneurship Education, Inner Mongolia University of Science and Technology, Baotou, Inner Mongolia 014010, China
  3. School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
  4. School of Science, Inner Mongolia University of Science and Technology, Baotou, Inner Mongolia 014010, China

Abstract: Multimodal aspect-based sentiment analysis (MABSA) is a fine-grained sentiment analysis technique that aims to improve accuracy and effectiveness by integrating feature data from multiple modalities. Most existing research on MABSA focuses on cross-modal alignment between the text and image modalities, overlooking the potential contributions of coarse- and fine-grained image feature information to the MABSA subtasks. To address this, this paper proposes a multimodal aspect-based sentiment analysis method that combines two granularities of image information (Combining Two Granularity Image Information for Multi-Modal Aspect-Based Sentiment Analysis, CTGI). Specifically, in the multimodal aspect term extraction task, to enhance the interaction between the image and text modalities, ClipCap is used to generate a coarse-grained textual description of the image, which serves as an image prompt to assist the model in predicting the aspect terms in the text and their attributes. In multimodal aspect sentiment classification, to capture rich fine-grained image sentiment features, a cross-modal attention mechanism puts the low-level image features, which carry the original sentiment semantics, through multiple layers of deep interaction with the masked text, strengthening the fusion of image features into text features. Experimental results on two public Twitter datasets and the Restaurant+ dataset show that CTGI outperforms current baseline models, validating the rationality of assigning different contributions of coarse- and fine-grained image information to the MABSA subtasks.
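
To make the coarse-grained branch concrete, the sketch below captions the image and splices the caption into the tweet as a textual prompt for aspect term extraction. The abstract does not specify the captioner weights, prompt template, or downstream extractor, so a public vision-encoder-decoder captioning model stands in for ClipCap and the template string is a hypothetical example.

```python
# Sketch of the coarse-grained branch: caption the image, then prepend the
# caption to the tweet as an image prompt for aspect term extraction.
# "nlpconnect/vit-gpt2-image-captioning" is a stand-in for ClipCap; the
# prompt template below is an illustrative assumption.
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

captioner = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
processor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
cap_tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

def coarse_grained_prompt(image_path: str, tweet: str) -> str:
    """Generate a coarse-grained image caption and fuse it with the tweet text."""
    pixel_values = processor(Image.open(image_path).convert("RGB"),
                             return_tensors="pt").pixel_values
    ids = captioner.generate(pixel_values, max_new_tokens=32, num_beams=4)
    caption = cap_tokenizer.decode(ids[0], skip_special_tokens=True).strip()
    # The caption acts as an image prompt: the aspect term extractor sees it
    # alongside the original text and can align aspect terms against it.
    return f"image description: {caption} </s> text: {tweet}"
```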
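
The fine-grained branch can likewise be sketched as stacked cross-modal attention in which the masked-text hidden states act as queries over low-level image region features. The hidden size, head count, and layer depth below are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal sketch of the fine-grained branch: multi-layer cross-modal attention
# fusing low-level image features (key/value) into masked-text states (query).
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 8, n_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True)
             for _ in range(n_layers)]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_layers)])

    def forward(self, text_states: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_states: (batch, text_len, d_model)  masked-text hidden states
        # image_feats: (batch, regions, d_model)   low-level image features
        h = text_states
        for attn, norm in zip(self.layers, self.norms):
            # Text queries attend to image keys/values; the residual connection
            # keeps the textual signal while injecting image sentiment cues.
            fused, _ = attn(query=h, key=image_feats, value=image_feats)
            h = norm(h + fused)
        return h

# Usage: fuse 49 region features (e.g., a 7x7 CNN grid) into 32 text tokens.
fusion = CrossModalFusion()
out = fusion(torch.randn(2, 32, 768), torch.randn(2, 49, 768))
print(out.shape)  # torch.Size([2, 32, 768])
```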

Key words: multimodal aspect-based sentiment analysis, two granularity image information, multimodal interaction, multimodal fusion, cross-modal attention