Journal of Frontiers of Computer Science and Technology ›› 2017, Vol. 11 ›› Issue (12): 2033-2040. DOI: 10.3778/j.issn.1673-9418.1704047

• Artificial Intelligence and Pattern Recognition •


Image Caption Generation Model with Visual Attention and Dynamic Semantic Information Guiding

ZHANG Wei+, ZHOU Zhiping   

  1. Engineering Research Center of Internet of Things Technology Applications of Ministry of Education, Jiangnan University, Wuxi, Jiangsu 214122, China
  • Online:2017-12-01 Published:2017-12-07


Abstract: To address the problem that current image caption generation models describe the details of objects in an image inadequately, this paper proposes an image description model that combines dynamic semantic guidance of the image with an adaptive attention mechanism. Based on the word predicted at the previous time step, the adaptive attention mechanism selects the image region the model will attend to at the next time step. In addition, the model builds dense attribute information of the image as additional supervision, so that it can describe image content by jointly exploiting semantic information and attention information. The model is trained and tested on the Flickr8K and Flickr30K datasets and validated with several evaluation metrics. Experimental results show a clear performance gain: compared with the Guiding-Long Short-Term Memory model, the scores increase by 4.1, 1.8, 2.4, 0.8 and 3.1, improvements of 6.3%, 4.0%, 7.9%, 3.9% and 17.3%; compared with Soft-Attention, the scores increase by 1.9, 2.4, 3.3, 1.5 and 2.74, improvements of 2.8%, 5.5%, 11.1%, 7.5% and 14.8%.
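The per-step attention selection described in the abstract can be sketched as follows. This is a minimal, self-contained illustration in plain Python, not the authors' implementation: the function name, the use of dot-product scoring, and the "sentinel" vector (a common way to let adaptive attention fall back on language context rather than visual features) are all assumptions for illustration. At each decoding step the k image-region features are scored against the current hidden state, a sentinel score is appended, the scores are normalized by softmax, and the context vector is the weighted sum of region features plus the sentinel contribution.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def adaptive_attention(regions, hidden, sentinel):
    """One decoding step of adaptive attention (illustrative sketch).

    regions  : list of k region feature vectors from the image encoder
    hidden   : decoder hidden state at the current time step
    sentinel : visual-sentinel vector (language-context fallback; hypothetical)

    Returns (context, weights); the last weight is the mass assigned
    to the sentinel, i.e. how much the model ignores the visual input.
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    # score each region and the sentinel against the hidden state
    scores = [dot(r, hidden) for r in regions] + [dot(sentinel, hidden)]
    weights = softmax(scores)          # alpha_1..alpha_k, then beta
    beta = weights[-1]
    dim = len(regions[0])
    # context = sum_i alpha_i * v_i + beta * sentinel
    context = [
        sum(w * r[d] for w, r in zip(weights[:-1], regions)) + beta * sentinel[d]
        for d in range(dim)
    ]
    return context, weights

# toy usage: 3 image regions with 2-d features
ctx, w = adaptive_attention(
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    hidden=[1.0, 0.0],
    sentinel=[0.2, 0.2],
)
```

In the toy call, the first region is most aligned with the hidden state, so it receives the largest attention weight; in the full model the hidden state itself is produced by an LSTM conditioned on the previously predicted word, which is what makes the region selection change from step to step.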

Key words: image caption generation, image description, deep neural networks, visual attention mechanism, semantic information