Journal of Frontiers of Computer Science and Technology ›› 2023, Vol. 17 ›› Issue (4): 868-878. DOI: 10.3778/j.issn.1673-9418.2107010

• Graphics·Image •

Accurate Visual Tracking with Attention Feature

HU Shuo, YAO Meiyu, SUN Linna, WANG Jie, ZHOU Si'en

  1. School of Electrical Engineering, Yanshan University, Qinhuangdao, Hebei 066004, China
  • Online: 2023-04-01 Published: 2023-04-01

Abstract: In recent years, feature fusion has played a vital role in the accuracy and robustness of visual tracking systems. Traditional feature fusion methods usually fuse features by direct summation or by introducing an attention mechanism. Moreover, the classification network typically uses only a single layer of features, ignoring the importance of assigning appropriate weights to features at different levels for a robust model. To address this issue, a deep-learning-based tracking algorithm with attention fusion is proposed. First, an improved network structure based on ResNet is proposed: an attention mechanism is introduced to form an iterative attention module, and the original direct-addition fusion is replaced by attention feature fusion. The improved structure is more conducive to fusing features from different levels. Second, the third-layer and fourth-layer features extracted from the backbone network are fed into the classifier, and the resulting response maps are fused to obtain a coarse target position. Meanwhile, the extracted features are fed into the attention network, which assigns them different weights, and are then passed to the estimation network for accurate bounding-box regression. Experimental comparisons show that the precision and success rate of the algorithm are improved and that it is more robust to the various interferences that targets encounter in different scenes. The experiments demonstrate the effectiveness and efficiency of the tracker.

Key words: visual tracking, feature fusion, attention mechanism, nonlinear fusion
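
The two fusion steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the attention module or the response-map weights, so the gate below (a sigmoid of the per-channel global average), the blend weight `alpha`, and the function names `attention_fuse` and `fuse_responses` are all assumptions chosen for illustration. A trained tracker would learn the gate and would likely apply it iteratively.

```python
import math


def sigmoid(v):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def attention_fuse(x, y):
    """Fuse two feature maps shaped [channel][row][col] with per-channel gates.

    Assumed gate (hypothetical, not from the paper): the sigmoid of the
    global average of (x + y) in each channel. The output is the soft
    selection w * x + (1 - w) * y, which replaces the plain sum x + y.
    """
    fused = []
    for cx, cy in zip(x, y):
        total = sum(a + b for rx, ry in zip(cx, cy) for a, b in zip(rx, ry))
        count = sum(len(row) for row in cx)
        w = sigmoid(total / count)  # one attention weight per channel
        fused.append([[w * a + (1.0 - w) * b for a, b in zip(rx, ry)]
                      for rx, ry in zip(cx, cy)])
    return fused


def fuse_responses(r3, r4, alpha=0.5):
    """Blend two classifier response maps and locate the coarse peak.

    alpha is an assumed blend weight; returns (fused_map, (row, col)).
    """
    fused = [[alpha * a + (1.0 - alpha) * b for a, b in zip(ra, rb)]
             for ra, rb in zip(r3, r4)]
    cells = [(v, i, j) for i, row in enumerate(fused) for j, v in enumerate(row)]
    _, row, col = max(cells, key=lambda t: t[0])
    return fused, (row, col)


if __name__ == "__main__":
    # Toy response maps standing in for the layer-3 and layer-4 classifier outputs.
    layer3 = [[0.1, 0.9], [0.2, 0.3]]
    layer4 = [[0.2, 0.5], [0.8, 0.1]]
    _, peak = fuse_responses(layer3, layer4)
    print("coarse target position (row, col):", peak)
```

The soft selection `w * x + (1 - w) * y` keeps the fused feature on the same scale as its inputs, which is one reason gated fusion is often preferred over direct addition. Whether the paper uses this exact form is not stated in the abstract.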