Accurate Visual Tracking with Attention Feature

doi:10.3778/j.issn.1673-9418.2107010

Abstract

Abstract: In recent years, feature fusion has come to play a vital role in the accuracy and robustness of visual tracking systems. Traditional feature fusion methods usually combine features by direct summation or by introducing an attention mechanism. Moreover, the classification network typically uses only a single layer of features, overlooking the importance of assigning appropriate weights to features at different levels for a robust model. To address this issue, a deep-learning-based tracking algorithm with attention fusion is proposed. First, an improved network structure based on ResNet is proposed: an attention mechanism is introduced to form an iterative attention module, and the original direct-addition fusion is replaced by attention feature fusion. The improved structure is more conducive to fusing features from different levels. Second, the third-layer and fourth-layer features extracted from the backbone network are fed to the classifier, and the resulting response maps are fused to obtain a coarse target position. At the same time, the extracted features are fed into the attention network to be assigned different weights, and are then passed to the estimation network for precise bounding-box regression. Experimental comparisons show that the precision and success rate of the algorithm are both improved, and that it is more robust to the various kinds of interference affecting targets in different scenes. The experiments demonstrate the effectiveness and efficiency of the proposed tracker.

Key words: visual tracking, feature fusion, attention mechanism, nonlinear fusion
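
The abstract describes two fusion steps without giving formulas: replacing direct addition with attention-weighted fusion, and combining the response maps of two feature layers to get a coarse position. The NumPy sketch below is one possible reading of those steps, not the authors' implementation. It assumes a soft-selection form z = m * x + (1 - m) * y, where m comes from a simple channel attention over the features, and a plain weighted average of the two response maps. All function names, the reduction ratio `r`, the blend weight `alpha`, and the random weights are hypothetical.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def channel_attention(feat, w1, w2):
    """Per-channel weights in (0, 1) from a (C, H, W) feature map (assumed form)."""
    pooled = feat.mean(axis=(1, 2))              # global average pooling -> (C,)
    hidden = np.maximum(w1 @ pooled, 0.0)        # bottleneck + ReLU -> (C // r,)
    return sigmoid(w2 @ hidden)[:, None, None]   # (C, 1, 1), broadcasts over H, W


def attention_fuse(x, y, w1, w2):
    """Soft selection between x and y instead of the plain sum x + y."""
    m = channel_attention(x + y, w1, w2)
    return m * x + (1.0 - m) * y


def iterative_attention_fuse(x, y, params_a, params_b):
    """Two-stage variant: a first fusion replaces x + y as the attention input."""
    initial = attention_fuse(x, y, *params_a)
    m = channel_attention(initial, *params_b)
    return m * x + (1.0 - m) * y


def fuse_response_maps(r3, r4, alpha=0.5):
    """Blend two same-sized score maps and return the coarse peak location."""
    fused = alpha * r3 + (1.0 - alpha) * r4
    row, col = np.unravel_index(np.argmax(fused), fused.shape)
    return fused, (int(row), int(col))


# Hypothetical shapes and random weights, for illustration only.
rng = np.random.default_rng(0)
C, H, W, r = 8, 5, 5, 4
x = rng.standard_normal((C, H, W))
y = rng.standard_normal((C, H, W))
params_a = (rng.standard_normal((C // r, C)), rng.standard_normal((C, C // r)))
params_b = (rng.standard_normal((C // r, C)), rng.standard_normal((C, C // r)))

z = iterative_attention_fuse(x, y, params_a, params_b)

# Two toy response maps standing in for the layer-3 and layer-4 classifier outputs.
r3 = np.zeros((4, 4))
r3[1, 2] = 1.0
r4 = np.zeros((4, 4))
r4[1, 2] = 0.8
r4[3, 0] = 0.9
fused_map, peak = fuse_response_maps(r3, r4)
```

Because the attention weights lie strictly between 0 and 1, every fused value falls between the corresponding values of the two inputs. A plain sum gives no such per-channel control over how much each level contributes.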

胡硕, 姚美玉, 孙琳娜, 王洁, 周思恩. 融合注意力特征的精确视觉跟踪[J]. 计算机科学与探索, 2023, 17(4): 868-878.

HU Shuo, YAO Meiyu, SUN Linna, WANG Jie, ZHOU Si'en. Accurate Visual Tracking with Attention Feature[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(4): 868-878.
