Transformer Object Tracking Algorithm Based on Spatio-Temporal Template Update

doi:10.3778/j.issn.1673-9418.2208034

Abstract

Abstract: Mainstream Transformer-based tracking algorithms use the Transformer network only for feature enhancement and feature fusion, overlooking its feature extraction ability, and they lack an effective template update strategy for handling interference such as scale change and deformation during tracking. To address these problems, a Transformer tracking algorithm based on spatio-temporal template updating and bounding box refinement is proposed. First, an improved Swin Transformer is used as the backbone network; self-attention computation and global information modeling are performed with shifted windows to strengthen the backbone's feature extraction ability. Second, a Transformer encoder-decoder structure fuses information from the template region and the search region, using the attention mechanism to establish feature correlations and obtain global semantic information. During tracking, the template is updated dynamically at fixed frame intervals according to the confidence score, to adjust the appearance state of the template. Finally, a bounding box refinement module refines the regression range of the bounding box to improve the accuracy of the algorithm. Comparative experiments against mainstream state-of-the-art algorithms were carried out on several challenging datasets. On the OTB2015 dataset, the success rate and precision reach 70.2% and 91.0%, respectively. On the GOT-10k dataset, the average overlap improves by 0.02 over the baseline algorithm TransT, and on the LaSOT dataset the success rate improves by 0.024 over TransT. The algorithm runs in real time at a tracking speed of 42 FPS.

Key words: object tracking, Transformer network, spatio-temporal template, bounding box refinement
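
The template update rule described in the abstract (check at a fixed frame interval, replace the template only when the confidence score is high enough) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `TemplateUpdater`, the interval of 50 frames, the threshold of 0.5, and the choice to keep a fixed initial template alongside one dynamic template are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TemplateUpdater:
    """Hypothetical helper: a fixed initial template plus one dynamic template."""

    initial_template: Any
    update_interval: int = 50          # assumed value, not taken from the paper
    confidence_threshold: float = 0.5  # assumed value, not taken from the paper
    dynamic_template: Optional[Any] = None

    def step(self, frame_index: int, candidate: Any, confidence: float) -> bool:
        """Return True if the dynamic template was replaced on this frame."""
        # Only consider an update every `update_interval` frames (never on frame 0).
        on_schedule = frame_index > 0 and frame_index % self.update_interval == 0
        # Replace the template only when the tracker is confident in its prediction.
        if on_schedule and confidence > self.confidence_threshold:
            self.dynamic_template = candidate
            return True
        return False


# Usage: with an interval of 3, frames 3 and 6 are update candidates.
updater = TemplateUpdater(initial_template="crop_0", update_interval=3)
scores = [0.9, 0.8, 0.7, 0.4, 0.9, 0.9, 0.95]
updated_frames = [
    i for i, s in enumerate(scores) if updater.step(i, f"crop_{i}", s)
]
```

In this run frame 3 is skipped because its score falls below the threshold, so a low-confidence prediction (for example, during occlusion) does not overwrite the template.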

WANG Qiang, LU Xianling. Transformer Object Tracking Algorithm Based on Spatio-Temporal Template Update[J]. Journal of Frontiers of Computer Science and Technology (计算机科学与探索), 2023, 17(9): 2161-2173.
