Journal of Frontiers of Computer Science and Technology ›› 2021, Vol. 15 ›› Issue (6): 1049-1061. DOI: 10.3778/j.issn.1673-9418.2007002

• Science Researches •

Spatio-Temporal Correlation Based Adaptive Feature Learning of Tracking Object

GUO Mingzhe, CAI Zixin, WANG Xinyue, JING Liping, YU Jian   

  1. Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing 100044, China
  • Online: 2021-06-01  Published: 2021-06-03


Abstract:

Object tracking remains a challenging problem in computer vision. Its core task is to continuously locate an object in a video sequence and mark its position with bounding boxes. Most existing tracking methods adopt the idea of object detection, separating the video sequence frame by frame and detecting the target in each frame independently. Although this strategy makes full use of the current frame's information, it ignores the spatio-temporal correlation among frames. Yet this spatio-temporal correlation is the key to adapting to changes in the target's appearance and to fully representing the target. To solve this problem, this paper proposes a spatio-temporal Siamese network (STSiam) based on spatio-temporal correlation. STSiam uses spatio-temporal correlation information for target localization and real-time tracking in two stages: object localization and object representation. In the object localization stage, STSiam adaptively captures the features of the target and its surrounding area, and updates the target matching template to ensure that matching is not degraded by appearance changes. In the object representation stage, STSiam attends to the spatial correlation between corresponding regions in different frames; using the localization result, it locks onto the target area and learns bounding box correction parameters so that the bounding box fits the target as closely as possible. The network is trained offline, and no model parameters need to be updated during online tracking, which guarantees real-time speed. Extensive experiments on the visual tracking benchmarks OTB2015, VOT2016, VOT2018 and LaSOT demonstrate that STSiam achieves state-of-the-art performance in terms of accuracy, robustness and speed compared with existing methods.

Key words: spatio-temporal correlation, feature, tracking, object localization, object representation
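The two-stage pipeline described in the abstract — template-based localization with an adaptive template update — can be illustrated with a minimal, hypothetical sketch. This is not the authors' STSiam implementation (which uses learned convolutional features trained offline); it only demonstrates the underlying matching idea with normalized cross-correlation and an exponential moving-average template refresh, both stand-ins chosen for illustration:

```python
import numpy as np

def correlation_response(template, search):
    """Slide the template over the search region and return a map of
    normalized cross-correlation scores; the peak marks the best match."""
    th, tw = template.shape
    sh, sw = search.shape
    out = np.zeros((sh - th + 1, sw - tw + 1))
    t = template - template.mean()
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = search[i:i + th, j:j + tw]
            p = patch - patch.mean()
            denom = np.sqrt((t ** 2).sum() * (p ** 2).sum()) + 1e-8
            out[i, j] = (t * p).sum() / denom
    return out

def update_template(template, new_patch, rate=0.1):
    """Exponential moving-average update: a simple stand-in for the
    adaptive template refresh performed in the localization stage."""
    return (1 - rate) * template + rate * new_patch

# Toy example: a 2x2 bright blob inside a 20x20 search region,
# and a 4x4 template with the same pattern centered in it.
search = np.zeros((20, 20))
search[9:11, 11:13] = 1.0
template = np.zeros((4, 4))
template[1:3, 1:3] = 1.0

resp = correlation_response(template, search)
row, col = np.unravel_index(resp.argmax(), resp.shape)
# (row, col) == (8, 10): top-left corner of the best-matching window
template = update_template(template, search[row:row + 4, col:col + 4])
```

In the real method, the correlation is computed between deep feature maps rather than raw pixels, and a separate representation stage regresses bounding box corrections on top of the located region.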
