Survey of Video Object Detection Based on Deep Learning

doi:10.3778/j.issn.1673-9418.2103107

Abstract

Video object detection aims to localize and recognize the objects that appear in every frame of a video. Compared with still images, video is highly redundant and contains a large amount of local spatio-temporal information. Deep convolutional neural networks have spread rapidly through static-image object detection, where they clearly outperform traditional methods, and they are increasingly applied to video-based object detection as well. Existing video object detection algorithms nevertheless still face key challenges: improving and optimizing the performance of mainstream object detection algorithms, maintaining the spatio-temporal consistency of video sequences, and making detection models lightweight. To address these problems and challenges, this paper draws on a large body of literature to systematically summarize deep-learning-based video object detection algorithms. The algorithms are classified according to the basic method they build on, such as optical flow or detection, and are examined in detail from the perspectives of backbone network, algorithm structure, and datasets. Drawing on experimental results on ImageNet VID and other datasets, the paper analyzes the strengths and weaknesses of representative algorithms in the field and the relations among them. It then discusses the open problems in video object detection and the outlook for future research. Video object detection has become a hot topic among computer vision researchers, and more efficient and more accurate algorithms can be expected to follow.

Key words: deep learning, video object detection, optical flow, lightweight

WANG Dicong, BAI Chenshuai, WU Kaijun. Survey of Video Object Detection Based on Deep Learning[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(9): 1563-1577.

王迪聪, 白晨帅, 邬开俊. 基于深度学习的视频目标检测综述[J]. 计算机科学与探索, 2021, 15(9): 1563-1577.
