Journal of Frontiers of Computer Science and Technology (计算机科学与探索), 2021, Vol. 15, Issue (9): 1563-1577. DOI: 10.3778/j.issn.1673-9418.2103107

• Surveys and Frontiers •

Survey of Video Object Detection Based on Deep Learning
(Original title: 基于深度学习的视频目标检测综述)

WANG Dicong (王迪聪), BAI Chenshuai (白晨帅), WU Kaijun (邬开俊)

  1. School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
  2. College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
  • Online: 2021-09-01 Published: 2021-09-06

Abstract:

Video object detection addresses the problem of localizing and recognizing the objects that appear in every frame of a video. Compared with image object detection, video is highly redundant and contains a large amount of local spatio-temporal information. Deep convolutional neural networks have spread rapidly through static-image object detection, where they show a substantial performance advantage over traditional methods, and they are increasingly being applied to video-based object detection as well. However, existing video object detection algorithms still face several key technical challenges: improving and optimizing the performance of mainstream object detection algorithms, maintaining the spatio-temporal consistency of video sequences, and making detection models lightweight. In response to these problems and challenges, this paper surveys a large body of literature and systematically summarizes deep-learning-based video object detection algorithms. The algorithms are classified according to the basic method they build on, such as optical flow or detection, and are examined in detail from the perspectives of backbone network, algorithm structure, and datasets. Drawing on experimental results on datasets such as ImageNet VID, the paper analyzes the performance strengths and weaknesses of representative algorithms in the field and the relationships among them. It then discusses the open problems in video object detection and the outlook for future research. Video object detection has become a hot topic among computer vision researchers, and more efficient and more accurate algorithms can be expected to follow.

Key words: deep learning, video object detection, optical flow, lightweight