Journal of Frontiers of Computer Science and Technology ›› 2021, Vol. 15 ›› Issue (3): 438-455. DOI: 10.3778/j.issn.1673-9418.2009095

• Surveys and Explorations •

Review of Human Action Recognition Based on Deep Learning

QIAN Huifang, YI Jianping, FU Yunhu   

  1. School of Electronics and Information, Xi'an Polytechnic University, Xi'an 710048, China
  • Online: 2021-03-01 Published: 2021-03-05

Abstract:

Human action recognition is an important topic in video understanding, with wide application in video surveillance, human-computer interaction, motion analysis, and video information retrieval. Organized by the characteristics of the backbone network, this paper reviews recent research in action recognition from three perspectives: 2D convolutional neural networks, 3D convolutional neural networks, and spatiotemporal decomposition networks, and qualitatively analyzes and compares the advantages and disadvantages of the three classes of methods. It then comprehensively summarizes the commonly used action video datasets, grouped as scene-related and temporal-related, with a focus on the characteristics and usage of each dataset. Next, it introduces the pre-training strategies commonly used in action recognition and analyzes how pre-training affects model performance. Finally, drawing on the latest research trends, it discusses future directions for action recognition from six perspectives: fine-grained action recognition, more streamlined models, few-shot learning, unsupervised learning, adaptive networks, and video super-resolution action recognition.

Key words: human action recognition, 2D convolutional neural network (2D CNN), 3D convolutional neural network (3D CNN), spatiotemporal decomposition network, pre-training