Journal of Frontiers of Computer Science and Technology, 2022, Vol. 16, Issue (2): 305-322. DOI: 10.3778/j.issn.1673-9418.2106055
Review of Human Behavior Recognition Research
PEI Lishen1, LIU Shaobo1,+, ZHAO Xuezhuan2 (+ corresponding author, E-mail: 2559113707@qq.com)
Received: 2021-05-19
Revised: 2021-07-26
Online: 2022-02-01
Published: 2021-08-04
About the author: PEI Lishen, born in 1988, Ph.D., lecturer, M.S. supervisor, and member of CCF. Her research interests include action recognition, image processing, computer vision, and machine learning.
Abstract: Action recognition is a significant and popular research topic in computer vision. Its development has progressed from hand-crafted feature representations to deep-learning-based feature representations. This survey categorizes the mainstream algorithms produced along this development path from two perspectives: traditional action recognition models and deep learning models. Traditional models mainly comprise feature description methods based on silhouettes, spatio-temporal interest points, human joint points, and motion trajectories, among which the improved dense trajectories (iDT) approach offers good robustness and reliability. Deep learning architectures mainly include two-stream networks, 3D convolutional networks, and hybrid networks. First, the main research ideas and innovations of each action recognition algorithm are elaborated, and the model architecture, characteristics, and applicable scenarios of each class of algorithms are introduced. Then, widely used public action datasets are reviewed by category, with detailed descriptions of HMDB51 and UCF101, and the recognition performance of traditional and deep learning algorithms on each dataset is compared and analyzed. The comparison shows that traditional methods are unsuitable for recognizing fine-grained actions and do not generalize well across datasets or scenes, while among deep architectures, two-stream networks and 3D convolutional networks achieve comparatively good recognition results and are widely used. Finally, the future of action recognition is discussed and several feasible research directions are pointed out.
PEI Lishen, LIU Shaobo, ZHAO Xuezhuan. Review of Human Behavior Recognition Research[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(2): 305-322.
Method | Advantages | Disadvantages | Application scenarios |
---|---|---|---|
Silhouette[ | 1. Simple key regions 2. Rich information 3. Strong descriptive power | 1. Low flexibility 2. Sensitive to noise and camera angle 3. Much less effective under object occlusion 4. Contour details hard to capture | Simple backgrounds, little occlusion of the body |
Human joint points[ | 1. No human body model needed 2. No large number of pixels required 3. Inexpensive | 1. Sensitive to lighting and camera angle 2. Less effective under object occlusion 3. Relatively complex computation | Small targets, little occlusion of the body |
Spatio-temporal interest points[ | 1. No background subtraction needed 2. Better adaptability to scenes 3. Higher degree of automation | 1. Sensitive to lighting and body occlusion 2. Detectors trade off point density against spatio-temporal complexity | Relatively complex backgrounds |
Motion trajectories[ | 1. Strong robustness 2. Strong representational power 3. Insensitive to background clutter 4. Motion information fully preserved | 1. Computationally expensive classifier training 2. Slow 3. High computational complexity | A wide range of scenarios, few constraints |
Table 1 Comparison of action recognition based on traditional methods
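To make the trajectory row of Table 1 concrete, below is a minimal sketch, assuming OpenCV, NumPy, and a placeholder video path: grid-sampled points are propagated through Farneback dense optical flow for 15 frames (the trajectory length used in dense trajectories). It only illustrates the tracking step; the full DT/iDT pipeline additionally computes HOG/HOF/MBH descriptors along each trajectory and, for iDT, compensates for camera motion.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("demo.avi")  # placeholder input video
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

# Sample tracking points on a coarse regular grid.
h, w = prev_gray.shape
ys, xs = np.mgrid[8:h:16, 8:w:16]
points = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
trajectories = [[p.copy()] for p in points]

for _ in range(15):  # track over L = 15 frames, as in dense trajectories
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    for p, traj in zip(points, trajectories):
        x, y = int(round(p[0])), int(round(p[1]))
        if 0 <= x < w and 0 <= y < h:
            p += flow[y, x]          # move the point along the flow field
            traj.append(p.copy())
    prev_gray = gray

cap.release()
print(len(trajectories), "trajectories of length", len(trajectories[0]))
```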
Model architecture | Advantages | Disadvantages |
---|---|---|
Two-stream architecture[ | 1. Emphasizes spatio-temporal information 2. High accuracy 3. Widely used | 1. Requires very large amounts of input data 2. High hardware demands 3. Streams trained separately, which is time-consuming |
3D CNN architecture[ | 1. Faster 2. Emphasizes motion information | 1. Heavy computation and high hardware requirements 2. Lower accuracy than two-stream networks |
Hybrid architecture[ | 1. Fast and accurate 2. Diverse combinations | 1. Difficult to combine 2. High combination complexity |
Table 2 Comparison of deep learning based behavior recognition algorithms
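As a concrete illustration of the two-stream row in Table 2, here is a minimal PyTorch sketch, with toy CNNs standing in for the deep backbones used in the literature: a spatial stream sees one RGB frame, a temporal stream sees a stack of 10 optical-flow fields (x/y channels per frame), and class scores are fused late by averaging softmax outputs. Real systems pre-compute optical flow offline and train the two streams separately, which is the training-cost drawback noted in the table.

```python
import torch
import torch.nn as nn

def small_cnn(in_channels: int, num_classes: int) -> nn.Sequential:
    """Toy stand-in for a 2D backbone (published systems use VGG-16 etc.)."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, num_classes),
    )

class TwoStream(nn.Module):
    def __init__(self, num_classes: int = 101):
        super().__init__()
        self.spatial = small_cnn(3, num_classes)    # appearance: 1 RGB frame
        self.temporal = small_cnn(20, num_classes)  # motion: 10 flow fields

    def forward(self, rgb, flow_stack):
        # Late fusion: average the per-stream class probabilities.
        p_s = self.spatial(rgb).softmax(dim=1)
        p_t = self.temporal(flow_stack).softmax(dim=1)
        return (p_s + p_t) / 2

model = TwoStream()
rgb = torch.randn(2, 3, 112, 112)    # batch of RGB frames
flow = torch.randn(2, 20, 112, 112)  # stacked optical flow (x/y × 10 frames)
print(model(rgb, flow).shape)        # torch.Size([2, 101])
```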
Method | Key point | Data volume | Research interest | Hardware requirements | Performance |
---|---|---|---|---|---|
Traditional methods | Feature extraction | Small | Medium | Low | Good |
Deep learning | Large-scale data | Large | High | High | Excellent |
Table 3 Comparison of action recognition methods
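The deep-learning row of Table 3 trades hand-crafted features for learned spatio-temporal filters, of which the 3D-CNN row of Table 2 is the most direct example. Below is a minimal C3D-flavoured sketch in PyTorch (a toy network, not any published architecture): 3×3×3 convolutions slide jointly over time and space, so motion cues are learned directly from raw clips, at the computational cost the tables note.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),  # pool space first, keep time
    nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool3d(kernel_size=2),          # now pool time and space together
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(64, 101),
)

clip = torch.randn(2, 3, 16, 112, 112)  # (batch, channels, frames, H, W)
print(model(clip).shape)                # torch.Size([2, 101])
```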
Dataset | Year | Source | Samples | Classes | Avg. duration/s | Example actions/categories |
---|---|---|---|---|---|---|
KTH[ | 2004 | Recorded with volunteers | 600 | 6 | 4.00 | Walking, jogging, running, boxing, hand clapping, hand waving |
UCF-Sports[ | 2008 | Sports TV | 150 | 10 | 6.39 | Sports: diving, golf swing, kicking |
Hollywood2[ | 2009 | Hollywood movies | 3 669 | 12 | 19.00 | Daily life: eating, answering the phone, shaking hands |
HMDB51[ | 2011 | Internet, movies | 6 849 | 51 | 2.00~5.00 | General facial actions, facial actions with object manipulation, general body movements, body movements with object interaction, body movements for human interaction |
UCF101[ | 2012 | Video websites | 13 320 | 101 | 5.00 | Human-object interaction, body motion, human-human interaction, playing musical instruments, sports |
Sports-1M[ | 2015 | Video websites | 1 133 158 | 487 | 336.00 | Aquatic sports, team sports, winter sports, ball sports, combat sports, sports with animals |
ActivityNet200 | 2016 | Video websites | 19 994 | 200 | 109.00 | Daily life: long jump, walking the dog, mopping the floor |
Kinetics[ | 2017 | Video websites | 306 245 | 400 | 10.00 | Playing instruments; daily life: shaking hands |
Epic-Kitchens[ | 2018 | Lab recording | 432 | 149 | 10.00 | Kitchen routines: cooking, cleaning, preparing food, washing |
AVA[ | 2018 | Movies | 57 600 | 80 | 3.00 | Daily life: walking, kicking, shaking hands |
COIN[ | 2019 | Video websites | 11 827 | 180 | 14.19 | Daily life: attaching hair extensions, shaving, ironing, drawing blood |
HACS[ | 2019 | Video websites | 504 000 | 200 | 2.00 | Sports: rope skipping, pole vault, shoveling snow |
AVA-Kinetics[ | 2020 | Video websites | 57 600 | 80 | 3.00 | Daily life: hugging, drinking |
Table 4 Comparison of behavior recognition datasets
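For experimenting with the benchmarks in Table 4, UCF101 can be loaded through torchvision. The sketch below assumes the videos and the official train/test split files have already been downloaded locally (both paths are placeholders), and the exact keyword arguments and return format may vary across torchvision versions.

```python
import torchvision

# torchvision's UCF101 wrapper yields (video, audio, label) tuples;
# the video is typically a (T, H, W, C) uint8 tensor and decoding
# requires a video backend such as PyAV to be installed.
dataset = torchvision.datasets.UCF101(
    root="UCF-101/",                      # placeholder: extracted .avi files
    annotation_path="ucfTrainTestlist/",  # placeholder: official split files
    frames_per_clip=16,
    step_between_clips=16,
    fold=1,
    train=True,
)
video, audio, label = dataset[0]
print(video.shape, label)
```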
Reference | Input | Mechanism | Framework/features | Hollywood2/% | Olympic Sports*/% | Sports-1M/% | Strengths | Limitations | Applicable scenarios |
---|---|---|---|---|---|---|---|---|---|
T:DT[ | RGB, OF | Trajectories | HOG, HOF, MBH | 58.3 | — | — | Strong representation | Camera motion | Video surveillance |
T:IDT[ | | | HOG, HOF, MBH | 64.3 | 91.1 | — | Stable, reliable | Time-consuming | Athlete training |
T:IDT[ | | | MBH, SIHFMFCC | 63.5 | 82.1 | — | Strong robustness | Environmental effects | Athlete training |
T:MIFS[ | | | HOG, HOF, PCA | 68.0 | 91.4 | — | Low computational cost | Lower tracking quality | Sports events |
T:Video-Darwin[ | RGB | Temporal pooling | HOF, MBH Comb | 73.7 | — | — | Simple, fast | Poor generalizability across machines | Kitchen routines |
D:ELS Fusion[ | RGB | CNN | Slow fusion | — | — | 60.9 | Fast, general-purpose | Affected by camera motion | Ball-sport recognition |
D:HRP+RP | | | VGG-16 | 74.1 | — | — | High-capacity encoding | Many parameters | Daily surveillance |
D:LSTM[ | RGB, OF | CNN-LSTM | Two-stream 3D ConvNet | — | — | 73.1 | Long videos, low computation | Complex pipeline | Sports recognition |
Table 5 Performance comparison of different algorithms (T: traditional method; D: deep learning method; OF: optical flow)
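The "D:LSTM" row above combines a frame-level CNN with a recurrent model. Here is a minimal PyTorch sketch of that wiring, with a toy CNN in place of the deep backbones (e.g., AlexNet/GoogLeNet) used in the cited work: per-frame features feed an LSTM, and the final hidden state is classified.

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    def __init__(self, num_classes: int = 101, feat_dim: int = 64):
        super().__init__()
        # Toy per-frame feature extractor.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, 128, batch_first=True)
        self.head = nn.Linear(128, num_classes)

    def forward(self, clip):                  # clip: (B, T, C, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1))  # (B*T, feat_dim)
        feats = feats.view(b, t, -1)
        out, _ = self.lstm(feats)             # (B, T, 128)
        return self.head(out[:, -1])          # classify the last time step

print(CNNLSTM()(torch.randn(2, 8, 3, 64, 64)).shape)  # torch.Size([2, 101])
```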
Reference | Input | Mechanism | Framework/features | HMDB51/% | UCF101/% | Strengths | Limitations | Applicable scenarios |
---|---|---|---|---|---|---|---|---|
T:Part Model[ | RGB, OF | Multiscale local model | GBH, PCA | 61.0 | 86.6 | Good real-time computation | Complex | Video surveillance |
T+D:TDD[ | | Trajectories, two-stream | TDD+IDT | 65.9 | 91.5 | Avoids hand-crafted features | Limited recognition accuracy | Human interaction |
 | | | TDD | 63.2 | 90.3 | | | |
D:Segmentation[ | RGB, OF | Two-stream | BN-Inception; Inception-v3 | 71.7; 72.3 | 95.2; 95.5 | Better network generalization | Recognition speed | Daily life |
D:Attention-ConvLSTM[ | RGB, OF | | VGG-16 | 69.8 | 94.6 | Strengthens inter-frame dependencies | Many parameters | Sports scenes |
D:Fusion[ | RGB, OF | | VGG-16, DT | 69.2 | 93.5 | Spatio-temporal fusion | Computational complexity | Daily life |
D:TSN[ | RGB, OF, WF | | BN-Inception | 69.4 | 94.2 | Learns from the whole video efficiently | Complex initialization | Long-shot scenes |
D:Sparse+TSN[ | RGB, OF, sparse | | Inception | 76.4 | 96.9 | Better feature utilization | Weak feature interaction | Sports |
D:I3D[ | RGB, OF | 3D-CNN | BN-Inception | 80.7 | 98.0 | Larger receptive field | Affected by camera motion | Daily surveillance |
D:LTC[ | | | depth-d | 67.2 | 92.7 | Long-sequence modeling | Time complexity | Sports |
D:P3D+IDT[ | RGB, OF | 2D+1D CNN | ResNet | — | 93.7 | Fewer parameters | Recognition performance | Sports scenes |
D:R(2+1)D[ | RGB | Mixed convolution | | 78.7 | 97.3 | Parameters easy to optimize | Spatial resolution | Daily life |
D:CNN+LSTM[ | RGB, OF | CNN+LSTM | AlexNet, GoogLeNet | — | 88.6 | Reduces noise impact | Complex parameters | Sports |
Table 6 Performance comparison of different algorithms on HMDB51 and UCF101 (T: traditional method; D: deep learning method; OF: optical flow)
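The "D:TSN" rows above rely on sparse segment sampling: the video is divided into K segments, one snippet is drawn from each, and per-snippet predictions are merged by a segmental consensus. Below is a minimal sketch of that sampling-and-averaging idea, assuming PyTorch and any per-frame 2D model (the linear `backbone` here is a stand-in; the published TSN uses BN-Inception over RGB and optical flow).

```python
import torch
import torch.nn as nn

def tsn_scores(backbone: nn.Module, video: torch.Tensor, k: int = 3) -> torch.Tensor:
    """video: (T, C, H, W) decoded frames with T >= k; returns averaged scores."""
    t = video.shape[0]
    bounds = torch.linspace(0, t, k + 1).long().tolist()
    snippets = []
    for i in range(k):
        # Draw one random frame from the i-th segment (TSN's sparse sampling).
        idx = torch.randint(bounds[i], bounds[i + 1], (1,)).item()
        snippets.append(video[idx])
    scores = backbone(torch.stack(snippets))  # (k, num_classes)
    return scores.mean(dim=0)                 # segmental consensus by averaging

backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 101))
print(tsn_scores(backbone, torch.randn(24, 3, 64, 64)).shape)  # torch.Size([101])
```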
References
[1] TAKANO W, NAKAMURA Y. Statistical mutual conversion between whole body motion primitives and linguistic sentences for human motions[J]. The International Journal of Robotics Research, 2015, 34(10): 1314-1328.
[2] ZHANG W H, SMITH M L, SMITH L N, et al. Gender and gaze gesture recognition for human-computer interaction[J]. Computer Vision and Image Understanding, 2016, 149: 32-50.
[3] WANG X G. Intelligent multi-camera video surveillance: a review[J]. Pattern Recognition Letters, 2013, 34(1): 3-19.
[4] CAMPORESI C, KALLMANN M, HAN J J, et al. VR solutions for improving physical therapy[C]//Proceedings of the 2013 IEEE Virtual Reality, Lake Buena Vista, Mar 18-20, 2013. Washington: IEEE Computer Society, 2013: 77-78.
[5] WANG L M, XIONG Y J, WANG Z, et al. Temporal segment networks: towards good practices for deep action recognition[C]//LNCS 9912: Proceedings of the 14th European Conference on Computer Vision, Amsterdam, Oct 11-14, 2016. Cham: Springer, 2016: 20-36.
[6] ZHOU B L, ANDONIAN A, OLIVA A, et al. Temporal relational reasoning in videos[C]//LNCS 11205: Proceedings of the 15th European Conference on Computer Vision, Munich, Sep 8-14, 2018. Cham: Springer, 2018: 831-846.
[7] FEICHTENHOFER C, FAN H Q, MALIK J, et al. SlowFast networks for video recognition[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway: IEEE, 2019: 6201-6210.
[8] TRAN D, BOURDEV L D, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Dec 7-13, 2015. Washington: IEEE Computer Society, 2015: 4489-4497.
[9] QIU Z F, YAO T, MEI T. Learning spatio-temporal representation with pseudo-3D residual networks[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Oct 22-29, 2017. Washington: IEEE Computer Society, 2017: 5533-5541.
[10] DONAHUE J, HENDRICKS L A, ROHRBACH M, et al. Long-term recurrent convolutional networks for visual recognition and description[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, Jun 7-12, 2015. Washington: IEEE Computer Society, 2015: 2625-2634.
[11] LI Z Y, GAVRILYUK K, GAVVES E, et al. VideoLSTM convolves, attends and flows for action recognition[J]. Computer Vision and Image Understanding, 2018, 166: 41-50.
[12] BOBICK A F, DAVIS J W. The recognition of human movement using temporal templates[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001, 23(3): 257-267.
[13] YILMAZ A, SHAH M. Actions sketch: a novel action representation[C]//Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, Jun 20-26, 2005. Washington: IEEE Computer Society, 2005: 984-989.
[14] MATIKAINEN P, HEBERT M, SUKTHANKAR R. Trajectons: action recognition through the motion analysis of tracked features[C]//Proceedings of the 12th IEEE International Conference on Computer Vision Workshops, Kyoto, Sep 27-Oct 4, 2009. Washington: IEEE Computer Society, 2009: 514-521.
[15] FUJIYOSHI H, LIPTON A J, KANADE T. Real-time human motion analysis by image skeletonization[J]. IEICE Transactions on Information and Systems, 2004, E87-D(1): 113-120.
[16] YANG X D, TIAN Y L. Effective 3D action recognition using eigenjoints[J]. Journal of Visual Communication and Image Representation, 2014, 25(1): 2-11.
[17] ZHANG H X, YE Y S, CAI X Z, et al. Efficient algorithm of action recognition based on joint points[J]. Computer Engineering and Design, 2020, 41(11): 3168-3174.
[18] LAPTEV I. On space-time interest points[J]. International Journal of Computer Vision, 2005, 64(2/3): 107-123.
[19] DOLLAR P, RABAUD V, COTTRELL G W, et al. Behavior recognition via sparse spatio-temporal features[C]//Proceedings of the 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, Beijing, Oct 15-16, 2005. Piscataway: IEEE, 2005: 65-72.
[20] WANG H, ULLAH M M, KLÄSER A, et al. Evaluation of local spatio-temporal features for action recognition[C]//Proceedings of the British Machine Vision Conference, London, Sep 7-10, 2009. Durham: BMVA Press, 2009: 1-11.
[21] WILLEMS G, TUYTELAARS T, VAN GOOL L. An efficient dense and scale-invariant spatio-temporal interest point detector[C]//LNCS 5303: Proceedings of the 10th European Conference on Computer Vision, Marseille, Oct 12-18, 2008. Berlin, Heidelberg: Springer, 2008: 650-663.
[22] CHEN Y, HU R, LI S J, et al. Recognition algorithm for human behavior in video based on combined features and SVM[J]. Journal of Shenyang University of Technology, 2020, 42(6): 665-669.
[23] WANG H, KLÄSER A, SCHMID C, et al. Action recognition by dense trajectories[C]//Proceedings of the 24th IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, Jun 20-25, 2011. Washington: IEEE Computer Society, 2011: 3169-3176.
[24] WANG H, SCHMID C. Action recognition with improved trajectories[C]//Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, Dec 1-8, 2013. Washington: IEEE Computer Society, 2014: 3551-3558.
[25] LI Y X, XIE L B. Human action recognition based on depth motion map and dense trajectory[J]. Computer Engineering and Applications, 2020, 56(3): 194-200.
[26] SIMONYAN K, ZISSERMAN A. Two-stream convolutional networks for action recognition in videos[C]//Proceedings of the Annual Conference on Neural Information Processing Systems, Montreal, Dec 8-13, 2014. Red Hook: Curran Associates, 2014: 568-576.
[27] ZHOU Y, CHEN S R. Behavior recognition method based on two-stream non-local residual network[J]. Journal of Computer Applications, 2020, 40(8): 2236-2240.
[28] WANG Z Q, ZHANG W Q, ZHANG L. Human behavior recognition with high-order attention mechanism[J]. Journal of Signal Processing, 2020, 36(8): 1272-1279.
[29] FEICHTENHOFER C, PINZ A, ZISSERMAN A. Convolutional two-stream network fusion for video action recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun 27-30, 2016. Washington: IEEE Computer Society, 2016: 1933-1941.
[30] FEICHTENHOFER C, PINZ A, WILDES R P. Spatiotemporal residual networks for video action recognition[C]//Proceedings of the Annual Conference on Neural Information Processing Systems, Barcelona, Dec 5-10, 2016. Red Hook: Curran Associates, 2017: 3468-3476.
[31] PAN N, JIANG M, KONG J. Human action recognition algorithm based on spatial-temporal interactive attention model[J]. Laser & Optoelectronics Progress, 2020, 57(18): 317-325.
[32] WANG L L, GE L Z, LI R F, et al. Three-stream CNNs for action recognition[J]. Pattern Recognition Letters, 2017, 92: 33-40.
[33] BILEN H, FERNANDO B, GAVVES E, et al. Action recognition with dynamic image networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(12): 2799-2813.
[34] BACCOUCHE M, MAMALET F, WOLF C, et al. Sequential deep learning for human action recognition[C]//LNCS 7065: Proceedings of the 2nd International Workshop on Human Behavior Understanding, Amsterdam, Nov 16, 2011. Berlin, Heidelberg: Springer, 2011: 29-39.
[35] JI S, XU W, YANG M, et al. 3D convolutional neural networks for human action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(1): 221-231.
[36] SUN L, JIA K, YEUNG D Y, et al. Human action recognition using factorized spatio-temporal convolutional networks[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Dec 7-13, 2015. Washington: IEEE Computer Society, 2015: 4597-4605.
[37] LI L H, WANG Y X. Efficient 3D dense residual network and its application in human action recognition[J]. Opto-Electronic Engineering, 2020, 47(2): 23-33.
[38] CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the Kinetics dataset[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Washington: IEEE Computer Society, 2017: 4724-4733.
[39] DIBA A, FAYYAZ M, SHARMA V, et al. Temporal 3D ConvNets: new architecture and transfer learning for video classification[J]. arXiv:1711.08200, 2017.
[40] VAROL G, LAPTEV I, SCHMID C. Long-term temporal convolutions for action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(6): 1510-1517.
[41] ZHANG X J, LI C Z, SUN L Y, et al. Behavior recognition method based on improved 3D convolutional neural network[J]. Computer Integrated Manufacturing Systems, 2019, 25(8): 2000-2006.
[42] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[43] KARPATHY A, TODERICI G, SHETTY S, et al. Large-scale video classification with convolutional neural networks[C]//Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, Jun 23-28, 2014. Washington: IEEE Computer Society, 2014: 1725-1732.
[44] QI D J, DU H M, ZHANG X, et al. Behavior recognition algorithm based on context feature fusion[J]. Computer Engineering and Applications, 2020, 56(2): 171-175.
[45] NG J Y H, HAUSKNECHT M J, VIJAYANARASIMHAN S, et al. Beyond short snippets: deep networks for video classification[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, Jun 7-12, 2015. Washington: IEEE Computer Society, 2015: 4694-4702.
[46] MA C H, WANG Y, MAO Z Q. Action recognition of two-stream CNN based on attention[J]. Computer Engineering and Design, 2020, 41(10): 2903-2906.
[47] JIE Z H, ZENG M R, ZHOU X H, et al. Two-stream CNN with Attention-ConvLSTM for human behavior recognition[J]. Journal of Chinese Computer Systems, 2021, 42(2): 405-408.
[48] KIPF T N, WELLING M. Semi-supervised classification with graph convolutional networks[C]//Proceedings of the 5th International Conference on Learning Representations, Toulon, Apr 24-26, 2017: 1-14.
[49] WANG P C, LI Z Y, HOU Y H, et al. Action recognition based on joint trajectory maps using convolutional neural networks[C]//Proceedings of the 2016 ACM Conference on Multimedia, Amsterdam, Oct 15-19, 2016. New York: ACM, 2016: 102-106.
[50] SHAO Z P, LI Y F, GUO Y, et al. A hierarchical model for human action recognition from body-parts[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2019, 29(10): 2986-3000.
[51] HINTON G. A practical guide to training restricted Boltzmann machines[J]. Momentum, 2010, 9(1): 926-947.
[52] TAYLOR G W, FERGUS R, LECUN Y, et al. Convolutional learning of spatio-temporal features[C]//LNCS 6316: Proceedings of the 11th European Conference on Computer Vision, Heraklion, Sep 5-11, 2010. Berlin, Heidelberg: Springer, 2010: 140-153.
[53] TRAN S N, BENETOS E, D'AVILA G A S. Learning motion-difference features using Gaussian restricted Boltzmann machines for efficient human action recognition[C]//Proceedings of the 2014 International Joint Conference on Neural Networks, Beijing, Jul 6-11, 2014. Piscataway: IEEE, 2014: 2123-2129.
[54] WANG X L, GIRSHICK R B, GUPTA A, et al. Non-local neural networks[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Washington: IEEE Computer Society, 2018: 7794-7803.
[55] CHI L, TIAN G Y, MU Y D, et al. Fast non-local neural networks with spectral residual learning[C]//Proceedings of the 27th ACM International Conference on Multimedia, Nice, Oct 21-25, 2019. New York: ACM, 2019: 2142-2151.
[56] YE D, LI Z, WANG Y J. Research on behavior identification based on SPLDA dimensional reduction algorithm and XGBoost classifier[J]. Microelectronics & Computer, 2019, 36(6): 35-39.
[57] SCHÜLDT C, LAPTEV I, CAPUTO B. Recognizing human actions: a local SVM approach[C]//Proceedings of the 17th International Conference on Pattern Recognition, Cambridge, Aug 23-26, 2004. Washington: IEEE Computer Society, 2004: 23-26.
[58] MARSZALEK M, LAPTEV I, SCHMID C. Actions in context[C]//Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Miami, Jun 20-25, 2009. Washington: IEEE Computer Society, 2009: 2929-2936.
[59] RODRIGUEZ M D, AHMED J, SHAH M. Action MACH: a spatio-temporal maximum average correlation height filter for action recognition[C]//Proceedings of the 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Anchorage, Jun 24-26, 2008. Washington: IEEE Computer Society, 2008: 1-8.
[60] SOOMRO K, ZAMIR A R, SHAH M. UCF101: a dataset of 101 human actions classes from videos in the wild[J]. arXiv:1212.0402, 2012.
[61] NIEBLES J C, CHEN C W, LI F F. Modeling temporal structure of decomposable motion segments for activity classification[C]//LNCS 6312: Proceedings of the 11th European Conference on Computer Vision, Heraklion, Sep 5-11, 2010. Berlin, Heidelberg: Springer, 2010: 392-405.
[62] KUEHNE H, JHUANG H, GARROTE E, et al. HMDB: a large video database for human motion recognition[C]//Proceedings of the 2011 IEEE International Conference on Computer Vision, Barcelona, Nov 6-13, 2011. Washington: IEEE Computer Society, 2011: 2556-2563.
[63] DAMEN D, DOUGHTY H, FARINELLA G M, et al. The EPIC-Kitchens dataset: collection, challenges and baselines[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(11): 4125-4141.
[64] KAY W, CARREIRA J, SIMONYAN K, et al. The Kinetics human action video dataset[J]. arXiv:1705.06950, 2017.
[65] GU C H, SUN C, ROSS D A, et al. AVA: a video dataset of spatio-temporally localized atomic visual actions[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Washington: IEEE Computer Society, 2018: 6047-6056.
[66] TANG Y S, DING D J, RAO Y M, et al. COIN: a large-scale dataset for comprehensive instructional video analysis[C]//Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, Jun 16-20, 2019. Piscataway: IEEE, 2019: 1207-1216.
[67] ZHAO H, TORRALBA A, TORRESANI L, et al. HACS: human action clips and segments dataset for recognition and temporal localization[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway: IEEE, 2019: 8667-8677.
[68] LI A, THOTAKURI M, ROSS D A, et al. The AVA-Kinetics localized human actions video dataset[J]. arXiv:2005.00214, 2020.
[69] SHAO D, ZHAO Y, DAI B, et al. FineGym: a hierarchical video dataset for fine-grained action understanding[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, Jun 13-19, 2020. Piscataway: IEEE, 2020: 2613-2622.
[70] ONEATA D, VERBEEK J J, SCHMID C. Action and event recognition with Fisher vectors on a compact feature set[C]//Proceedings of the 2013 International Conference on Computer Vision, Sydney, Dec 1-8, 2013. Washington: IEEE Computer Society, 2013: 1817-1824.
[71] LAN Z Z, LIN M, LI X C, et al. Beyond Gaussian pyramid: multi-skip feature stacking for action recognition[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, Jun 7-12, 2015. Washington: IEEE Computer Society, 2015: 204-212.
[72] FERNANDO B, GAVVES E, ORAMAS M J, et al. Modeling video evolution for action recognition[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, Jun 7-12, 2015. Washington: IEEE Computer Society, 2015: 5378-5387.
[73] SHI F, LAGANIÈRE R, PETRIU E M. Local part model for action recognition[J]. Image and Vision Computing, 2016, 16(11): 18-28.
[74] WANG L M, QIAO Y, TANG X O. Action recognition with trajectory-pooled deep-convolutional descriptors[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, Jun 7-12, 2015. Washington: IEEE Computer Society, 2015: 4305-4314.
[75] CHEN L, LIU Y G, MAN Y C. Two-stream CNN based on segmentation for action recognition[C]//Proceedings of the 39th Chinese Control Conference, Liaoning, Jul 27-29, 2020. Piscataway: IEEE, 2020: 1160-1165.
[76] LI H J, DING Y P, LI C B, et al. Action recognition of temporal segment network based on feature fusion[J]. Journal of Computer Research and Development, 2020, 57(1): 145-158.
[77] TRAN D, WANG H, TORRESANI L, et al. A closer look at spatiotemporal convolutions for action recognition[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Washington: IEEE Computer Society, 2018: 1010-1019.