Review of Human Behavior Recognition Research

doi:10.3778/j.issn.1673-9418.2106055

Journal of Frontiers of Computer Science and Technology ›› 2022, Vol. 16 ›› Issue (2): 305-322.DOI: 10.3778/j.issn.1673-9418.2106055

• Surveys and Frontiers • Previous Articles Next Articles

Review of Human Behavior Recognition Research

PEI Lishen¹, LIU Shaobo¹^,⁺(), ZHAO Xuezhuan²

1. School of Computer and Information Engineering, Henan University of Economics and Law, Zhengzhou 450046, China
2. School of Intelligent Engineering, Zhengzhou University of Aeronautics, Zhengzhou 450046, China

Received:2021-05-19 Revised:2021-07-26 Online:2022-02-01 Published:2021-08-04
About author:PEI Lishen, born in 1988, Ph.D., lecturer, M.S. supervisor, member of CCF. Her research interests include action recognition, image processing, computer vision, machine learning, etc.
LIU Shaobo, born in 1999. His research interest is human action recognition.
ZHAO Xuezhuan, born in 1986, Ph.D., lecturer, member of CCF. His research interests include object detection, object tracking, saliency detection, abnormal behavior detection, computer vision, etc.
Supported by:
National Natural Science Foundation of China(61806073);Key Research & Development and Promotion Project of Henan Province(192102210097);Key Research & Development and Promotion Project of Henan Province(192102210126);Key Research & Development and Promotion Project of Henan Province(212102210160)

人体行为识别研究综述

裴利沈¹, 刘少博¹^,⁺(), 赵雪专²

1.河南财经政法大学计算机与信息工程学院,郑州 450046
2.郑州航空工业管理学院智能工程学院,郑州 450046

通讯作者: + E-mail: 2559113707@qq.com
作者简介:裴利沈（1988—）,女,河南郑州人,博士,讲师,硕士生导师,CCF会员,主要研究方向为行为识别、图像处理、计算机视觉、机器学习等。
刘少博（1999—）,男,河南郑州人,主要研究方向为人体行为识别。
赵雪专（1986—）,男,河南郑州人,博士,讲师,CCF会员,主要研究方向为目标检测、目标跟踪、显著性检测、异常行为检测、计算机视觉等。
基金资助:
国家自然科学基金(61806073);河南省重点研发与推广专项（科技攻关）基金(192102210097);河南省重点研发与推广专项（科技攻关）基金(192102210126);河南省重点研发与推广专项（科技攻关）基金(212102210160)

Abstract

Abstract:

Behavior recognition is a hot topic in the field of computer vision. It has experienced the development process from manual design feature representation to deep learning feature expression. This paper classifies the mainstream algorithms in the development of behavior recognition from two aspects of traditional behavior recognition models and deep learning models. The traditional behavior recognition models mainly include feature description methods based on silhouette, space-time interest points, human joint point and trajectories. Among them, the improved dense trajectory method has good robustness and reliability. Deep learning network architecture mainly includes two-stream network, 3D convolution network and hybrid network. Firstly, this paper focuses on the main research ideas and innovations of each behavior recognition algorithm, and introducees the model architecture, algorithm features, application scenarios of each kind of algorithm. Then, the widely used public behavior databases are classified, and the HMDB51 and UCF101 datasets are introduced in detail. The recognition effects of traditional methods and deep learning algorithms on each dataset are compared and analyzed. Through comparative analysis, the traditional methods are not suitable for high-precision behavior recognition, and it is not easy to achieve cross database or cross scene promotion. In depth architecture, two-stream network and 3D convolution network have achieved good behavior recognition effect and are widely used. Finally, the future development of behavior recognition is prospected, and some feasible research directions in the future are pointed out.

Key words: human behavior recognition, deep learning, neural network, behavior dataset

摘要：

行为识别是计算机视觉领域意义重大的热点研究问题,它经历了从手工设计特征表征到深度学习特征表达的发展过程。从传统行为识别模型和深度学习模型两方面,对行为识别发展历程中产生的主流算法进行了归类梳理。传统行为识别模型主要包括基于轮廓剪影、时空兴趣点、人体关节点、运动轨迹的特征描述方法。其中改进的密集轨迹方式拥有良好的鲁棒性和可靠性;深度学习网络架构主要有双流网络、3D卷积网络和混合网络。首先,重点阐述了各行为识别算法的主要研究思路与创新点,并介绍了每类算法的模型架构、算法特色、适用情境等。然后,对广泛使用的公共行为数据库进行了分类阐述,着重对HMDB51和UCF101数据集进行了详细介绍,比较分析了传统方法和深度学习算法在各数据集上的识别效果。通过对比分析发现,传统方法不适用于高精细行为的识别,且不易实现跨数据库或跨场景的推广;深度架构中,双流网络和3D卷积网络获得了比较好的行为识别效果且被广泛使用。最后,对行为识别的未来发展进行了展望,指出了若干将来可行的研究方向。

关键词: 人体行为识别, 深度学习, 神经网络, 行为数据集

CLC Number:

TP391

PEI Lishen, LIU Shaobo, ZHAO Xuezhuan. Review of Human Behavior Recognition Research[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(2): 305-322.

裴利沈, 刘少博, 赵雪专. 人体行为识别研究综述[J]. 计算机科学与探索, 2022, 16(2): 305-322.

Figures/Tables 15

References 77

[1]	TAKANO W, NAKAMURA Y. Statistical mutual conversion between whole body motion primitives and linguistic sentences for human motions[J]. The International Journal of Robotics Research, 2015, 34(10):1314-1328. DOI URL
[2]	ZHANG W H, SMITH M L, SMITH L N, et al. Gender and gaze gesture recognition for human-computer interaction[J]. Computer Vision and Image Understanding, 2016, 149:32-50. DOI URL
[3]	WANG X G. Intelligent multi-camera video surveillance: a review[J]. Pattern Recognition Letters, 2013, 34(1):3-19. DOI URL
[4]	CAMPORESI C, KALLMANN M, HAN J J, et al. VR solutions for improving physical therapy[C]//Proceedings of the 2013 IEEE Virtual Reality, Lake Buena Vista, Mar 18-20, 2013. Washington: IEEE Computer Society, 2013: 77-78.
[5]	WANG L M, XIONG Y J, WANG Z, et al. Temporal segment networks: towards good practices for deep action recognition[C]//LNCS 9912: Proceedings of the 14th European Conference on Computer Vision, Amsterdam, Oct 11-14, 2016. Cham: Springer, 2016: 20-36.
[6]	ZHOU B L, ANDONIAN A, Oliva A, et al. Temporal relational reasoning in videos[C]//LNCS 11205: Proceedings of the 15th European Conference on Computer Vision, Munich, Sep 8-14, 2018. Cham: Springer, 2018: 831-846.
[7]	FEICHTENHORFER C, FAN H Q, MALIK J, et al. SlowFast networks for video recognition[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway: IEEE, 2019: 6201-6210.
[8]	TRAN D, BOURDEV L D, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Dec 7-12, 2015. Washington: IEEE Computer Society, 2015: 4489-4497.
[9]	QIU Z F, YAO T, TAO M. Learning spatio-temporal representation with pseudo-3D residual networks[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Oct 22-29, 2017. Washington: IEEE Computer Society, 2017: 5533-5541.
[10]	DONAHUE J, HENDRICKS L A, ROHRBACH M, et al. Long-term recurrent convolutional networks for visual recognition and description[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, Jun 7-12, 2015. Washington: IEEE Computer Society, 2015: 2625-2634.
[11]	LI Z Y, GAVRILYUK K, GAVVES E, et al. VideoLSTM convolves, attends and flows for action recognition[J]. Computer Vision and Image Understanding, 2018, 166:41-50. DOI URL
[12]	BOBICK A F, DAVIS J W. The recognition of human movement using temporal templates[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001, 23(3):257-267. DOI URL
[13]	YILMAZ A, SHAH M. Actions sketch: a novel action representation[C]//Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, Jun 20-26, 2005. Washington: IEEE Computer Society, 2005: 984-989.
[14]	MATIKAINEN P, HEBERT M, SUKTHANKAR R. Trajectons: action recognition through the motion analysis of tracked features[C]//Proceedings of the 12th IEEE International Conference on Computer Vision Workshops, Kyoto, Sep 27-Oct 4, 2009. Washington: IEEE Computer Society, 2009: 514-521.
[15]	FUJIYOSHI H, LIPTON A J, KANADE T. Real-time human motion analysis by image skeletonization[J]. IEICE Transactions on Information and Systems, 2004, E87-D(1):113-120.
[16]	YANG X D, TIAN Y L. Effective 3D action recognition using eigenjoints[J]. Journal of Visual Communication and Image Representation, 2014, 25(1):2-11. DOI URL
[17]	张恒鑫, 叶颖诗, 蔡贤资, 等. 基于人体关节点的高效动作识别算法[J]. 计算机工程与设计, 2020, 41(11):3168-3174.
	ZHANG H X, YE Y S, CAI X Z, et al. Efficient algorithm of action recognition based on joint points[J]. Computer Engineering and Design, 2020, 41(11):3168-3174.
[18]	LAPTEV I. On space-time interest points[J]. International JournaI of Computer Vision, 2005, 64(2/3):107-123.
[19]	DOLLAR P, RABAUD V, COTTRELL G W, et al. Behavior recognition via sparse spatio-temporal features[C]//Proceedings of the 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, Beijing, Oct 15-16, 2005. Piscataway: IEEE, 2005: 65-72.
[20]	WANG H, ULLAH M M, KLÄSER A, et al. Evaluation of local spatio-temporal features for action recognition[C]//Proceedings of the British Machine Vision Conference, London, Sep 7-10, 2009. Durham: BMVA Press, 2009: 1-11.
[21]	WILLEMS G, TUYTELAARS T, VAN GOOL L. An efficient dense and scale-invariant spatio-temporal interest point detector[C]//LNCS 5303: Proceedings of the 10th European Conference on Computer Vision, Marseille, Oct 12-18, 2008. Berlin, Heidelberg: Springer, 2008: 650-663.
[22]	陈艳, 胡荣, 李升健, 等. 基于组合特征和SVM的视频中人体行为识别算法[J]. 沈阳工业大学学报, 2020, 42(6):665-669.
	CHEN Y, HU R, LI S J, et al. Recognition algorithm for human behavior in video based on combined features and SVM[J]. Journal of Shenyang University of Technology, 2020, 42(6):665-669.
[23]	WANG H, KLÄSER A, SCHMID C, et al. Action recognition by dense trajectories[C]//Proceedings of the 24th IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, Jun 20-25, 2011. Washington: IEEE Com-puter Society, 2011: 3169-3176.
[24]	WANG H, SCHMID C. Action recognition with improved trajectories[C]//Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, Dec 1-8, 2013. Washington: IEEE Computer Society, 2014: 3551-3558.
[25]	李元祥, 谢林柏. 基于深度运动图和密集轨迹的行为识别算法[J]. 计算机工程与应用, 2020, 56(3):194-200.
	LI Y X, XIE L B. Human action recognition based on depth motion map and dense trajectory[J]. Computer Engineering and Applications, 2020, 56(3):194-200.
[26]	SIMONYAN K, ZISSERMAN A. Two-stream convolutional networks for action recognition in videos[C]//Proceedings of the Annual Conference on Neural Information Processing Systems, Montreal, Dec 8-13, 2014. Red Hook: Curran Associates, 2014: 568-576.
[27]	周云, 陈淑荣. 基于双流非局部残差网络的行为识别方法[J]. 计算机应用, 2020, 40(8):2236-2240.
	ZHOU Y, CHEN S R. Behavior recognition method based on two-stream non-local residual network[J]. Journal of Computer Applications, 2020, 40(8):2236-2240.
[28]	王增强, 张文强, 张良. 引入高阶注意力机制的人体行为识别[J]. 信号处理, 2020, 36(8):1272-1279.
	WANG Z Q, ZHANG W Q, ZHANG L. Human behavior recognition with high-order attention mechanism[J]. Journal of Signal Processing, 2020, 36(8):1272-1279.
[29]	FEICHTENHOFER C, PINZ A, ZISSERMAN A. Convolutional two-stream network fusion for video action recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun 27-30, 2016. Washington: IEEE Computer Society, 2016: 1933-1941.
[30]	FEICHTENHOFER C, PINZ A, WILDES R P. Spatiotemporal residual networks for video action recognition[C]//Proceedings of the Annual Conference on Neural Information Processing Systems, Barcelona, Dec 5-10, 2016. Red Hook: Curran Associates, 2017: 3468-3476.
[31]	潘娜, 蒋敏, 孔军. 基于时空交互注意力模型的人体行为识别算法[J]. 激光与光电子学展, 2020, 57(18):317-325.
	PAN N, JIANG M, KONG J. Human action recognition algorithm based on spatial-temporal interactive attention model[J]. Laser & Optoelectronics Progress, 2020, 57(18):317-325.
[32]	WANG L L, GE L Z, LI R F, et al. Three-stream CNNs for action recognition[J]. Pattern Recognition Letters, 2017, 92:33-40. DOI URL
[33]	BILEN H, FERNANDO B, GAVVES E, et al. Action recognition with dynamic image networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(12):2799-2813. DOI URL
[34]	BACCOUCHE M, MAMALET F, WOLF C, et al. Sequential deep learning for human action recognition[C]//LNCS 7065: Proceedings of the 2nd International Workshop on Human Behavior Understanding, Amsterdam, Nov 16, 2011. Berlin, Heidelberg: Springer, 2011: 29-39.
[35]	JI S, XU W, YANG M, et al. 3D convolutional neural networks for human action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(1):221-231. DOI URL
[36]	SUN L, JIA K, YEUNG D Y, et al. Human action recognition using factorized spatio-temporal convolutional networks[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Dec 7-13, 2015. Washington: IEEE Computer Society, 2015: 4597-4605.
[37]	李梁华, 王永雄 . 高效. 3D密集残差网络及其在人体行为识别中的应用[J]. 光电工程, 2020, 47(2):23-33.
	LI L H, WANG Y X. Efficient 3D dense residual network and its application in human action recognition[J]. Opto-Electronic Engineering, 2020, 47(2):23-33.
[38]	CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? a new model and the kinetics dataset[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Washington: IEEE Computer Society, 2017: 4724-4733.
[39]	DIBA A, FAYYAZ M, SHARMA V, et al. Temporal 3D Convnets: new architecture and transfer learning for video classification[J]. arXiv:1711.08200, 2017.
[40]	VAROL G, LAPTEV I, SHMID C. Long-term temporal convolutions for action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(6):1510-1517. DOI URL
[41]	张小俊, 李辰政, 孙凌宇, 等. 基于改进3D卷积神经网络的行为识别[J]. 计算机集成制造系统, 2019, 25(8):2000-2006.
	ZHANG X J, LI C Z, SUN L Y, et al. Behavior recognition method based on improved 3D convolutional neural network[J]. Computer Integrated Manufacturing Systems, 2019, 25(8):2000-2006.
[42]	HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8):1735-1780. DOI URL
[43]	ANDREJ K, GEORGE T, SANKETH S, et al. Large-scale video classification with convolutional neural networks[C]//Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, Jun 23-28, 2014. Washington: IEEE Computer Society, 2014: 1725-1732.
[44]	祁大健, 杜慧敏, 张霞, 等. 基于上下文特征融合的行为识别算法[J]. 计算机工程与应用, 2020, 56(2):171-175.
	QI D J, DU H M, ZHANG X, et al. Behavior recognition algorithm based on context feature fusion[J]. Computer Engineering and Applications, 2020, 56(2):171-175.
[45]	NG J Y H, HAUSKNECHT M J, VIJAYANARASIMHAN S, et al. Beyond short snippets: deep networks for video classification[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, Jun 7-12, 2015. Washington: IEEE Computer Society, 2015: 4694-4702.
[46]	马翠红, 王毅, 毛志强. 基于注意力的双流CNN的行为识别[J]. 计算机工程与设计, 2020, 41(10):2903-2906.
	MA C H, WANG Y, MAO Z Q. Action recognition of two-stream CNN based on attention[J]. Computer Engineering and Design, 2020, 40(10):2903-2906.
[47]	揭志浩, 曾明如, 周鑫恒, 等. 结合Attention-ConvLSTM的双流卷积行为识别[J]. 小型微型计算机系统, 2021, 42(2):405-408.
	JIE Z H, ZENG M R, ZHOU X H, et al. Two stream CNN with Attention-ConvLSTM on human behavior recognition[J]. Journal of Chinese Computer Systems, 2021, 42(2):405-408.
[48]	KIPF T N, WELLING M. Semi-supervised classification with graph convolutional networks[C]// Proceedings of the 5th International Conference on Learning Representations, Toulon, Apr 24-26, 2017: 1-14.
[49]	WANG P C, LI Z Y, HOU Y H, et al. Action recognition based on joint trajectory maps using convolutional neural networks[C]//Proceedings of the 2016 ACM Conference on Multimedia Conference, Amsterdam, Oct 15-19, 2016. New York: ACM, 2016: 102-106.
[50]	SHAO Z P, LI Y F, GUO Y, et al. A hierarchical model for human action recognition from body-parts[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2019, 29(10):2986-3000. DOI URL
[51]	HINTON G. A practical guide to training restricted Boltzmann machines[J]. Momentum, 2010, 9(1):926-947.
[52]	TAYLOR G W, FERGUS R, LECUN Y, et al. Convolutional learning of spatio-temporal features[C]//LNCS 6316: Proceedings of the 11th European Conference on Computer Vision, Heraklion, Sep 5-11, 2010. Berlin, Heidelberg: Springer, 2010: 140-153.
[53]	TRAN S N, BENETOS E, D'AVILA G A S. Learning motion-difference features using Gaussian restricted Boltzmann machines for efficient human action recognition[C]//Proceedings of the 2014 International Joint Conference on Neural Networks, Beijing, Jul 6-11, 2014. Piscataway: IEEE, 2014: 2123-2129.
[54]	WANG X L, GIRSHICK R B, GUPTA A, et al. Non-local neural networks[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Washington: IEEE Computer Society, 2018: 7794-7803.
[55]	CHI L, TIAN G Y, MU Y D, et al. Fast non-local neural networks with spectral residual learning[C]//Proceedings of the 27th ACM International Conference on Multimedia, Nice, Oct 21-25, 2019. New York: ACM, 2019: 2142-2151.
[56]	叶丹, 李智, 王勇军. 基于SPLDA降维和XGBoost分类器的行为识别方法研究[J]. 微电子学与计算机, 2019, 36(6):35-39.
	YE D, LI Z, WANG Y J. Research on behavior identification based on SPLDA dimensional reduction algorithm and XGBoost classifier[J]. Microelectronics & Computer, 2019, 36(6):35-39.
[57]	SCHÜLDT C, LAPTEV I, CAPUTO B. Recognizing human actions: a local SVM approach[C]//Proceedings of the 17th International Conference on Pattern Recognition, Cambridge, Aug 23-26, 2004. Washington: IEEE Computer Society, 2004: 23-26.
[58]	MARSZALEK M, LAPTEV I, SCHMID C. Actions in context[C]//Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Miami, Jun 20-25, 2009. Washington: IEEE Computer Society, 2009: 2929-2936.
[59]	RODRIGUEZ M D, AHMED J, SHAH M. Action MACH a spatio-temporal maximum average correlation height filter for action recognition[C]//Proceedings of the 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Anchorage, Jun 24-26, 2008. Washington: IEEE Computer Society, 2008: l-8.
[60]	SOOMRO K, ZAMIR A R, SHAH M. UCF101: a dataset of 101 human actions classes from videos in the wild[J]. arXiv:1212.0402, 2012.
[61]	NIEBLES J C, CHEN C W, LI F F. Modeling temporal structure of decomposable motion segments for activity classification[C]//LNCS 6312: Proceedings of the 11th European Conference on Computer Vision, Heraklion, Sep 5-11, 2010. Berlin, Heidelberg: Springer, 2010: 392-405.
[62]	KUEHNE H, JHUANG H, GARROTE E, et al. HMDB: a large video database for human motion recognition[C]//Proceedings of the 2011 IEEE International Conference on Computer Vision, Barcelona, Nov 6-13, 2011. Washington: IEEE Computer Society, 2011: 2556-2563.
[63]	DAMEN D, DOUGHTY H, FARINELLA G M, et al. The EPIC-Kitchens dataset: collection, challenges and baselines[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(11):4125-4141. DOI URL
[64]	KAY W, CARREIRA J, SIMONYAN K, et al. The Kinetics human action video dataset[J]. arXiv:1705.06950, 2017.
[65]	GU C H, SUN C, ROSS D A, et al. AVA: a video dataset of spatio-temporally localized atomic visual actions[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Washington: IEEE Computer Society, 2018: 6047-6056.
[66]	TANG Y S, DING D J, RAO Y M, et al. COIN: a large-scale dataset for comprehensive instructional video analysis[C]//Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, Jun 16-20, 2019. Piscataway: IEEE, 2019: 1207-1216.
[67]	ZHAO H, TORRALBA A, TORRESANI L, et al. HACS: human action clips and segments dataset for recognition and temporal localization[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway: IEEE, 2019: 8667-8677.
[68]	LI A, THOTAKURI M, ROSS D A, et al. The AVA-Kinetics localized human actions video dataset[J]. arXiv:2005.00214, 2020.
[69]	SHAO D, ZHAO Y, DAI B, et al. FineGym: a hierarchical video dataset for fine-grained action understanding[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, Jun 13-19, 2020. Piscataway: IEEE, 2020: 2613-2622.
[70]	ONEATA D, VERBEEK J J, SCHMID C. Action and event recognition with fisher vectors on a compact feature set[C]//Proceedings of the 2013 International Conference on Computer Vision, Sydney, Dec 1-8, 2013. Washington: IEEE Computer Society, 2013: 1817-1824.
[71]	LAN Z Z, LIN M, LI X C, et al. Beyond Gaussian pyramid: multi-skip feature stacking for action recognition[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, Jun 7-12, 2015. Washington: IEEE Computer Society, 2015: 204-212.
[72]	FERNANDO B, GAVVES E, ORAMAS M J, et al. Modeling video evolution for action recognition[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, Jun 7-12, 2015. Washington: IEEE Computer Society, 2015: 5378-5387.
[73]	SHI F, LAGANIÈRE R, PETRIU E M. Local part model for action recognition[J]. Image and Vision Computing, 2016, 16(11):18-28.
[74]	WANG L M, QIAO Y, TANG X O. Action recognition with trajectory-pooled deep-convolutional descriptors[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, Jun 7-12, 2015. Washington: IEEE Computer Society, 2015: 4305-4314.
[75]	CHEN L, LIU Y G, MAN Y C. Two-stream CNN based on segmentation for action recognition[C]//Proceedings of the 39th China Control Conference, Liaoning, Jul 27-29, 2020. Piscataway: IEEE, 2020: 1160-1165.
[76]	李洪均, 丁宇鹏, 李超波, 等. 基于特征融合时序分割网络的行为识别研究[J]. 计算机研究与发展, 2020, 57(1):145-158.
	LI H J, DING Y P, LI C B, et al. Action recognition of temporal segment network based on feature fusion[J]. Journal of Computer Research and Development, 2020, 57(1):145-158.
[77]	TRAN D, WANG H, TORRESANI L, et al. A closer look at spatiotemporal convolutions for action recognition[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Washington: IEEE Computer Society, 2018: 1010-1019.

方法	优点	缺点	应用场景
轮廓剪影^[12,13,14]	1. 关键区域简单 2. 信息量丰富 3. 描述能力强	1. 灵活性低 2. 对噪声和拍摄角度敏感 3. 物体遮挡时效果大大降低 4. 轮廓细节难以捕捉	背景简单,人体遮挡程度低
人体关节点^[15,16,17]	1. 不需要提取人体模型 2. 不需要大量像素 3. 价格便宜	1. 对光线和拍摄角度敏感 2. 物体遮挡时效果降低 3. 计算相对复杂	目标较小,人体遮挡程度低
时空兴趣点^{[18,19,20,21]}	1. 不需要背景剪除 2. 对场景适应性增强 3. 自动化程度增强	1. 对拍摄光线和人体遮挡敏感 2. 不同兴趣点提取方法密集度和时空复杂度不可兼得	背景相对复杂的场景
运动轨迹^[23,24,25]	1. 鲁棒性强 2. 表征能力强 3. 无视背景干扰 4. 运动信息保留完整	1. 分类器训练计算复杂度高 2. 速度慢 3. 计算复杂度高	应用场景丰富,没有太大拘束

方法	优点	缺点	应用场景
轮廓剪影^[12,13,14]	1. 关键区域简单 2. 信息量丰富 3. 描述能力强	1. 灵活性低 2. 对噪声和拍摄角度敏感 3. 物体遮挡时效果大大降低 4. 轮廓细节难以捕捉	背景简单,人体遮挡程度低
人体关节点^[15,16,17]	1. 不需要提取人体模型 2. 不需要大量像素 3. 价格便宜	1. 对光线和拍摄角度敏感 2. 物体遮挡时效果降低 3. 计算相对复杂	目标较小,人体遮挡程度低
时空兴趣点^{[18,19,20,21]}	1. 不需要背景剪除 2. 对场景适应性增强 3. 自动化程度增强	1. 对拍摄光线和人体遮挡敏感 2. 不同兴趣点提取方法密集度和时空复杂度不可兼得	背景相对复杂的场景
运动轨迹^[23,24,25]	1. 鲁棒性强 2. 表征能力强 3. 无视背景干扰 4. 运动信息保留完整	1. 分类器训练计算复杂度高 2. 速度慢 3. 计算复杂度高	应用场景丰富,没有太大拘束

模型架构	优点	缺点
双流网络架构^[5,26,29-30]	1. 注重时空信息 2. 准确率高 3. 使用广泛	1. 依赖巨大的数据量输入 2. 硬件需求高 3. 分离训练网络,耗时
3D卷积神经网络架构^{[8-9,36,38,40]}	1. 速度更快 2. 注重运动信息	1. 计算开销大,硬件要求高 2. 识别准确率比双流网络低
混合网络架构^{[10-11,45,48]}	1. 速度快,准确性高 2. 组合多样	1. 组合困难 2. 组合复杂度高

模型架构	优点	缺点
双流网络架构^[5,26,29-30]	1. 注重时空信息 2. 准确率高 3. 使用广泛	1. 依赖巨大的数据量输入 2. 硬件需求高 3. 分离训练网络,耗时
3D卷积神经网络架构^{[8-9,36,38,40]}	1. 速度更快 2. 注重运动信息	1. 计算开销大,硬件要求高 2. 识别准确率比双流网络低
混合网络架构^{[10-11,45,48]}	1. 速度快,准确性高 2. 组合多样	1. 组合困难 2. 组合复杂度高

方法	关键点	数据量	研究热度	配置要求	效果
传统方式	特征提取	少	中	低	良
深度学习	数据量支撑	多	高	高	优

Review of Human Behavior Recognition Research

人体行为识别研究综述

RichHTML

PDF

Knowledge

Abstract

Cite this article

share this article

Figures/Tables 15

References 77

Related Articles 15

Recommended Articles 0

Metrics

数据集	年份	来源	样本数量	类数	平均时长/s	实例类/行为类别
KTH^[57]	2004	志愿者拍摄	600	6	4.00	走路、慢跑、跑步、拳击、鼓掌、挥手
UCF-Sports^[59]	2008	体育电视	150	10	6.39	体育运动：潜水、高尔夫运动、踢
Hollywood2^[58]	2009	好莱坞电影	3 669	12	19.00	日常生活：吃饭、打电话、握手
HMDB51^[62]	2011	互联网、电影	6 849	51	2.00~5.00	一般面部动作、交互面部动作、一般身体动作、物体交互动作、人体交互动作
UCF101^[60]	2012	视频网站	13 320	101	5.00	人与物体交互动作、身体动作、人体交互动作、乐器演奏、运动
Sports-1M^[43]	2015	视频网站	1 133 158	487	336.00	水上运动、团队运动、冬季运动、球类运动、对抗运动、动物交互运动
ActivityNet200	2016	视频网站	19 994	200	109.00	日常生活：跳远、遛狗、擦地板
Kinetics^[64]	2017	视频网站	306 245	400	10.00	弹奏乐器、日常生活：握手
Epic-Kitchens^[63]	2018	实验拍摄	432	149	10.00	厨房日常：做饭、打扫、准备食物、洗
AVA^[65]	2018	电影	57 600	80	3.00	日常生活：行走、踢、握手
COIN^[66]	2019	视频网站	11 827	180	14.19	日常生活：接发、刮胡、熨衣、抽血
HACS^[67]	2019	视频网站	504 000	200	2.00	运动：跳绳、撑杆跳高、铲雪
AVA-Kinetics^[68]	2020	视频网站	57 600	80	3.00	日常生活：拥抱、饮酒

文献	定性分析			识别率/%			相关分析
文献	输入	机制	网络框架/特征	Holly-wood2	Olympic Sports*	Sports-1M	优势	局限	适用场景
T：DT^[23]	RGB OF	Trajectories	HOG、HOF、MBH	58.3	—	—	表征强	相机运动	视频监控
T：IDT^[24]			HOG、HOF、MBH	64.3	91.1	—	稳定、可靠	耗时	运动员训练
T：IDT^[70]			MBH、SIHFMFCC	63.5	82.1	—	鲁棒性强	环境影响	运动员训练
T：MIFS^[71]			HOG、HOF、PCA	68.0	91.4	—	计算成本小	跟踪质量偏低	体育赛事
T:Video-Darwin^[72]	RGB	temporal pooling	HOF、MBH Comb	73.7	—	—	简单、速度快	机器普适性差	厨房日常
D：ELS Fusion^[43]	RGB	CNN	slow fusion	—	—	60.9	速度快、通用	相机运动影响	球类运动识别
D：HRP+RP	RGB	CNN	VGG-16	74.1	—	—	高容量编码	参数较多	日常监控
D：LSTM^[45]	RGB OF	CNN-LSTM	Two-stream 3D ConvNet	—	—	73.1	长视频、低计算	过程复杂	运动识别

[1]	AN Fengping, LI Xiaowei, CAO Xiang. Medical Image Classification Algorithm Based on Weight Initialization-Sliding Window CNN [J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(8): 1885-1897.
[2]	HUANG Hao, GE Hongwei. Deep Residual Expression Recognition Network to Enhance Inter-class Discrimination [J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(8): 1842-1849.
[3]	YU Huilin, CHEN Wei, WANG Qi, GAO Jianwei, WAN Huaiyu. Knowledge Graph Link Prediction Based on Subgraph Reasoning [J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(8): 1800-1808.
[4]	ZENG Fanzhi, XU Luqian, ZHOU Yan, ZHOU Yuexia, LIAO Junwei. Review of Knowledge Tracing Model for Intelligent Education [J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(8): 1742-1763.
[5]	LIU Yi, LI Mengmeng, ZHENG Qibin, QIN Wei, REN Xiaoguang. Survey on Video Object Tracking Algorithms [J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(7): 1504-1515.
[6]	ZHAO Xiaoming, YANG Yijiao, ZHANG Shiqing. Survey of Deep Learning Based Multimodal Emotion Recognition [J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(7): 1479-1503.
[7]	XIA Hongbin, XIAO Yifei, LIU Yuan. Long Text Generation Adversarial Network Model with Self-Attention Mechanism [J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(7): 1603-1610.
[8]	LI Yuxuan, HONG Xuehai, WANG Yang, TANG Zhengzheng, BAN Yan. Groupwise Learning to Rank Algorithm with Introduction of Activated Weighting [J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(7): 1594-1602.
[9]	SUN Fangwei, LI Chengyang, XIE Yongqiang, LI Zhongbo, YANG Caidong, QI Jin. Review of Deep Learning Applied to Occluded Object Detection [J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(6): 1243-1259.
[10]	LIU Yafen, ZHENG Yifeng, JIANG Lingyi, LI Guohe, ZHANG Wenjie. Survey on Pseudo-Labeling Methods in Deep Semi-supervised Learning [J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(6): 1279-1290.
[11]	ZHANG Yancao, ZHAO Yuhai, SHI Lan. Multi-feature Based Link Prediction Algorithm Fusing Graph Attention [J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(5): 1096-1106.
[12]	OU Yangliu, HE Xi, QU Shaojun. Fully Convolutional Neural Network with Attention Module for Semantic Segmentation [J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(5): 1136-1145.
[13]	CHENG Weiyue, ZHANG Xueqin, LIN Kezheng, LI Ao. Deep Convolutional Neural Network Algorithm Fusing Global and Local Features [J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(5): 1146-1154.
[14]	TONG Gan, HUANG Libo. Review of Winograd Fast Convolution Technique Research [J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(5): 959-971.
[15]	ZHONG Mengyuan, JIANG Lin. Review of Super-Resolution Image Reconstruction Algorithms [J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(5): 972-990.

文献	定性分析			识别率/%		相关分析
文献	输入	机制	网络框架/特征	HMDB51	UCF101	优势	局限	适用场景
T：Part Model^[73]	RGB、OF	Multiscale local model	GBH、PCA	61.0	86.6	实时计算较好	复杂	视频监控
T+D：TDD^[74]		Trajectories、 Two-stream	TDD+IDT	65.9	91.5	避免手工	识别精度有限	人体交互
T+D：TDD^[74]		Trajectories、 Two-stream	TDD	63.2	90.3	避免手工	识别精度有限	人体交互
D：Segmentation^[75]	RGB、OF	Two-stream	BN-Inception Inception-v3	71.7 72.3	95.2 95.5	加强网络泛化	识别速度	日常生活
D：Attention- ConvLSTM^[47]	RGB、OF		VGG-16	69.8	94.6	强化帧间依赖	参数较多	运动场景
D：Fusion^[29]	RGB、OF		VGG-16、DT	69.2	93.5	时空融合	计算复杂性	日常生活
D：TSN^[5]	RGB、OF、WF		BN-Inception	69.4	94.2	高效学习整个视频	初始化复杂	长镜头场景
D：Sparse+TSN^[76]	RGB、OF Sparse		Inception	76.4	96.9	特征利用率提高	特征交互弱	体育运动
D：I3D^[38]	RGB、OF	3D-CNN	BN-Inception	80.7	98.0	感受野更大	相机移动影响	日常监控
D：LTC^[40]	RGB、OF	3D-CNN	depth-d	67.2	92.7	长序列探索	时间复杂度	体育运动
D：P3D+IDT^[8]	RGB、OF	2D+1D CNN	ResNet	—	93.7	参数量降低	识别效果	运动场景
D：R(2+1)D^[77]	RGB	Mixed convolution	ResNet	78.7	97.3	参数易优化	空间分辨率	日常生活
D：CNN+LSTM^[45]	RGB、OF	CNN+LSTM	AlexNet、GoogleLeNet	—	88.6	降低噪声影响	参数复杂	体育运动

文献	定性分析			识别率/%		相关分析
文献	输入	机制	网络框架/特征	HMDB51	UCF101	优势	局限	适用场景
T：Part Model^[73]	RGB、OF	Multiscale local model	GBH、PCA	61.0	86.6	实时计算较好	复杂	视频监控
T+D：TDD^[74]		Trajectories、 Two-stream	TDD+IDT	65.9	91.5	避免手工	识别精度有限	人体交互
T+D：TDD^[74]		Trajectories、 Two-stream	TDD	63.2	90.3	避免手工	识别精度有限	人体交互
D：Segmentation^[75]	RGB、OF	Two-stream	BN-Inception Inception-v3	71.7 72.3	95.2 95.5	加强网络泛化	识别速度	日常生活
D：Attention- ConvLSTM^[47]	RGB、OF		VGG-16	69.8	94.6	强化帧间依赖	参数较多	运动场景
D：Fusion^[29]	RGB、OF		VGG-16、DT	69.2	93.5	时空融合	计算复杂性	日常生活
D：TSN^[5]	RGB、OF、WF		BN-Inception	69.4	94.2	高效学习整个视频	初始化复杂	长镜头场景
D：Sparse+TSN^[76]	RGB、OF Sparse		Inception	76.4	96.9	特征利用率提高	特征交互弱	体育运动
D：I3D^[38]	RGB、OF	3D-CNN	BN-Inception	80.7	98.0	感受野更大	相机移动影响	日常监控
D：LTC^[40]	RGB、OF	3D-CNN	depth-d	67.2	92.7	长序列探索	时间复杂度	体育运动
D：P3D+IDT^[8]	RGB、OF	2D+1D CNN	ResNet	—	93.7	参数量降低	识别效果	运动场景
D：R(2+1)D^[77]	RGB	Mixed convolution	ResNet	78.7	97.3	参数易优化	空间分辨率	日常生活
D：CNN+LSTM^[45]	RGB、OF	CNN+LSTM	AlexNet、GoogleLeNet	—	88.6	降低噪声影响	参数复杂	体育运动