时空特征金字塔模块下的视频行为识别

doi:10.3778/j.issn.1673-9418.2012119

计算机科学与探索 ›› 2022, Vol. 16 ›› Issue (9): 2061-2067.DOI: 10.3778/j.issn.1673-9418.2012119

时空特征金字塔模块下的视频行为识别

龚苏明, 陈莹()

江南大学轻工过程先进控制教育部重点实验室,江苏无锡 214122

收稿日期:2020-12-31 修回日期:2021-02-25 出版日期:2022-09-01 发布日期:2021-03-04
通讯作者: + E-mail: chenying@jiangnan.edu.cn
作者简介:龚苏明（1995—）,男,江苏镇江人,硕士研究生,主要研究方向为模式识别、行为识别。
陈莹（1976—）,女,浙江丽水人,博士,教授,CCF高级会员,主要研究方向为模式识别、信息融合。
基金资助:
国家自然科学基金(61573168)

Video Action Recognition Based on Spatio-Temporal Feature Pyramid Module

GONG Suming, CHEN Ying()

Key Laboratory of Advanced Process Control for Light Industry, Ministry of Education, Jiangnan University, Wuxi, Jiangsu 214122, China

Received:2020-12-31 Revised:2021-02-25 Online:2022-09-01 Published:2021-03-04
About author:GONG Suming, born in 1995, M.S. candidate. His research interests include pattern recognition and action recognition.
CHEN Ying, born in 1976, Ph.D., professor, senior member of CCF. Her research interests include pattern recognition and information fusion.
Supported by:
National Natural Science Foundation of China(61573168)

摘要/Abstract

摘要：

目前用于视频行为识别的主流2D卷积神经网络方法无法提取输入帧之间的相关信息,导致网络无法获得输入帧间的时空特征信息进而难以提升识别精度。针对目前主流方法存在的问题,提出了通用的时空特征金字塔模块（STFPM）。STFPM由特征金字塔和空洞卷积金字塔两部分组成,并能直接嵌入到现有的2D卷积神经网络中构成新的行为识别网络——时空特征金字塔网络（STFP-Net）。针对多帧图像输入,STFP-Net首先提取每帧输入的单独空域特征信息,并将这些特征信息记为原始特征;然后,所设计的STFPM利用矩阵转换操作对原始特征构建特征金字塔;其次,利用空洞卷积金字塔对构建的原始特征金字塔提取具有时空关联性的时序特征;接着,将原始特征与时序特征进行加权融合并传递给后续深层网络;最后,利用全连接对网络输出特征进行分类识别。与Baseline相比,STFP-Net引入了可忽略不计的额外参数和计算量。实验结果表明,与近些年主流方法相比,STFP-Net在主流数据库UCF101和HMDB51上的分类准确度具有明显提升。

关键词: 行为识别, 2D卷积网络, 时空特征, 特征金字塔, 空洞卷积金字塔

Abstract:

At present, the mainstream 2D convolution neural network method for video action recognition can't extract the relevant information between input frames, which makes it difficult for the network to obtain the spatio-temporal feature information between input frames and improve the recognition accuracy. To solve the existing problems, a universal spatio-temporal feature pyramid module (STFPM) is proposed. STFPM consists of feature pyramid and dilated convolution pyramid, which can be directly embedded into the existing 2D convolution network to form a new action recognition network named spatio-temporal feature pyramid net (STFP-Net). For multi-frame image input, STFP-Net first extracts the individual spatial feature information of each frame input and records it as the original feature. Then, the designed STFPM uses matrix operation to construct the feature pyramid of the original feature. Furthermore, the spatio-temporal features with temporal and spatial correlation are extracted by applying the dilated convolution pyramid to feature pyramid. Next, the original features and spatio-temporal features are fused by a weighted summation and transmitted to the deep network. Finally, the action in the video is classified by full connected layer. Compared with Baseline, STFP-Net introduces negligible additional parameters and computational complexity. Experimental results show that compared with mainstream methods in recent years, STFP-Net has significant improvement in classification accuracy on the general datasets UCF101 and HMDB51.

Key words: action recognition, 2D convolution network, spatio-temporal features, feature pyramid, dilated convolu-tion pyramid

中图分类号:

TP391.4

龚苏明, 陈莹. 时空特征金字塔模块下的视频行为识别[J]. 计算机科学与探索, 2022, 16(9): 2061-2067.

GONG Suming, CHEN Ying. Video Action Recognition Based on Spatio-Temporal Feature Pyramid Module[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(9): 2061-2067.

图/表 10

图1 ResNet50构成模块及改进模块

Fig.1 Composition module of ResNet50 and corresponding improvement module

图2 网络整体架构

Fig.2 Overall structure of network

图3 特征金字塔示意图

Fig.3 Schematic diagram of feature pyramid

图4 空洞卷积和卷积金字塔

Fig.4 Dilated convolution and convolution pyramid

图5 时空特征金字塔详细过程

Fig.5 Detailed process of spatio-temporal feature pyramid

表1 特征融合策略的影响

Table 1 Influence of feature fusion strategy

融合策略	UCF101正确率/%
特征级联	86.994
加权融合	90.325

表2 金字塔层数的影响

Table 2 Influence of pyramid layers

层数	UCF101正确率/%
2	88.686
3	90.325
4	89.638

表3 嵌入位置的影响

Table 3 Influence of embedding position

嵌入位置	UCF101正确率/%
Stage{3, 4}	90.219
Stage4	90.325

表4 模型参数和计算量

Table 4 Model parameters and calculation amount

网络	模型大小/MB	FLOPs/10⁹
Baseline	25.557 032	32.892 117
STFP-Net（2层）	25.557 059	32.902 955
STFP-Net（3层）	25.557 086	32.913 793
STFP-Net（4层）	25.557 113	32.924 631

表5 与主流方法正确率的比较

Table 5 Comparison of accuracy with mainstream methods 单位：%

方法	UCF101	HMDB51	方法	UCF101	HMDB51
C3D+IDT^[9]	90.4	—	I3D^[21]	95.7	74.3
R(2+1)D^[16]	95.0	72.7	StNet^[22]	93.5	—
LTC^[17]	91.7	64.8	Hidden-ts^[23]	93.2	66.8
Ts+LSTM^[18]	88.6	—	STH^[24]	96.0	74.8
TLE^[19]	95.4	71.1	ISTPA-Net^[25]	95.5	70.7
MiCT^[11]	94.7	70.5	Baseline	94.2	69.4
TSM^[20]	94.5	70.7	Ours(STFP-Net)	96.4	75.5
P3D^[10]	93.7	—

参考文献 25

[1]	朱红蕾, 朱昶胜, 徐志刚. 人体行为识别数据集研究进展[J]. 自动化学报, 2018, 44(6): 978-1004.
	ZHU H L, ZHU C S, XU Z G. Research progress of human action recognition datasets[J]. Acta Automatica Sinica, 2018, 44(6): 978-1004.
[2]	IKIZLER-CINBIS N, SCLAROFF S. Object, scene and actions: combining multiple features for human action recognition[C]// LNCS 6311: Proceedings of the 11th Europ-ean Conference on Computer Vision, Heraklion, Sep 5-11, 2010. Berlin, Heidelberg: Springer, 2010: 494-507.
[3]	张良, 鲁梦梦, 姜华. 局部分布信息增强的视觉单词描述与动作识别[J]. 电子与信息学报, 2016, 38(3): 549-556.
	ZHANG L, LU M M, JIANG H. An improved scheme of visual words description and action recognition using local enhanced distribution information[J]. Journal of Electronics & Information Technology, 2016, 38(3): 549-556.
[4]	KARPATHY A, TODERICI G, SHETTY S, et al. Large-scale video classification with convolutional neural networks[C]// Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, Jun 23-28, 2014. Washington: IEEE Computer Society, 2014: 1725-1732.
[5]	SIMONYAN K, ZISSERMAN A. Two-stream convoluti-onal networks for action recognition in videos[C]// Procee-dings of the Annual Conference on Neural Information Processing Systems 2014, Montreal, Dec 8-11, 2014. Red Hook: Curran Associates, 2014: 568-576.
[6]	WANG L M, XIONG Y J, WANG Z, et al. Temporal segm-ent networks: towards good practices for deep action recognition[C]// LNCS 9912: Proceedings of the 14th European Conference on Computer Vision, Amsterdam, Oct 8-16, 2016. Cham: Springer, 2016: 20-36.
[7]	周波, 李俊峰. 结合目标检测的人体行为识别[J]. 自动化学报, 2019, 42(5): 56-67.
	ZHOU B, LI J F. Human action recognition combined with object detection[J]. Acta Automatica Sinica, 2019, 42(5): 56-67.
[8]	刘天亮, 谯庆伟, 万俊伟, 等. 融合空间-时间双网络流和视觉注意的人体行为识别[J]. 电子与信息学报, 2018, 40(10): 2395-2401.
	LIU T L, QIAO Q W, WAN J W, et al. Human action recognition based on spatial-temporal double network flow and visual attention[J]. Journal of Electronics and Inform-ation Technology, 2018, 40(10): 2395-2401.
[9]	JI S W, XU W, YANG M, et al. 3D convolutional neural networks for human action recognition[J]. IEEE Transa-ctions on Pattern Analysis and Machine Intelligence, 2013, 35(1): 221-231.
[10]	QIU Z F, YAO T, MEI T. Learning spatio-temporal repr-esentation with pseudo-3D residual networks[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Oct 22-29, 2017. Washington: IEEE Comp-uter Society, 2017: 5533-5541.
[11]	ZHOU Y, SUN X, ZHA Z J, et al. MICT: mixed 3D/2D convolutional tube for human action recognition[C]// Proce-edings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Washington: IEEE Computer Society, 2018: 449-458.
[12]	YU F, KOLTUN V. Multi-scale context aggregation by dil-ated convolutions[C]// Proceedings of the 4th International Conference on Learning Representations, San Juan, May 2-4, 2016: 1-13.
[13]	HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recogn-ition, Las Vegas, Jun 27-30, 2016. Washington: IEEE Com-puter Society, 2016: 770-778.
[14]	SOOMRO K, ROSHAN ZAMIR A, SHAH M. UCF101: a dataset of 101 human actions classes from videos in the wild[J]. arXiv:1212.0402, 2012.
[15]	KUEHNE H, JHUANG H, GARROTE E, et al. HMDB: a large video database for human motion recognition[C]// Pro-ceedings of the 2011 International Conference on Computer Vision, Barcelona, Nov 6-13, 2011. Washington: IEEE Com-puter Society, 2011: 2556-2563.
[16]	TRAN D, WANG H, TORRESANI L, et al. A closer look at spatiotemporal convolutions for action recognition[C]// Pro-ceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Washington: IEEE Computer Society, 2018: 6450-6459.
[17]	VAROL G, LAPTEV I, SCHMID C. Long-term temporal convolutions for action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(6): 1510-1517. DOI URL
[18]	NG Y H, HAUSKNECHT M J, VIJAYANARASIMHAN S, et al. Beyond short snippets: deep networks for video class-ification[C]// Proceedings of the 2015 IEEE Confer-ence on Computer Vision and Pattern Recognition, Boston, Jun 7-12, 2015. Washington: IEEE Computer Society, 2015: 4694-4702.
[19]	DIBA A, SHARMA V, VAN GOOL L. Deep temporal lin-ear encoding networks[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Washington: IEEE Computer Society, 2017: 1541-1550.
[20]	LIN J, GAN C, HAN S. TSM: temporal shift module for efficient video understanding[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway: IEEE, 2019: 7082-7092.
[21]	CARREIRA J, ZISSERMAN A. Quo vadis, action recogn-ition? A new model and the kinetics dataset[C]// Procee-dings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Was-hington: IEEE Computer Society, 2017: 4724-4733.
[22]	HE D L, ZHOU Z C, GAN C, et al. StNet: local and global spatial-temporal modeling for action recognition[C]// Proc-eedings of the 33rd AAAI Conference on Artificial Intelli-gence, the 31st Innovative Applications of Artificial Intellig-ence Conference, the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, Honolulu, Jan 27-Feb 1, 2019. Menlo Park: AAAI, 2019: 8401-8408.
[23]	ZHU Y, LAN Z Z, NEWSAM S D, et al. Hidden two-stream convolutional networks for action recognition[C]// LNCS 11363: Proceedings of the 14th Asian Conference on Computer Vision, Perth, Dec 2-6, 2018. Cham: Springer, 2018: 363-378.
[24]	LI X, WANG J, MA L, et al. STH: spatio-temporal hybrid convolution for efficient action recognition[J]. arXiv:2003.08042, 2020.
[25]	DU Y, YUAN C F, LI B, et al. Interaction-aware spatio-temporal pyramid attention networks for action classifica-tion[C]// LNCS 11220: Proceedings of the 15th European Conference on Computer Vision, Munich, Sep 8-14, 2018. Cham: Springer, 2018: 388-404.

时空特征金字塔模块下的视频行为识别

Video Action Recognition Based on Spatio-Temporal Feature Pyramid Module

RichHTML

PDF

可视化

摘要/Abstract

引用本文

使用本文

图/表 10

参考文献 25

相关文章 7

编辑推荐

Metrics

[1]	裴利沈, 赵雪专. 群体行为识别深度学习方法研究综述[J]. 计算机科学与探索, 2022, 16(4): 775-790.
[2]	裴利沈, 刘少博, 赵雪专. 人体行为识别研究综述[J]. 计算机科学与探索, 2022, 16(2): 305-322.
[3]	陶志勇, 郭京, 刘影. 基于多天线判决的CSI高效人体行为识别方法[J]. 计算机科学与探索, 2021, 15(6): 1122-1132.
[4]	苏江毅, 宋晓宁, 吴小俊, 於东军. 多模态轻量级图卷积人体骨架行为识别方法[J]. 计算机科学与探索, 2021, 15(4): 733-742.
[5]	王新文，谢林柏，彭力. 跌倒异常行为的双重残差网络识别方法[J]. 计算机科学与探索, 2020, 14(9): 1580-1589.
[6]	殷聪聪，张秋菊. 机器人演示学习编程技术研究综述[J]. 计算机科学与探索, 2020, 14(8): 1275-1287.
[7]	董旭，谭励，周丽娜，宋艳艳. 联合场景和行为特征的短视频行为识别[J]. 计算机科学与探索, 2020, 14(10): 1754-1761.