Human Uncivilized Behavior Detection Method Integrating Non-uniform Sampling and Feature Enhancement

doi:10.3778/j.issn.1673-9418.2401064

Abstract

Abstract: To address the false detection of similar behaviors and the low accuracy on local body-part behaviors in spatio-temporal action detection of abnormal human behavior, a method that integrates non-uniform sampling and feature enhancement is proposed, based on the self-built uncivilized behavior spatio-temporal action detection dataset (UBSAD). First, in the spatio-temporal feature extraction stage, the method adopts the Video Swin Transformer (VST) as the backbone network to capture long-term temporal dependencies in videos and to strengthen the network's global information learning capability. Second, a proposed ringed residual VST block replaces the standard VST block in the final stage of the backbone network, enlarging the difference between the target area and the background area; combined with the multi-head self-attention mechanism, this strengthens feature extraction in the target area. Third, in the video frame collection stage, a non-uniform sampling method is proposed that adjusts the input data distribution according to task requirements, so that the model obtains action change information hierarchically and pays more attention to the detailed features of similar behaviors. Finally, a new cascaded pooling three-dimensional spatial pyramid feature enhancement module that incorporates shallow features is embedded after the feature extraction network; it improves the applicability of features at multiple scales, reduces the loss of detailed motion information during feature extraction, and reduces interference from background information. Experimental results show that the method achieves an mAP of 71.93% on the UBSAD dataset and 83.09% on the public dataset UCF101-24, which is 7.39 and 1.22 percentage points higher, respectively, than using the baseline network VST as the spatio-temporal feature extraction model, demonstrating that the method detects target behaviors effectively.
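
The abstract does not give the sampling schedule, so the following is only an illustrative sketch of what non-uniform frame sampling could look like, not the authors' method. It assumes that frames are taken densely near a keyframe and more sparsely farther away, with offsets growing geometrically. The function name `nonuniform_frame_indices` and the parameters `per_side` and `growth` are hypothetical and do not come from the paper.

```python
from typing import List


def nonuniform_frame_indices(
    num_frames: int, center: int, per_side: int = 4, growth: int = 2
) -> List[int]:
    """Return frame indices that are dense near `center` and sparse far from it.

    Assumption (not from the paper): offsets grow geometrically as
    growth**0, growth**1, ..., so nearby frames capture fine motion detail
    and distant frames capture longer-range context.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if not 0 <= center < num_frames:
        raise ValueError("center must be a valid frame index")

    offsets = [growth ** k for k in range(per_side)]  # e.g. 1, 2, 4, 8
    # Clamp to the video boundaries so the clip length stays fixed;
    # near the edges this repeats the first or last frame.
    before = [max(center - off, 0) for off in reversed(offsets)]
    after = [min(center + off, num_frames - 1) for off in offsets]
    return before + [center] + after


if __name__ == "__main__":
    # 64-frame clip, keyframe at index 32
    print(nonuniform_frame_indices(64, 32))
    # [24, 28, 30, 31, 32, 33, 34, 36, 40]
```

Uniform sampling over the same span (24 to 40 with a stride of 2) would also return nine frames, but it would skip the frames immediately before and after the keyframe, which is where short, local motions that separate similar behaviors are most likely to appear.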

Key words: spatio-temporal action detection, ringed residual Video Swin Transformer, non-uniform sampling, cascaded pooling three-dimensional spatial pyramid

摘要： 针对人体异常行为时空动作检测对相似行为存在误检及局部肢体行为检测精度较低的问题，基于自制的不文明行为时空动作检测数据集（UBSAD），提出了一种融合非均匀采样与特征强化的人体不文明行为检测方法。该方法在时空特征提取阶段引入Video Swin Transformer（VST）作为主干网络，用于捕获视频中的长期时序依赖关系，提升网络全局信息学习能力；将提出的环形残差VST模块替换主干网络最后阶段的VST模块，放大目标区域和背景区域的差异，并结合多头自注意力机制，强化对目标区域的特征提取；在视频帧采集阶段提出了独特的非均匀采样方法，根据任务需求调整输入数据分布，使模型有层次地获取动作变化信息，有效提升网络对相似行为细节特征的关注；在特征提取网络之后嵌入新的融合浅层特征的级联池化三维空间金字塔特征强化模块，进一步增强多种尺度下的特征适用性，有效减少动作细节信息在特征提取过程中的丢失和降低背景信息的干扰，实现特征强化的效果。实验结果表明，该方法在UBSAD数据集和公开数据集UCF101-24上mAP指标分别达到了71.93%和83.09%，比使用基线网络VST作为时空特征提取模型分别提高了7.39个百分点和1.22个百分点，能够有效检测目标的行为。

关键词: 时空动作检测, 环形残差Video Swin Transformer, 非均匀采样, 级联池化三维空间金字塔

YE Hao, WANG Longye, ZENG Xiaoli, XIAO Yue. Human Uncivilized Behavior Detection Method Integrating Non-uniform Sampling and Feature Enhancement[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(12): 3219-3234.

叶浩, 王龙业, 曾晓莉, 肖越. 融合非均匀采样与特征强化的人体不文明行为检测方法[J]. 计算机科学与探索, 2024, 18(12): 3219-3234.
