Journal of Frontiers of Computer Science and Technology

• Science Researches •

Human uncivilized behavior detection method integrating non-uniform sampling and feature enhancement

YE Hao, WANG Longye, ZENG Xiaoli, XIAO Yue   

  1. School of Electronics and Information Engineering, Southwest Petroleum University, Chengdu 610500, China
  2. School of Information Science and Technology, Tibet University, Lhasa 850000, China

融合非均匀采样与特征强化的人体不文明行为检测方法

叶浩,王龙业,曾晓莉,肖越   

  1. 西南石油大学 电气信息学院, 成都 610500
  2. 西藏大学 信息科学技术学院, 拉萨 850000

Abstract: To address false detections among similar behaviors and low accuracy on localized body movements in spatio-temporal action detection of abnormal human behavior, this work proposes a method that integrates non-uniform sampling with feature enhancement, built on a self-constructed uncivilized behavior spatio-temporal action detection dataset (UBSAD). First, Video Swin Transformer (VST) serves as the backbone network in the spatio-temporal feature extraction stage, capturing long-term temporal dependencies in videos and strengthening the network's ability to learn global information. Second, a ringed residual VST block replaces the standard VST block in the final stage of the backbone network; it enlarges the difference between the target region and the background region and, combined with the multi-head self-attention mechanism, strengthens feature extraction in the target region. Third, in the video frame collection stage, a non-uniform sampling method adjusts the input data distribution according to task requirements, so that the model obtains action change information hierarchically and attends more closely to the detailed features of similar behaviors. Finally, a new cascaded pooling three-dimensional spatial pyramid feature enhancement module that incorporates shallow features is embedded after the feature extraction network. This module further improves feature applicability across scales, reduces the loss of detailed motion information during feature extraction, and suppresses interference from background information. Experimental results show that the method achieves mAP of 71.93% on the UBSAD dataset and 83.09% on the public dataset UCF101-24, which is 7.39% and 1.22% higher, respectively, than using the baseline network VST as the spatio-temporal feature extraction model, demonstrating that the method detects target behaviors effectively.
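The abstract does not spell out the sampling rule, so the following is only a minimal sketch of one way non-uniform frame sampling could look, not the authors' scheme. It assumes frames are taken densely around an annotated key frame and sparsely farther away, so that a clip carries both fine and coarse motion information. The function name `non_uniform_indices` and the parameters `num_dense`, `num_sparse_per_side`, and `sparse_stride` are hypothetical.

```python
from typing import List


def non_uniform_indices(
    key_frame: int,
    num_frames: int,
    num_dense: int = 4,            # hypothetical: frames taken at stride 1 around the key frame
    num_sparse_per_side: int = 2,  # hypothetical: extra frames on each side at a larger stride
    sparse_stride: int = 4,        # hypothetical: spacing of the far-away frames
) -> List[int]:
    """Return frame indices that are dense near key_frame and sparse farther away.

    This is an illustrative assumption, not the sampling rule from the paper.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if not 0 <= key_frame < num_frames:
        raise ValueError("key_frame must lie inside the video")
    if num_dense < 1:
        raise ValueError("num_dense must be at least 1")

    # Dense offsets centred on the key frame, e.g. [-2, -1, 0, 1] for num_dense=4.
    first_dense = -(num_dense // 2)
    dense = list(range(first_dense, first_dense + num_dense))

    # Sparse offsets continue outward from both ends of the dense window.
    before = [dense[0] - sparse_stride * k for k in range(num_sparse_per_side, 0, -1)]
    after = [dense[-1] + sparse_stride * k for k in range(1, num_sparse_per_side + 1)]

    # Clamp to the valid range so clips near the video boundary repeat edge frames.
    offsets = before + dense + after
    return [min(max(key_frame + o, 0), num_frames - 1) for o in offsets]


print(non_uniform_indices(key_frame=50, num_frames=100))
# → [40, 44, 48, 49, 50, 51, 55, 59]
```

With these illustrative defaults, an 8-frame clip spans 20 frames of context while still keeping four consecutive frames around the key frame, which is the kind of hierarchical view of action change the abstract describes.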

Key words: spatio-temporal action detection, ringed residual Video Swin Transformer, non-uniform sampling, cascaded pooling three-dimensional spatial pyramid

摘要: 针对人体异常行为时空动作检测对相似行为存在误检及局部肢体行为检测精度较低的问题,基于自制的不文明行为时空动作检测数据集(UBSAD),提出了一种融合非均匀采样与特征强化的人体不文明行为检测方法。该方法首先在时空特征提取阶段引入Video Swin Transformer(VST)作为主干网络,用于捕获视频中的长期时序依赖关系,提升网络全局信息学习能力;接下来,将提出的环形残差VST模块替换主干网络最后阶段的VST模块,放大目标区域和背景区域的差异,并结合多头自注意力机制,强化对目标区域的特征提取;此外,在视频帧采集阶段提出了独特的非均匀采样方法,根据任务需求调整输入数据分布,使模型有层次地获取动作变化信息,有效提升网络对相似行为细节特征的关注;最后,在特征提取网络之后嵌入新的融合浅层特征的级联池化三维空间金字塔特征强化模块,进一步增强多种尺度下的特征适用性,有效减少动作细节信息在特征提取过程中的丢失和降低背景信息的干扰,实现特征强化的效果。实验结果表明,该方法在UBSAD数据集和公开数据集UCF101-24上mAP指标分别达到了71.93%和83.09%,比使用基线网络VST作为时空特征提取模型分别提高了7.39%和1.22%,能够有效检测目标的行为。

关键词: 时空动作检测, 环形残差Video Swin Transformer, 非均匀采样, 级联池化三维空间金字塔