Journal of Frontiers of Computer Science and Technology, 2024, Vol. 18, Issue 12: 3219-3234. DOI: 10.3778/j.issn.1673-9418.2401064

• Graphics·Image •

Human Uncivilized Behavior Detection Method Integrating Non-uniform Sampling and Feature Enhancement

YE Hao, WANG Longye, ZENG Xiaoli, XIAO Yue   

  1. School of Electronics and Information Engineering, Southwest Petroleum University, Chengdu 610500, China
  2. School of Information Science and Technology, Tibet University, Lhasa 850000, China
  • Online: 2024-12-01 Published: 2024-11-29


Abstract: Spatio-temporal action detection of abnormal human behavior tends to confuse similar behaviors and has low accuracy on behaviors that involve only part of the body. To address these problems, a method that integrates non-uniform sampling and feature enhancement is proposed, built on a self-made uncivilized behavior spatio-temporal action detection dataset (UBSAD). First, the Video Swin Transformer (VST) serves as the backbone network in the spatio-temporal feature extraction stage, capturing long-term temporal dependencies in videos and strengthening the network's global information learning capability. Second, a ringed residual VST block replaces the standard VST block in the final stage of the backbone, enlarging the difference between the target area and the background area; combined with the multi-head self-attention mechanism, this strengthens feature extraction for the target area. Third, in the video frame collection stage, a non-uniform sampling method adjusts the input data distribution according to task requirements, so that the model obtains action change information in a hierarchical manner and pays more attention to the detailed features of similar behaviors. Finally, a new cascaded pooling three-dimensional spatial pyramid feature enhancement module that incorporates shallow features is embedded after the feature extraction network. It improves feature applicability at multiple scales, reduces the loss of detailed motion information during feature extraction, and reduces interference from background information. Experimental results show that the method achieves an mAP of 71.93% on UBSAD and 83.09% on the public dataset UCF101-24. These values are 7.39 and 1.22 percentage points higher, respectively, than those obtained with the baseline VST as the spatio-temporal feature extraction model, demonstrating that the method detects behaviors effectively.
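
The abstract does not state the concrete sampling rule, so the following is only a minimal sketch of one possible non-uniform scheme, not the authors' method. It assumes a hypothetical key frame index and picks frames densely near it and sparsely farther away by doubling the offset at each step. The function name `non_uniform_indices` and all of its parameters are illustrative assumptions.

```python
def non_uniform_indices(num_frames, key_idx, num_samples):
    """Return sorted frame indices that cluster around key_idx.

    Hypothetical illustration: offsets from the key frame grow as
    1, 2, 4, 8, ... so nearby frames are sampled densely and distant
    frames sparsely. This is NOT the rule used in the paper.
    """
    if not 0 <= key_idx < num_frames:
        raise ValueError("key_idx must lie inside the clip")
    if not 1 <= num_samples <= num_frames:
        raise ValueError("num_samples must be between 1 and num_frames")

    chosen = {key_idx}

    # Pass 1: symmetric offsets that double each step.
    step = 1
    while len(chosen) < num_samples and step < num_frames:
        for cand in (key_idx - step, key_idx + step):
            if 0 <= cand < num_frames and len(chosen) < num_samples:
                chosen.add(cand)
        step *= 2

    # Pass 2 (fallback for short clips): fill with the nearest unused frames.
    dist = 1
    while len(chosen) < num_samples:
        for cand in (key_idx - dist, key_idx + dist):
            if 0 <= cand < num_frames and len(chosen) < num_samples:
                chosen.add(cand)
        dist += 1

    return sorted(chosen)


if __name__ == "__main__":
    print(non_uniform_indices(32, 16, 8))
    # [8, 12, 14, 15, 16, 17, 18, 20]
```

In this sketch the gaps between consecutive sampled frames widen with distance from the key frame, which is one way a model could see fine-grained motion near the moment of interest while still receiving some longer-range context.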

Key words: spatio-temporal action detection, ringed residual Video Swin Transformer, non-uniform sampling, cascaded pooling three-dimensional spatial pyramid
