Journal of Frontiers of Computer Science and Technology, 2024, Vol. 18, Issue (5): 1286-1300. DOI: 10.3778/j.issn.1673-9418.2303110

• Graphics·Image

Dense Pedestrian Detection Based on Shifted Window Attention Multi-scale Equalization

YU Fan, ZHANG Jing   

  1. School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 200335, China
  • Online: 2024-05-01 Published: 2024-04-29

Abstract: Pedestrian targets in real-world scenes vary widely in shape and scale. Traditional methods often achieve low average precision on multi-scale pedestrian detection, whereas Transformer-based networks with attention mechanisms have shown strong performance in this field. However, multi-scale detection in dense scenes remains difficult. Dense scenes usually contain many occluded or small-scale pedestrian targets, which cause numerous false and missed detections and consume substantial computing resources. In addition, detecting every target accurately becomes extremely difficult when pedestrians overlap heavily. To address these issues, a multi-scale pedestrian detection algorithm for dense scenes based on shifted window attention is proposed. Modified Swin blocks in the backbone enable the network to extract more detailed features while reducing the heavy computational cost of the attention mechanism. To solve the feature fusion problem, DyHead blocks are used in the neck to unify multiple attention operations, thereby improving feature fusion efficiency. To address the feature balance problem, a feature scale-equalizing module based on full connections is designed, which constructs different residual structures between the levels of the feature pyramid to balance features and help the model generate higher-quality feature maps. Experimental results on the WiderPerson dataset show that the algorithm improves AP by 1.1 percentage points, with gains of 1.0 and 0.7 percentage points on small and medium targets, respectively, the two categories of greatest interest.

Key words: multi-scale pedestrian detection, deep learning, dense scenes, shifted window attention, feature fusion and balance
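
The shifted window attention named in the abstract builds on the Swin Transformer idea: self-attention is computed inside small non-overlapping windows, and alternate layers cyclically shift the feature map so that information can cross window borders. The following is a minimal sketch of that partitioning step only, in plain Python on a toy 4x4 grid of token ids. It is not the paper's modified Swin block, whose details the abstract does not give; the helper names `cyclic_shift`, `window_partition`, and `attention_pairs` are illustrative assumptions.

```python
def cyclic_shift(grid, shift):
    """Roll a 2-D grid up and left by `shift` cells with wrap-around."""
    h, w = len(grid), len(grid[0])
    return [[grid[(r + shift) % h][(c + shift) % w] for c in range(w)]
            for r in range(h)]


def window_partition(grid, window):
    """Split a 2-D grid into non-overlapping window x window tiles.

    Each tile is returned as a flat list of its tokens in row-major order.
    """
    h, w = len(grid), len(grid[0])
    assert h % window == 0 and w % window == 0, "grid must tile evenly"
    tiles = []
    for top in range(0, h, window):
        for left in range(0, w, window):
            tiles.append([grid[r][c]
                          for r in range(top, top + window)
                          for c in range(left, left + window)])
    return tiles


def attention_pairs(num_tokens, window):
    """Count query-key pairs for global vs. windowed self-attention."""
    global_pairs = num_tokens ** 2                 # every token sees every token
    window_pairs = num_tokens * window * window    # every token sees its window
    return global_pairs, window_pairs


# Toy 4x4 feature map whose entries are token ids 0..15.
feature_map = [[r * 4 + c for c in range(4)] for r in range(4)]

# Regular layer: windows aligned with the grid.
regular = window_partition(feature_map, window=2)

# Shifted layer: roll by half a window, then partition again.
# Tiles built from wrapped-around tokens (e.g. the last one) mix tokens
# that are not spatially adjacent; Swin masks those pairs during attention.
shifted = window_partition(cyclic_shift(feature_map, shift=1), window=2)

print(regular[0])              # tokens sharing the first regular window
print(shifted[0])              # tokens sharing the first shifted window
print(attention_pairs(16, 2))  # (global pairs, windowed pairs)
```

In this toy case the first shifted window groups one token from each of the four regular windows, which is how shifting lets neighbouring windows exchange information, while the windowed pair count grows linearly rather than quadratically with the number of tokens.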
