Dense Pedestrian Detection Based on Shifted Window Attention Multi-scale Equalization

doi:10.3778/j.issn.1673-9418.2303110

Abstract

Pedestrian targets in real-world scenes vary widely in shape and scale. Traditional methods often achieve low average precision on multi-scale pedestrian detection, whereas Transformer-based networks with attention mechanisms have shown strong performance in this field. Multi-scale detection in dense scenes nevertheless remains difficult. Dense scenes usually contain many occluded or small-scale pedestrian targets, which leads to many false and missed detections and consumes substantial computing resources. Accurately detecting every target also becomes extremely difficult when pedestrian targets overlap heavily. To address these issues, a multi-scale pedestrian detection algorithm for dense scenes based on shifted window attention is proposed. Modified Swin blocks in the backbone enable the network to extract more detailed features while reducing the heavy computational cost of attention. To address feature fusion, DyHead blocks are used in the neck to unify multiple attention operations, which improves fusion efficiency. To address feature balance, a feature scale-equalizing module based on full connection is designed; it constructs different residual structures between the levels of the feature pyramid to balance features and help the model generate higher-quality feature maps. Experimental results on the WiderPerson dataset show that the algorithm improves AP by 1.1 percentage points, with gains of 1.0 and 0.7 percentage points on small and medium targets, respectively, which are the categories of greatest interest.

Keywords: multi-scale pedestrian detection; deep learning; dense scenes; shifted window attention; feature fusion and balance
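
The abstract's claim that window-based attention reduces the computational burden can be illustrated with the standard cost estimates for global versus windowed multi-head self-attention given in the Swin Transformer paper. The sketch below is only an illustration of that general comparison, not the authors' modified Swin block; the function names and the example sizes (a 56 x 56 feature map, 96 channels, 7 x 7 windows) are assumptions chosen for the example.

```python
def global_msa_flops(h: int, w: int, c: int) -> int:
    """Approximate cost of global self-attention on an h x w map with c channels."""
    tokens = h * w
    # 4 * tokens * c^2 covers the Q, K, V and output projections;
    # 2 * tokens^2 * c covers the attention matrix and the weighted sum.
    return 4 * tokens * c * c + 2 * tokens * tokens * c


def window_msa_flops(h: int, w: int, c: int, m: int) -> int:
    """Approximate cost when attention is restricted to non-overlapping m x m windows."""
    tokens = h * w
    # Each token attends only to the m * m tokens inside its own window,
    # so the quadratic term in the token count becomes linear.
    return 4 * tokens * c * c + 2 * m * m * tokens * c


if __name__ == "__main__":
    # Illustrative sizes only (assumed, not taken from the paper's experiments).
    h, w, c, m = 56, 56, 96, 7
    g = global_msa_flops(h, w, c)
    s = window_msa_flops(h, w, c, m)
    print(f"global attention:   {g:,}")
    print(f"windowed attention: {s:,}")
    print(f"ratio: {g / s:.1f}x")
```

The global cost grows quadratically with the number of tokens, while the windowed cost grows linearly for a fixed window size, which is why window attention scales better to the high-resolution feature maps needed for small pedestrians. Shifting the windows between consecutive blocks restores information flow across window boundaries without changing this cost estimate.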

YU Fan, ZHANG Jing. Dense Pedestrian Detection Based on Shifted Window Attention Multi-scale Equalization[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(5): 1286-1300.
