Downsampling Algorithm with Fusion of Different Receptive Field Sizes in Deep Detection Methods

doi:10.3778/j.issn.1673-9418.2308064

Abstract

The strength of deep detection models stems primarily from the feature representation ability of the backbone network, in which down-sampling plays a key role in semantic integration. However, existing down-sampling approaches rely on small receptive fields and therefore often neglect the global structural information of features. To address this issue, this paper proposes a plug-and-play dual-path down-sampling method (DPDM), which improves the backbone network's support for subsequent detection by adding a large-receptive-field branch. While retaining the traditional small-receptive-field down-sampling path, DPDM constructs an efficient large-receptive-field branch that supplies the structural information of the sampled features. Inspired by the space-to-depth operation, this branch achieves the effect of a large receptive field with a conventional small convolution kernel. The dual-path operation increases feature diversity but does not by itself coordinate the two types of features. DPDM therefore merges the features of the two paths through channel concatenation followed by point-wise convolution. Taking the high-performing YOLO series as the baseline, experimental comparisons on three models (YOLOX, YOLOv5, YOLOv6) and multiple datasets demonstrate the effectiveness of the method in improving detection accuracy.

Keywords: deep learning, deep object detection, multi-scale object detection, down-sampling strategy
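
The dual-path idea summarized in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify kernel sizes, channel counts, normalization, or activations, so the 3x3 kernels, the block size of 2, and the function names `space_to_depth`, `conv2d`, and `dual_path_downsample` are all assumptions made for illustration. Under these assumptions, the small path is a 3x3 stride-2 convolution (each output sees a 3x3 input window), while the large path applies a 3x3 stride-1 convolution after a space-to-depth rearrangement, so each output sees a 6x6 window of the original input. The two results are concatenated along the channel axis and fused by a 1x1 (point-wise) convolution.

```python
import numpy as np

def space_to_depth(x, block=2):
    """Rearrange (C, H, W) into (C*block*block, H/block, W/block); no values are lost."""
    c, h, w = x.shape
    assert h % block == 0 and w % block == 0, "H and W must be divisible by block"
    x = x.reshape(c, h // block, block, w // block, block)
    # Move the two intra-block offsets (dy, dx) in front of the channel axis.
    x = x.transpose(2, 4, 0, 1, 3)
    return x.reshape(c * block * block, h // block, w // block)

def conv2d(x, weight, stride=1, pad=0):
    """Naive 2D cross-correlation. x: (C_in, H, W); weight: (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = weight.shape
    x = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h_out = (x.shape[1] - k) // stride + 1
    w_out = (x.shape[2] - k) // stride + 1
    out = np.zeros((c_out, h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[:, i * stride:i * stride + k, j * stride:j * stride + k]
            out[:, i, j] = np.tensordot(weight, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

def dual_path_downsample(x, w_small, w_large, w_fuse):
    """Hypothetical dual-path 2x down-sampling block (assumed layout, not the paper's)."""
    # Small receptive field path: 3x3 convolution with stride 2.
    small = conv2d(x, w_small, stride=2, pad=1)
    # Large receptive field path: space-to-depth, then a 3x3 stride-1 convolution.
    large = conv2d(space_to_depth(x, 2), w_large, stride=1, pad=1)
    # Channel concatenation followed by point-wise (1x1) convolution.
    merged = np.concatenate([small, large], axis=0)
    return conv2d(merged, w_fuse, stride=1, pad=0)

# Usage with random weights (illustrative shapes only).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))            # (C, H, W)
w_small = rng.standard_normal((8, 4, 3, 3))   # 4 -> 8 channels
w_large = rng.standard_normal((8, 16, 3, 3))  # 4*2*2 = 16 -> 8 channels
w_fuse = rng.standard_normal((8, 16, 1, 1))   # 8 + 8 = 16 -> 8 channels
y = dual_path_downsample(x, w_small, w_large, w_fuse)
print(y.shape)  # (8, 4, 4)
```

Because space-to-depth only moves pixels into channels, the large path halves the spatial resolution without discarding any input values, which is one plausible reading of how a conventional kernel can cover a wider input region.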


GU Zhenghua, LIU Gaqiong, SHAO Changbin, YU Hualong. Downsampling Algorithm with Fusion of Different Receptive Field Sizes in Deep Detection Methods[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(10): 2727-2737.

