CARFB: Plug-and-Play Object Detection Module

doi:10.3778/j.issn.1673-9418.2401082

Abstract

Coordinate attention (CA) has two limitations: its average pooling of horizontal and vertical features may lose the salient features of targets, and its ordinary two-dimensional convolution learns small-target features insufficiently. To overcome them, the CARFB (coordinate attention and receptive field block) module is proposed. First, max pooling is added to the average pooling of CA, so that both the salient and the detailed information of the input features is retained in the horizontal and vertical directions. Second, the RFB (receptive field block), which has receptive fields of different sizes, is applied separately in the horizontal and vertical directions in place of the single convolution that CA applies to the concatenated horizontal and vertical features, so that features of targets of different sizes are extracted simultaneously. Third, the CBS (convolution + batch normalization + SiLU) module, which contains convolution kernels of different sizes and strides, replaces the ordinary two-dimensional convolution of CA to further extract horizontal and vertical features and to produce the reweighted output features. The CARFB module thus preserves target position information in the horizontal and vertical directions and extracts strongly discriminative features of targets of different sizes through different receptive fields, which gives it a stronger feature learning capability. To verify the performance of the plug-and-play CARFB module, it is embedded into the ObjectBox object detector, yielding the ObjectBox-CARFB detector, and it is used to replace the RFB module in RFB net, yielding the CARFB net detector. Experiments on the MS COCO dataset show that the performance of ObjectBox-CARFB improves comprehensively, especially for small targets. Experiments on the PASCAL VOC and MS COCO datasets show that CARFB net300 and CARFB net512 outperform the original RFB net300 and RFB net512, respectively, as well as the other compared models.
The proposed CARFB module can be embedded into any convolutional neural network to enhance the performance of the original network. It has a stronger feature learning capability, retains more target information, and achieves better detection results on targets of different sizes, particularly on small targets.
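The directional pooling idea in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `directional_pool` and `average_only_pool` are hypothetical, the feature map is a single-channel toy example, and combining the two pooled values by summation is an assumption (the abstract states only that max pooling is introduced alongside average pooling).

```python
from statistics import fmean


def average_only_pool(feature_map):
    """Average-pool a 2D map along each direction, as in plain CA."""
    columns = list(zip(*feature_map))  # transpose to iterate over columns
    return [fmean(row) for row in feature_map], [fmean(col) for col in columns]


def directional_pool(feature_map):
    """Pool a 2D map (list of rows) into per-row and per-column descriptors.

    Each descriptor entry is average + maximum along that direction.
    Summing the two is an assumption made for this sketch.
    """
    if not feature_map or not feature_map[0]:
        raise ValueError("feature_map must be a non-empty 2D list")
    columns = list(zip(*feature_map))
    row_desc = [fmean(row) + max(row) for row in feature_map]
    col_desc = [fmean(col) + max(col) for col in columns]
    return row_desc, col_desc


# A 4x4 map with one strong activation, standing in for a small target.
fmap = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 8.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

avg_rows, avg_cols = average_only_pool(fmap)
mix_rows, mix_cols = directional_pool(fmap)
print(avg_rows)  # [0.0, 2.0, 0.0, 0.0]
print(mix_rows)  # [0.0, 10.0, 0.0, 0.0]
print(mix_cols)  # [0.0, 0.0, 10.0, 0.0]
```

In this toy case the average alone dilutes the single activation from 8.0 to 2.0, while the combined descriptor keeps the peak, and the row and column descriptors together still locate it at row 1, column 2.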

Key words: object detection, receptive field block (RFB), coordinate attention, small targets, deep learning

摘要： 针对坐标注意力（CA）在水平和垂直方向特征的平均池化可能丢失目标显著特征，以及使用二维普通卷积对小目标特征学习不足的情况，提出了CARFB（coordinate attention and receptive field block）模块。该模块将CA的平均池化修改为平均+最大池化，以保留输入特征在水平和垂直方向的显著和细节信息；利用RFB具有不同大小感受野的优势，在水平和垂直方向分别使用RFB模块代替CA的融合特征统一卷积，以同时提取不同大小目标的特征；引入包含不同大小卷积核和步长的CBS模块，替换CA的二维普通卷积，进一步提取水平和垂直方向的特征，得到重新加权的输出特征。CARFB模块在水平和垂直方向保存目标位置信息，利用不同感受野提取不同大小目标的强辨别性特征，从而具有更强的特征学习能力。为了验证提出的即插即用模块CARFB的性能，将其嵌入ObjectBox目标检测框架，得到ObjectBox-CARFB模型；用CARFB模块替换RFB net中的RFB模块，得到CARFB net目标检测模型。MS COCO数据集的实验测试表明，ObjectBox-CARFB模型的性能得到全面提升，尤其对小目标的检测性能提升突出；PASCAL VOC和MS COCO数据集的实验结果表明，CARFB net300和CARFB net512的目标检测能力分别优于原始RFB net300和RFB net512模型，并优于其他同系列对比模型。提出的CARFB模块具有更强的特征学习能力，对不同尺度目标均能取得较好的检测效果，特别是在小目标检测方面，效果提升显著。提出的CARFB模块可以嵌入到任何一个卷积神经网络，能保存更多的目标信息，具有更强的特征学习能力和更高的网络性能，对不同尺度目标均能取得较好的检测效果，尤其对小目标的检测效果提升显著。

关键词: 目标检测, 感受野模块（RFB）, 坐标注意力, 小目标, 深度学习

YANG Meijun, YAO Ruoxia, XIE Juanying. CARFB: Plug-and-Play Object Detection Module[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(1): 223-236.

杨梅君, 姚若侠, 谢娟英. CARFB：即插即用的目标检测模块[J]. 计算机科学与探索, 2025, 19(1): 223-236.
