Small Object Detection Based on Two-Stage Calculation Transformer

doi:10.3778/j.issn.1673-9418.2210120

Abstract

Although small object detection has improved significantly, several problems remain. Small objects carry little information, so their features are hard to extract and the original feature information is easily lost, which leads to poor detection results. To address this problem, this paper proposes a small object detection network based on a two-stage calculation Transformer (TCT). First, a two-stage calculation Transformer is embedded in the backbone feature extraction network for feature enhancement. Building on the value computation of the traditional single-stage Transformer, multiple 1D dilated convolutional layer branches with different feature fusion schemes are used to obtain global self-attention feature weights, which improves feature representation and information interaction. Second, an efficient residual connection module is proposed to replace the inefficient convolution and activation layers of the existing CSPLayer, which promotes information flow and helps the network learn richer contextual details. Finally, a feature fusion and refinement module is proposed to fuse multi-scale features and strengthen target feature representation. Quantitative and qualitative experiments on the PASCAL VOC2007+2012, COCO2017, and TinyPerson datasets show that, compared with YOLOX, the proposed algorithm extracts target features more effectively and achieves higher detection accuracy on small objects.

Key words: YOLOX, Transformer, small object detection, feature fusion and refinement
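
To make the phrase "multiple 1D dilated convolutional layer branches" concrete, the following is a minimal pure-Python sketch of how parallel 1D dilated convolutions with different dilation rates cover different context widths, and how their outputs can be combined. It is not the TCT module from the paper: the function names, the kernel, the dilation rates, and the element-wise sum used as the fusion step are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def dilated_conv1d(x: Sequence[float], kernel: Sequence[float], dilation: int) -> List[float]:
    """1D dilated convolution (cross-correlation form) with zero padding.

    The output has the same length as the input. Assumes an odd kernel size
    so that the kernel can be centred on each position.
    """
    half = (len(kernel) // 2) * dilation  # distance from centre tap to the outer tap
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j * dilation - half  # taps are spaced `dilation` apart
            if 0 <= idx < len(x):          # positions outside the input count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size: int, dilation: int) -> int:
    """Number of input positions spanned by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1


def multi_branch(x: Sequence[float], kernel: Sequence[float], dilations: Sequence[int]) -> List[float]:
    """Run one branch per dilation rate and fuse them by element-wise sum.

    Summation is an illustrative assumption; the paper describes several
    fusion schemes that are not specified in the abstract.
    """
    branches = [dilated_conv1d(x, kernel, d) for d in dilations]
    return [sum(values) for values in zip(*branches)]


if __name__ == "__main__":
    signal = [0, 0, 0, 1, 0, 0, 0]  # a single impulse makes each branch's reach visible
    print(multi_branch(signal, [1, 1, 1], dilations=(1, 2)))
```

With a kernel of size 3, dilation rates 1, 2, and 4 span 3, 5, and 9 input positions respectively, so wider context is reached without adding weights. This is the general motivation for combining branches with different dilation rates.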

XU Shoukun, GU Jianan, ZHUANG Lihua, LI Ning, SHI Lin, LIU Yi. Small Object Detection Based on Two-Stage Calculation Transformer[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(12): 2967-2983.

徐守坤, 顾佳楠, 庄丽华, 李宁, 石林, 刘毅. 基于两阶段计算Transformer的小目标检测[J]. 计算机科学与探索, 2023, 17(12): 2967-2983.
