Object Detection Based on Improved YOLOX-S Model in Construction Sites

doi:10.3778/j.issn.1673-9418.2205012

Abstract

The existing YOLOX-S model achieves low average precision (AP) for object detection under the complex environmental interference found on construction sites, and therefore does not meet the needs of practical applications well. To address this problem, the YOLOX-S model is improved in three ways: a structural re-parameterization module, a convolutional attention module, and the AdamW optimization algorithm are introduced. First, RepVGGBlock is used to decouple the model structure of the training phase from that of the testing phase; more residual structures are built into the Backbone and Neck during training to improve the model's feature extraction capability. Second, the LKA (large kernel attention) module is used to extract local feature information and long-distance dependencies, providing more effective attention guidance for the subsequent calculation of bounding-box position and size and improving detection average precision. Third, the AdamW optimization algorithm replaces Adam for updating the model parameters, which further improves the convergence result and the model's generalization ability. Finally, experiments are carried out on the MOCS (moving objects in construction sites) dataset. The results show that the improved YOLOX-S model raises the average precision over all targets by 3.3 percentage points, and the average precision for large, medium, and small objects by 3.2, 2.3, and 2.2 percentage points, respectively. At the same time, the computational cost of the improved YOLOX-S model does not increase significantly, so it better meets the average-precision requirements of object detection on construction sites while still running in real time.

Key words: object detection, construction sites, structural re-parameterization, large kernel attention, YOLOX-S

HU Hao (胡皓), GUO Fang (郭放), LIU Zhao (刘钊). Object Detection Based on Improved YOLOX-S Model in Construction Sites[J]. Journal of Frontiers of Computer Science and Technology (计算机科学与探索), 2023, 17(5): 1089-1101.
