Research on Lightweight Model of Multi-person Pose Estimation Based on Improved YOLOv8s-Pose

doi:10.3778/j.issn.1673-9418.2403059

Abstract

To address the high computational load and slow detection speed of existing human pose estimation models, this paper proposes a lightweight improved algorithm based on the YOLOv8s-Pose model. First, the lightweight module C2f-GhostNetBottleNeckV2 replaces the original C2f in the backbone, reducing the number of parameters. Second, the Non_Local attention mechanism is introduced to integrate the position information of human keypoints in the image into the channel dimension, which strengthens feature extraction and mitigates the accuracy loss that often follows model lightweighting. Third, a weighted bidirectional feature pyramid network is incorporated into the neck to improve feature fusion and keep features of different scales well balanced. Fourth, a small-object detection head is added to the network to reduce missed detections of small objects. Finally, the CIOU loss function is replaced with Focal-EIOU to improve the accuracy of human keypoint regression. Experiments on the COCO2017 human keypoint dataset show that, compared with the original model, the improved model has 9.3% fewer parameters while improving mAP@0.50 by 0.4 percentage points and mAP@0.50:0.95 by 0.6 percentage points. The proposed algorithm therefore reduces the number of model parameters while improving pose estimation accuracy, especially for small objects, and provides an effective means of achieving real-time, accurate pose estimation.
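
The loss replacement mentioned in the abstract can be illustrated with a minimal sketch of Focal-EIOU for a single pair of boxes. It assumes the published formulation by Zhang et al. (Neurocomputing, 2022): the EIOU loss adds centre-distance, width, and height penalties to 1 - IoU, and the focal variant multiplies the result by IoU raised to a power gamma. The function name `focal_eiou_loss`, the `(x1, y1, x2, y2)` box layout, the default `gamma=0.5`, and the `eps` guard are assumptions made for this illustration, not details taken from the paper's implementation.

```python
def focal_eiou_loss(pred, target, gamma=0.5, eps=1e-9):
    """Focal-EIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2).

    Illustrative sketch only; gamma=0.5 and eps are assumed values.
    """
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Intersection over union of the two boxes.
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)

    # Width and height of the smallest box enclosing both boxes.
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)

    # EIoU penalties: centre distance, width difference, height difference,
    # each normalised by the enclosing box.
    dx = (px1 + px2) / 2 - (tx1 + tx2) / 2
    dy = (py1 + py2) / 2 - (ty1 + ty2) / 2
    center_term = (dx * dx + dy * dy) / (cw * cw + ch * ch + eps)
    width_term = (pw - tw) ** 2 / (cw * cw + eps)
    height_term = (ph - th) ** 2 / (ch * ch + eps)
    eiou = 1.0 - iou + center_term + width_term + height_term

    # Focal weighting: boxes with higher IoU contribute more to the loss.
    return (iou ** gamma) * eiou
```

In this form, identical boxes give a loss close to zero, and a pair of boxes with no overlap receives zero focal weight, so a practical training loss would typically combine it with other terms.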

Key words: pose estimation, YOLOv8s-Pose, GhostNetV2 network, weighted bidirectional feature pyramid network, loss function

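Abstract (Chinese version): To address the heavy computation and slow detection speed of existing human pose estimation models, a lightweight improved algorithm based on the YOLOv8s-Pose model is proposed. The lightweight module C2f-GhostNetBottleNeckV2 is introduced into the backbone to replace the original C2f, reducing the parameter count and increasing model speed. The Non_Local attention mechanism is introduced to capture and propagate the positions of human keypoints and to fuse global information directly, giving subsequent layers richer and deeper semantic information, increasing the depth and breadth of information processing, strengthening feature extraction, and reducing the accuracy loss that follows model lightweighting. A weighted bidirectional feature pyramid network is then introduced into the neck; following the idea of bidirectional fusion, it re-plans the top-down and bottom-up information flow paths so that features of different scales are well balanced. A small-object detection head is added to the network to reduce missed detections of small objects. The CIOU loss function is replaced with the Focal-EIOU loss function to improve robustness in complex and multi-object scenes. Experimental results show that the improved model has 9.3% fewer parameters and that, on the COCO2017 human keypoint dataset, it improves mAP@0.50 by 0.4 percentage points and mAP@0.50:0.95 by 0.6 percentage points over the original model. The proposed lightweight algorithm thus reduces the number of model parameters while improving the accuracy of human pose estimation, with a marked improvement for small-object detection, and provides an effective means of achieving real-time, accurate pose estimation.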
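
The weighted fusion at each node of a bidirectional feature pyramid can be sketched as follows. This assumes the fast normalized fusion rule from the original weighted bidirectional feature pyramid design in EfficientDet, where each input feature gets a learnable non-negative weight and the weights are normalised by their sum; the abstract does not state which fusion variant the paper uses. The function name `fast_normalized_fusion`, the flat-list feature representation, and `eps=1e-4` are assumptions for illustration, and the inputs are taken to be already resized to a common shape.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature vectors with normalised non-negative weights.

    features: list of equal-length lists of floats (one per input path).
    weights:  one scalar weight per input path (learnable in a real network).
    """
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per input feature")

    # ReLU keeps every weight non-negative so the normalisation is stable.
    clipped = [max(0.0, w) for w in weights]
    total = sum(clipped) + eps

    # Weighted sum of the inputs, element by element.
    length = len(features[0])
    return [
        sum(w * f[i] for w, f in zip(clipped, features)) / total
        for i in range(length)
    ]
```

With equal weights the result is close to the element-wise mean of the inputs, and an input whose weight is negative is dropped from the fusion entirely.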

Key words (Chinese version): pose estimation, YOLOv8s-Pose, GhostNetV2 network, weighted bidirectional feature pyramid network, loss function

FU Yu, GAO Shuhui. Research on Lightweight Model of Multi-person Pose Estimation Based on Improved YOLOv8s-Pose[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(3): 682-692.

傅裕, 高树辉. 改进YOLOv8s-Pose多人姿态估计轻量化模型研究[J]. 计算机科学与探索, 2025, 19(3): 682-692.
