Journal of Frontiers of Computer Science and Technology ›› 2021, Vol. 15 ›› Issue (1): 163-172.DOI: 10.3778/j.issn.1673-9418.2003008

• Graphics and Image •

Video Target Detection Based on Improved YOLOV3 Algorithm

SONG Yanyan, TAN Li, MA Zihao, REN Xueping   

  1. School of Computer and Information Engineering, Beijing Technology and Business University, Beijing 100048, China
  • Online: 2021-01-01  Published: 2021-01-07


Abstract:

Pedestrian detection in surveillance video faces complex backgrounds, wide variation in target scale and pose, and occlusion between people and surrounding objects. As a result, the YOLOV3 algorithm detects some targets inaccurately, producing false detections, missed detections, or repeated detections. Therefore, building on the YOLOV3 network and following the residual-structure idea, shallow and deep features are upsampled and fused to obtain a 104×104-scale detection layer, and the bounding-box sizes clustered by the K-means algorithm are applied to the network layer of each scale. This increases the network's sensitivity to multi-scale, multi-pose targets and improves detection performance. At the same time, the YOLOV3 loss function is updated with a repulsion loss that penalizes overlap between a predicted box and surrounding non-target objects, so that the predicted box moves closer to the correct target and away from wrong ones. This reduces the model's false-detection rate and improves detection when targets occlude each other. Experimental results on the MOT16 dataset show that the proposed network model achieves better detection performance than the YOLOV3 algorithm, demonstrating the effectiveness of the method.
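The anchor-clustering step mentioned above follows the standard YOLO recipe: cluster ground-truth box sizes with K-means under a 1 − IoU distance, then assign the resulting anchors to the detection scales. A minimal NumPy sketch of that recipe is shown below; it is not the paper's code, and the function names and the choice of k = 12 (three anchors for each of the four scales, including the added 104×104 layer) are illustrative assumptions.

```python
import numpy as np

def iou_wh(boxes, anchors):
    # IoU between (w, h) pairs, treating boxes as if they share a corner;
    # this is the distance used in YOLO anchor clustering.
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
             np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    area_b = boxes[:, 0] * boxes[:, 1]
    area_a = anchors[:, 0] * anchors[:, 1]
    return inter / (area_b[:, None] + area_a[None, :] - inter)

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) box sizes with distance = 1 - IoU."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # assign each box to the anchor it overlaps most (smallest 1 - IoU)
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else anchors[i] for i in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    # sort by area so anchors map naturally to small -> large detection scales
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]
```

The smallest anchors would go to the new 104×104 layer and the largest to the 13×13 layer, matching each scale's receptive field.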

Key words: target detection, YOLOV3 algorithm, repulsion loss, deep learning, video understanding
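The repulsion term added to the YOLOV3 loss follows the RepGT idea from the repulsion-loss literature: penalize the intersection-over-ground-truth (IoG) between a predicted box and its most-overlapping *non-assigned* ground truth, using a smooth-ln penalty. The sketch below is an illustrative NumPy reconstruction under those assumptions, not the authors' implementation; `sigma` and the function names are hypothetical.

```python
import numpy as np

def iog(pred, gt):
    # intersection over the ground-truth area ("IoG"), boxes as (x1, y1, x2, y2)
    x1, y1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    x2, y2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return inter / ((gt[2] - gt[0]) * (gt[3] - gt[1]))

def smooth_ln(x, sigma=0.5):
    # logarithmic near 0, linear near 1, so large overlaps are not over-penalized
    return np.where(x <= sigma,
                    -np.log(np.maximum(1.0 - x, 1e-9)),
                    (x - sigma) / (1.0 - sigma) - np.log(1.0 - sigma))

def rep_gt_loss(pred_boxes, gt_boxes, assigned):
    # For each prediction, penalize overlap with its most-overlapping
    # non-assigned ground truth, pushing the box away from wrong targets.
    loss = 0.0
    for i, p in enumerate(pred_boxes):
        others = [iog(p, g) for j, g in enumerate(gt_boxes) if j != assigned[i]]
        if others:
            loss += float(smooth_ln(max(others)))
    return loss / max(len(pred_boxes), 1)
```

A box sitting cleanly on its own target incurs no penalty, while a box drifting onto a neighboring pedestrian is pushed back, which is how the method mitigates occlusion-induced false detections.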
