协同级联网络和对抗网络的目标检测

doi:10.3778/j.issn.1673-9418.2007059

计算机科学与探索 ›› 2022, Vol. 16 ›› Issue (1): 217-230.DOI: 10.3778/j.issn.1673-9418.2007059

协同级联网络和对抗网络的目标检测

李志欣¹^,⁺(), 陈圣嘉¹, 周韬¹, 马慧芳²

1.广西师范大学广西多源信息挖掘与安全重点实验室,广西桂林 541004
2.西北师范大学计算机科学与工程学院,兰州 730070

收稿日期:2020-07-03 修回日期:2020-09-09 出版日期:2022-01-01 发布日期:2020-09-25
通讯作者: + E-mail: lizx@gxnu.edu.cn
作者简介:李志欣（1971—）,男,博士,教授,博士生导师,CCF高级会员,主要研究方向为图像理解、机器学习、自然语言处理、跨媒体计算。
陈圣嘉（1994—）,男,硕士,主要研究方向为机器学习、图像理解。
周韬（1993—）,男,硕士,主要研究方向为机器学习、图像理解。
马慧芳（1981—）,女,博士,教授,硕士生导师,CCF会员,主要研究方向为数据挖掘、自然语言处理。
基金资助:
国家自然科学基金(61966004);国家自然科学基金(61663004);国家自然科学基金(61762078);国家自然科学基金(61866004);广西自然科学基金(2019GXNSFDA245018);广西自然科学基金(2018GXNSFDA281009);广西自然科学基金(2017GXNSFAA198365);广西研究生教育创新计划项目(YCSW2020111)

Combining Cascaded Network and Adversarial Network for Object Detection

LI Zhixin¹^,⁺(), CHEN Shengjia¹, ZHOU Tao¹, MA Huifang²

1. Guangxi Key Lab of Multi-source Information Mining and Security, Guangxi Normal University, Guilin, Guangxi 541004, China
2. College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China

Received:2020-07-03 Revised:2020-09-09 Online:2022-01-01 Published:2020-09-25
About author:LI Zhixin, born in 1971, Ph.D., professor, Ph.D. supervisor, senior member of CCF. His research interests include image understanding, machine learning, natural language processing and cross-media computing.
CHEN Shengjia, born in 1994, M.S. His research interests include machine learning and image understanding.
ZHOU Tao, born in 1993, M.S. His research interests include machine learning and image understanding.
MA Huifang, born in 1981, Ph.D., professor, M.S. supervisor, member of CCF. Her research interests include data mining and natural language processing.
Supported by:
National Natural Science Foundation of China(61966004);National Natural Science Foundation of China(61663004);National Natural Science Foundation of China(61762078);National Natural Science Foundation of China(61866004);Natural Science Foundation of Guangxi(2019GXNSFDA245018);Natural Science Foundation of Guangxi(2018GXNSFDA281009);Natural Science Foundation of Guangxi(2017GXNSFAA198365);Innovation Project of Guangxi Graduate Education(YCSW2020111)

摘要/Abstract

摘要：

识别多尺度目标和遮挡目标是目标检测中的重点和难点。为了检测不同大小的目标,目标检测器通常利用卷积神经网络（CNN）的多尺度特征图层次结构,然而这种自顶向下的结构由于底层特征图的卷积层较小,缺乏获取小目标特征所需的细节信息,这些目标检测器的性能受到了限制。为此,结合Faster R-CNN框架提出Collaborative R-CNN,设计了一种级联网络结构,可以融合多尺度特征图,以生成深度融合的特征信息来增强小目标所需的细节特征,从而提高检测小目标的能力。此外,由于使用RoIPooling过程中的量化会对小目标检测造成极大的限制,为进一步提高方法的鲁棒性,设计了多尺度RoIAlign来消除这种量化,并通过多尺度的池化来提高网络检测不同尺度目标的能力。最后,将对抗网络与所提出的级联网络相结合,生成包含遮挡目标的训练样本,可显著提高模型的分类能力和识别遮挡目标的鲁棒性。在PASCAL VOC 2012和PASCAL VOC 2007数据集上的实验结果表明,提出的方法优于许多先进的方法。

关键词: 目标检测, 卷积神经网络（CNN）, 特征融合, 级联网络, 对抗网络

Abstract:

Recognizing multi-scale objects and objects with occlusions is a key and difficult point of task in object detection. In order to detect objects with different sizes, the object detector usually uses the hierarchical structure of multi-scale feature map constructed by convolutional neural network (CNN). However, due to the small convolution layer of the bottom feature map, the top-down structure lacks the detailed information needed to capture the features of small object. The performance of these object detectors is limited. Therefore, based on the Faster R-CNN (region-convolutional neural network) framework, this paper proposes Collaborative R-CNN. This paper designs a cascaded network structure that integrates multi-scale feature maps to generate deeply fused feature information and thereby improving the ability to detect small objects. Moreover, the quantization in the RoIPooling process greatly limits the recognition ability of small objects. In order to further improve the robustness of the method, a multi-scale RoIAlign is designed to eliminate such quantization, and the ability of network to detect objects with different scales is improved by multi-scale pooling. Finally, this paper combines an adversarial network with the proposed network to generate training samples with occlusions, significantly improving the classification ability of the model, and robustness to detect occlusions. Experimental results for the PASCAL VOC 2012 and PASCAL VOC 2007 datasets demonstrate the superiority of proposed approach relative to several state-of-the-art approaches.

Key words: object detection, convolutional neural network (CNN), feature fusion, cascaded network, adversarial network

中图分类号:

TP391

李志欣, 陈圣嘉, 周韬, 马慧芳. 协同级联网络和对抗网络的目标检测[J]. 计算机科学与探索, 2022, 16(1): 217-230.

LI Zhixin, CHEN Shengjia, ZHOU Tao, MA Huifang. Combining Cascaded Network and Adversarial Network for Object Detection[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(1): 217-230.

图/表 13

参考文献 39

[1]	GIRSHICK R B, DONAHUE J, DARRELL T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]// Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, Jun 23-28, 2014. Washington: IEEE Computer Society, 2014: 580-587.
[2]	GIRSHICK R B. Fast R-CNN[C]// Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Dec 7-13, 2015. Washington: IEEE Computer Society, 2015: 1440-1448.
[3]	REN S Q, HE K M, GIRSHICK R B, et al. Faster R-CNN: towards real-time object detection with region proposal networks[C]// Proceedings of the Annual Conference on Neural In-formation Processing Systems, Montreal, Dec 7-12, 2015. Red Hook: Curran Associates, 2015: 91-99.
[4]	RUSSAKOVSKY O, DENG J, SU H, et al. ImageNet large scale visual recognition challenge[J]. International Journal of Computer Vision, 2015, 115(3):211-252. DOI URL
[5]	WEI S T, LI Z X, ZHANG C L. Combined constraint-based with metric-based in semi-supervised clustering ensemble[J]. International Journal of Machine Learning and Cybernetics, 2018, 9(7):1085-1100. DOI URL
[6]	WEI Y C, XIA W, LIN M, et al. HCP: a flexible CNN frame-work for multi-label image classification[J]. IEEE Transac-tions on Pattern Analysis and Machine Intelligence, 2015, 38(9):1901-1907.
[7]	ZHENG Y Z, LI Z X, ZHANG C L. A hybrid architecture based on CNN for cross-modal semantic instance annotation[J]. Multimedia Tools and Applications, 2018, 77(7):8695-8710. DOI URL
[8]	DAI J F, LI Y, HE K M, et al. R-FCN: object detection via region-based fully convolutional networks[C]// Proceedings of the Annual Conference on Neural Information Processing Systems, Barcelona, Dec 5-10, 2016. Red Hook: Curran Associates, 2016: 379-387.
[9]	KONG T, YAO A B, CHEN Y R, et al. HyperNet: towards accurate region proposal generation and joint object detection[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun 27-30, 2016. Washington: IEEE Computer Society, 2016: 845-853.
[10]	SERMANET P, EIGEN D, ZHANG X, et al. OverFeat: integrated recognition, localization and detection using con-volutional networks[J]. arXiv:1312.6229, 2013.
[11]	LIN T Y. DOLLÁR P, GIRSHICK R B, et al. Feature pyramid networks for object detection[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recogni-tion, Honolulu, Jul 21-26, 2017. Washington: IEEE Computer Society, 2017: 936-944.
[12]	HE K M, ZHANG X Y, REN S Q, et al. Spatial pyramid pooling in deep convolutional networks for visual recogni-tion[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(9):1904-1916. DOI URL
[13]	EVERINGHAM M, VAN GOOL L, WILLIAMS C I, et al. The PASCAL visual object classes (VOC) challenge[J]. Inter-national Journal of Computer Vision, 2010, 88(2):303-338.
[14]	UIJLINGS J R R, VAN DE SANDE K E A, GEVERS T, et al. Selective search for object recognition[J]. International Journal of Computer Vision, 2013, 104(2):154-171. DOI URL
[15]	LIU W, ANGUELOV D, ERHAN D, et al. SSD: single shot multibox detector[C]// LNCS 9905: Proceedings of the 14th European Conference on Computer Vision, Amsterdam, Oct 11-14, 2016. Cham: Springer, 2016: 21-37.
[16]	REDMON J, DIVVALA S K, GIRSHICK R B, et al. You only look once: unified, real-time object detection[C]// Pro-ceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun 27-30, 2016. Wash-ington: IEEE Computer Society, 2016: 779-788.
[17]	KONG T, SUN F C, YAO A B, et al. RON: reverse connec-tion with objectness prior networks for object detection[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Washington: IEEE Computer Society, 2017: 5244-5252.
[18]	刘云, 钱美伊, 李辉, 等. 深度学习的多尺度多人目标检测方法研究[J]. 计算机工程与应用, 2020, 56(6):172-179.
	LIU Y, QIAN M Y, LI H, et al. Research on multi-scale and multi-human detection method of deep learning[J]. Computer Engineering and Applications, 2020, 56(6):172-179.
[19]	杨雅茹, 邓红霞, 王哲, 等. 浅层特征融合引导的深层网络行人检测[J]. 计算机工程与应用, 2020, 56(2):196-200.
	YANG Y R, DENG H X, WANG Z, et al. Deep network pedestrian detection guided by shallow feature fusion[J]. Computer Engineering and Applications, 2020, 56(2):196-200.
[20]	陈幻杰, 王琦琦, 杨国威, 等. 多尺度卷积特征融合的SSD目标检测算法[J]. 计算机科学与探索, 2019, 13(6):1049-1061.
	CHEN H J, WANG Q Q, YANG G W, et al. SSD object detection algorithm with multi-scale convolution feature fusion[J]. Journal of Frontiers of Computer Science and Technology, 2019, 13(6):1049-1061.
[21]	LIU Y, WANG R P, SHAN S G, et al. Structure inference net: object detection using scene-level context and instance-level relationships[C]// Proceedings of the 2018 IEEE Con-ference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Washington: IEEE Computer Society, 2018: 6985-6994.
[22]	HE C H, LAI S C, LAM K M, et al. Improving object detection with relation graph inference[C]// Proceedings of the 2019 International Conference on Acoustics Speech and Signal Processing, Brighton, May 12-17, 2019. Piscataway: IEEE, 2019: 2537-2541.
[23]	REDMON J, FARHADI A. YOLOv3: an incremental improve- ment[J]. arXiv:1804.02767, 2018.
[24]	ZHOU X Y, WANG D Q, KRÄHENBÜHL P. Objects as points[J]. arXiv:1904.07850, 2019.
[25]	WANG X L, SHRIVASTAVA A, GUPTA A. A-Fast-RCNN: hard positive generation via adversary for object detection[C]// Proceedings of the 2017 IEEE Conference on Comp-uter Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Washington: IEEE Computer Society, 2017: 3039-3048.
[26]	ZHOU T, LI Z X, ZHANG C L, et al. An improved convo-lutional neural network model with adversarial net for multi-label image classification[C]// LNCS 11013: Proceedings of the 15th Pacific Rim International Conference on Artificial Intelligence, Nanjing, Aug 28-31, 2018. Cham: Springer, 2018: 38-46.
[27]	CHEN Y L, WANG Z C, PENG Y X, et al. Cascaded pyramid network for multi-person pose estimation[C]// Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Washington: IEEE Computer Society, 2018: 7103-7112.
[28]	SZEGEDY C, IOFFE S, VANHOUCKE V, et al. Inception-v4, Inception-ResNet and the impact of residual connections on learning[J]. arXiv:1602.07261, 2016.
[29]	SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[J]. arXiv:1409.1556, 2014
[30]	KRIZHEVSKY A, SUTSKEVER I, HINTON G E. Image-Net classification with deep convolutional neural networks[C]// Proceedings of the Advances in Neural Information Processing Systems. Red Hook: Curran Associates, 2012: 1106-1114.
[31]	OQUAB M, BOTTOU L, LAPTEV I, et al. Learning and transferring mid-level image representations using convolu-tional neural networks[C]// Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, Jun 23-28, 2014. Washington: IEEE Computer Society, 2014: 1717-1724.
[32]	ZEILER M D, FERGUS R. Visualizing and understanding convolutional networks[C]// LNCS 8689: Proceedings of the 13th European Conference on Computer Vision, Zurich, Sep 6-12, 2014. Cham: Springer, 2014: 818-833.
[33]	YOSINSKI J, CLUNE J, BENGIO Y, et al. How transfer-able are features in deep neural networks?[C]// Proceedings of the Annual Conference on Neural Information Processing Systems, Montreal, Dec 8-13, 2014. Red Hook: Curran Ass-ociates, 2014: 3320-3328.
[34]	ZHANG C L, LUO J H, WEI X S, et al. In defense of fully connected layers in visual representation transfer[C]// LNCS 10736: Proceedings of the 18th Pacific-Rim Conference on Multimedia Advances in Multimedia Information Processing, Harbin, Sep 28-29, 2017. Cham: Springer, 2017: 807-817.
[35]	HE K M, GKIOXARI G. DOLLÁR P, et al. Mask R-CNN[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Oct 22-29, 2017. Washington: IEEE Computer Society, 2017: 2961-2969.
[36]	JIANG Y, ZHU X, WANG X, et al. R2CNN: rotational region CNN for orientation robust scene text detection[J]. arXiv:1706.09579, 2017.
[37]	HUANG G, LIU Z, VAN DER MAATEN L, et al. Densely connected convolutional networks[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Washington: IEEE Computer Society, 2017: 2261-2269.
[38]	REN S Q, HE K M, GIRSHICK R B, et al. Object detection networks on convolutional feature maps[J]. IEEE Transac-tions on Pattern Analysis and Machine Intelligence, 2016, 39(7):1476-1481.
[39]	BELL S, ZITNICK C L, BALA K, et al. Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks[C]// Proceedings of the 2016 IEEE Confer-ence on Computer Vision and Pattern Recognition, Las Vegas, Jun 27-30, 2016. Washington: IEEE Computer Society, 2016: 2874-2883.

Method	Anchor	Pooling sizes	mAP/%
Faster R-CNN	(128,256,512)	7×7	73.2
Faster R-CNN	(64,128,256,512)	7×7	73.3
Cascaded network	(64,128,256,512)	7×7	73.9
Cascaded network+RoIAligns	(64,128,256,512)	3×11、11×3、7×7	74.5
Cascaded network+RoIAligns (Improved R-CNN)	(64,128,256,512)	3×11、11×3、7×7、11×11	74.8
Improved R-CNN+FT	(64,128,256,512)	3×11、11×3、7×7、11×11	75.2
Improved R-CNN+FT+ASDN (Collaborative R-CNN)	(64,128,256,512)	3×11、11×3、7×7、11×11	77.5

Method	Anchor	Pooling sizes	mAP/%
Faster R-CNN	(128,256,512)	7×7	73.2
Faster R-CNN	(64,128,256,512)	7×7	73.3
Cascaded network	(64,128,256,512)	7×7	73.9
Cascaded network+RoIAligns	(64,128,256,512)	3×11、11×3、7×7	74.5
Cascaded network+RoIAligns (Improved R-CNN)	(64,128,256,512)	3×11、11×3、7×7、11×11	74.8
Improved R-CNN+FT	(64,128,256,512)	3×11、11×3、7×7、11×11	75.2
Improved R-CNN+FT+ASDN (Collaborative R-CNN)	(64,128,256,512)	3×11、11×3、7×7、11×11	77.5

Method	Backbone	Train data	Input resolution/pixel	mAP/%
Faster R-CNN^[3]	VGG16	07+12	600×1 000	73.2
A-Fast-RCNN^[25]	VGG16	07+12	600×1 000	71.4
NOC^[38]	VGG16	07+12	600×1 000	73.3
SSD^[15]	VGG16	07+12	—	75.1
RON^[17]	VGG16	07+12	384×384	77.6
ION^[39]	VGG16	07+12	600×1 000	75.6
SIN^[21]	VGG16	07+12	600×1 000	76.0
RGC^[22]	VGG16	07+12	600×1 000	76.1
SSD321^[20]	VGG16	07+12	321×321	77.1
YOLOv3^[16]	DarkNet	07+12	320×320	78.6
CenterNet^[24]	ResNet101	07+12	384×384	78.7
Collaborative R-CNN	VGG16	07+12	600×1 000	77.5

Method	Backbone	Train data	Input resolution/pixel	mAP/%
Faster R-CNN^[3]	VGG16	07+12	600×1 000	73.2
A-Fast-RCNN^[25]	VGG16	07+12	600×1 000	71.4
NOC^[38]	VGG16	07+12	600×1 000	73.3
SSD^[15]	VGG16	07+12	—	75.1
RON^[17]	VGG16	07+12	384×384	77.6
ION^[39]	VGG16	07+12	600×1 000	75.6
SIN^[21]	VGG16	07+12	600×1 000	76.0
RGC^[22]	VGG16	07+12	600×1 000	76.1
SSD321^[20]	VGG16	07+12	321×321	77.1
YOLOv3^[16]	DarkNet	07+12	320×320	78.6
CenterNet^[24]	ResNet101	07+12	384×384	78.7
Collaborative R-CNN	VGG16	07+12	600×1 000	77.5

Object	AP
Object	Faster R-CNN^[3]	A-Fast-RCNN^[25]	SSD^[15]	RON^[17]	CollaborativeR-CNN
aero	84.9	82.2	84.9	86.5	87.0
bike	79.8	75.6	82.6	82.9	83.5
bird	74.3	69.2	74.4	76.6	78.9
blt	53.9	52.0	55.8	60.9	60.1
boat	49.8	47.2	50.0	55.8	57.6
bus	77.5	76.3	80.3	81.7	83.2
car	75.9	71.2	78.9	80.2	80.5
cat	88.5	88.5	88.8	91.1	90.2
chair	45.6	46.8	53.7	57.3	51.6
cow	77.1	74.0	76.8	81.1	82.4
dog	55.3	58.1	59.4	60.4	61.6
hrs	86.9	85.6	87.6	87.2	89.9
mbk	81.7	80.3	83.7	84.8	89.8
per	80.9	80.5	82.6	84.9	82.8
plant	79.6	74.7	81.4	81.7	86.6
shp	40.1	41.5	47.2	51.9	47.4
sofa	72.6	70.4	75.5	79.1	74.2
table	60.9	62.2	65.6	68.6	70.0
train	81.2	77.4	84.3	84.1	86.6
tv	61.5	67.0	68.1	70.3	69.9
mAP	70.4	69.0	73.1	75.4	75.7

协同级联网络和对抗网络的目标检测

Combining Cascaded Network and Adversarial Network for Object Detection

RichHTML

PDF

可视化

摘要/Abstract

引用本文

使用本文

图/表 13

参考文献 39

相关文章 15

编辑推荐

Metrics

Type	Method	Test time/(ms/image)	mAP/%
One-stage	SSD	46	73.1
One-stage	RON	67	75.4
Two-stage	Faster R-CNN	200	70.4
Two-stage	Collaborative R-CNN	234	75.7

[1]	安凤平, 李晓薇, 曹翔. 权重初始化-滑动窗口CNN的医学图像分类[J]. 计算机科学与探索, 2022, 16(8): 1885-1897.
[2]	夏鸿斌, 肖奕飞, 刘渊. 融合自注意力机制的长文本生成对抗网络模型[J]. 计算机科学与探索, 2022, 16(7): 1603-1610.
[3]	彭豪, 李晓明. 多尺度选择金字塔网络的小样本目标检测算法[J]. 计算机科学与探索, 2022, 16(7): 1649-1660.
[4]	孙方伟, 李承阳, 谢永强, 李忠博, 杨才东, 齐锦. 深度学习应用于遮挡目标检测算法综述[J]. 计算机科学与探索, 2022, 16(6): 1243-1259.
[5]	赵运基, 范存良, 张新良. 融合多特征和通道感知的目标跟踪算法[J]. 计算机科学与探索, 2022, 16(6): 1417-1428.
[6]	申瑞彩, 翟俊海, 侯璎真. 选择性集成学习多判别器生成对抗网络[J]. 计算机科学与探索, 2022, 16(6): 1429-1438.
[7]	董文轩, 梁宏涛, 刘国柱, 胡强, 于旭. 深度卷积应用于目标检测算法综述[J]. 计算机科学与探索, 2022, 16(5): 1025-1042.
[8]	林佳伟, 王士同. 用于无监督域适应的深度对抗重构分类网络[J]. 计算机科学与探索, 2022, 16(5): 1107-1116.
[9]	程卫月, 张雪琴, 林克正, 李骜. 融合全局与局部特征的深度卷积神经网络算法[J]. 计算机科学与探索, 2022, 16(5): 1146-1154.
[10]	童敢, 黄立波. Winograd快速卷积相关研究综述[J]. 计算机科学与探索, 2022, 16(5): 959-971.
[11]	裴利沈, 赵雪专. 群体行为识别深度学习方法研究综述[J]. 计算机科学与探索, 2022, 16(4): 775-790.
[12]	赵鹏飞, 谢林柏, 彭力. 融合注意力机制的深层次小目标检测算法[J]. 计算机科学与探索, 2022, 16(4): 927-937.
[13]	伏轩仪, 张銮景, 梁文科, 毕方明, 房卫东. 锚点机制在目标检测领域的发展综述[J]. 计算机科学与探索, 2022, 16(4): 791-805.
[14]	陆仲达, 张春达, 张佳奇, 王子菲, 许军华. 双分支网络的苹果叶部病害识别[J]. 计算机科学与探索, 2022, 16(4): 917-926.
[15]	包广斌, 李港乐, 王国雄. 面向多模态情感分析的双模态交互注意力[J]. 计算机科学与探索, 2022, 16(4): 909-916.