Journal of Frontiers of Computer Science and Technology ›› 2022, Vol. 16 ›› Issue (2): 323-336. DOI: 10.3778/j.issn.1673-9418.2106004
RUAN Chenzhao, ZHANG Xiangsen, LIU Ke, ZHAO Zengshun+
+ Corresponding author, E-mail: zhaozengshun@163.com
Received: 2021-06-01
Revised: 2021-08-06
Online: 2022-02-01
Published: 2021-08-19
About author: RUAN Chenzhao, born in 1996 in Zibo, Shandong, M.S. candidate. His research interests include computer vision and image processing.
Abstract: Human-object interaction (HOI) detection takes an image as input and detects the humans and objects in it that interact, together with the interaction verbs between them. Following object detection, image segmentation and object tracking, it is a further task in computer vision, aimed at a deeper understanding of images. To fill the current gap in survey articles on deep-learning-based HOI detection, this paper classifies and analyzes deep-learning-based HOI detection methods along the main line of their development. It first briefly summarizes early techniques, then divides existing algorithms by model structure into two-stage and one-stage methods and analyzes representative algorithms of each kind. Two-stage methods are further discussed in three categories: those incorporating attention, graph models, and pose and body parts, summarizing the basic idea, advantages and drawbacks of each category. In addition, the evaluation metrics, benchmark datasets and experimental results of most existing methods are presented in detail, and the results achieved by each category of methods are explained. Finally, the main challenges facing the technique are summarized and future development trends are discussed.
RUAN Chenzhao, ZHANG Xiangsen, LIU Ke, ZHAO Zengshun. Progress on Human-Object Interaction Detection of Deep Learning[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(2): 323-336.
Category | Subcategory | Representative work | Advantages | Limitations | Applicable scenarios
---|---|---|---|---|---
Two-stage | Attention-based | BAR-CNN[24], iCAN[25], Wang[29] | Extracts contextual information effectively; accuracy greatly improved over HO-RCNN | No information beyond visual and spatial cues is introduced; accuracy still limited | Scenarios with ample training samples and high compute power, where real-time performance is less critical
 | Graph-model-based | GPNN[38], Wu[41], VS-GATs[42], VSGNet[43], SAG[44], DRG[45] | Predicts all interaction pairs in an image at once; resolves pairing ambiguity | Rarely introduces information beyond visual and spatial cues to help build the graph; demanding on hardware |
 | Pose/body-part-based | TIN[47], PMFNet[48], RPNN[49], PFNet[50], MLCNet[51], PMN[52] | Integrates human pose or body-part information effectively; relatively high accuracy | Computationally heavy and time-consuming; demanding on hardware |
One-stage | — | PPDM[53], IP-Net[54], UnionDet[55], AS-Net[59] | Fast detection, high accuracy, easy to deploy | Model construction and training are relatively complex | Scenarios demanding real-time performance and high accuracy
Table 1 Comparison of different HOI detection methods
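The two-stage paradigm summarized in Table 1 can be illustrated with a minimal sketch: stage one detects instances, stage two enumerates human-object pairs and scores an interaction verb for each pair. All function names and the toy outputs below are hypothetical, a conceptual illustration rather than any cited paper's implementation.

```python
from itertools import product

def detect_instances(image):
    # Stage 1: an off-the-shelf detector (e.g. Faster R-CNN) would run here.
    # Dummy output for illustration: (label, box, score) triples.
    return [("person", (10, 10, 50, 120), 0.98),
            ("bicycle", (40, 60, 110, 130), 0.91)]

def score_interaction(human_box, object_box):
    # Stage 2: a real model scores verbs from visual/spatial features;
    # here we just return a fixed toy distribution over two verbs.
    return {"ride": 0.8, "hold": 0.1}

def detect_hoi(image, verb_threshold=0.5):
    dets = detect_instances(image)
    humans = [d for d in dets if d[0] == "person"]
    objects = [d for d in dets if d[0] != "person"]
    triplets = []
    # Exhaustive human-object pairing is what makes two-stage methods
    # accurate but relatively slow compared with one-stage methods.
    for (_, hbox, hscore), (olabel, obox, oscore) in product(humans, objects):
        for verb, vscore in score_interaction(hbox, obox).items():
            if vscore >= verb_threshold:
                # The triplet score is typically the product of the
                # human, object and verb confidences.
                triplets.append((hbox, verb, olabel, obox,
                                 hscore * oscore * vscore))
    return triplets

print(detect_hoi(None))
```

With the dummy detector above, only the "ride" verb clears the threshold, so a single (human, ride, bicycle) triplet is returned.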
分类 | 子类 | 代表性工作 | 优点 | 局限 | 适用场景 |
---|---|---|---|---|---|
两阶段方法 | 融入注意力 | BAR-CNN[ ICAN[ Wang[ | 能够有效提取上下文信息,准确率相较HO-RCNN有很大提高 | 除视觉信息和空间信息外并没有额外信息的引入,准确率有待提高 | 适用于训练样本充足,硬件算力高,对实时性要求较低的场景 |
融入图模型 | GPNN[ Wu[ VS-GATs[ VSGNet [ SAG [ DRG [ | 可同时预测图像中的所有交互对,能够消除配对歧义 | 鲜有视觉和空间信息外的额外信息的引入来帮助构建图模型,对硬件要求高 | ||
融入身体部位和姿态 | TIN[ PMFNet[ RPNN[ PFNet[ MLCNet[ PMN[ | 有效整合人的身体姿势或身体部分信息,准确率相对较高 | 计算量大且费时,对硬件要求高 | ||
一阶段方法 | — | PPDM[ IP-Net[ UnionDet[ AS-Net[ | 检测速度快,准确率高,易于部署 | 模型的构建与训练较为复杂 | 适用于对实时性、准确率要求较高的场景 |
Ground truth | Predicted Positive | Predicted Negative
---|---|---
True | TP (true positive) | FN (false negative)
False | FP (false positive) | TN (true negative)
Table 2 Confusion matrix
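Precision and recall, the quantities from which the mAP metric used throughout the result tables is built, follow directly from the confusion-matrix counts above. A minimal sketch:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy counts: 8 correct detections, 2 false alarms, 8 missed instances.
p, r = precision_recall(tp=8, fp=2, fn=8)
print(p, r)  # 0.8 0.5
```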
Method | Backbone | mAP/%
---|---|---
Gupta[14] | ResNet-50-FPN | 31.8
InteractNet[23] | ResNet-50-FPN | 40.0
BAR-CNN[24] | Inception-ResNet | 41.1
iCAN[25] | ResNet-50 | 45.3
Wang[29] | ResNet-50 | 47.3
GPNN[38] | Res-DCN-152 | 44.0
Wang[40] | ResNet-50-FPN | 52.7
Wu[41] | VGG-16 | 44.6
VS-GATs[42] | ResNet-50-FPN | 50.6
VSGNet[43] | ResNet-152 | 51.8
DRG[45] | ResNet-50-FPN | 51.0
TIN[47] | ResNet-50 | 47.8
PMFNet[48] | ResNet-50-FPN | 52.0
RPNN[49] | ResNet-50 | 47.5
PFNet[50] | ResNet-50 | 52.8
MLCNet[51] | ResNet-50-FPN | 55.2
VS-GATs+PMN[52] | ResNet-50-FPN | 51.8
IP-Net[54] | Hourglass-104 | 51.0
UnionDet[55] | ResNet-50-FPN | 47.5
AS-Net[59] | ResNet-50 | 53.9
Table 3 Results on V-COCO dataset
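The mAP figures above average a per-class average precision (AP) over interaction classes. As a sketch of how one class's AP arises from a score-ranked detection list, the simplified, non-interpolated variant below integrates precision over recall (benchmark toolkits typically add precision interpolation on top of this):

```python
def average_precision(ranked_hits, num_gt):
    """AP for one class. ranked_hits: detections sorted by descending
    confidence, True where a detection matched a ground-truth instance.
    num_gt: number of ground-truth instances of this class."""
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for hit in ranked_hits:
        if hit:
            tp += 1
            recall = tp / num_gt
            precision = tp / (tp + fp)
            # Accumulate area under the precision-recall curve.
            ap += (recall - prev_recall) * precision
            prev_recall = recall
        else:
            fp += 1
    return ap

# Three detections, two ground truths; hits at ranks 1 and 3.
print(average_precision([True, False, True], num_gt=2))
```

For the toy input, precision is 1.0 at recall 0.5 and 2/3 at recall 1.0, giving AP = 0.5 + 1/3 ≈ 0.833.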
Method | Backbone | Default full | Default rare | Default non-rare | Known Object full | Known Object rare | Known Object non-rare
---|---|---|---|---|---|---|---
HO-RCNN[15] | CaffeNet | 7.81 | 5.37 | 8.54 | 10.41 | 8.94 | 10.85
InteractNet[23] | ResNet-50-FPN | 9.94 | 7.16 | 10.77 | — | — | —
iCAN[25] | ResNet-50 | 14.84 | 10.45 | 16.15 | 16.26 | 11.33 | 17.73
Wang[29] | ResNet-50 | 16.24 | 11.16 | 17.75 | 17.33 | 12.78 | 19.21
GPNN[38] | Res-DCN-152 | 13.11 | 9.34 | 14.23 | — | — | —
Wang[40] | ResNet-50-FPN | 17.57 | 16.85 | 17.78 | 21.00 | 20.74 | 21.08
Wu[41] | VGG-16 | 13.55 | 9.62 | 15.20 | — | — | —
VS-GATs[42] | ResNet-50-FPN | 20.27 | 16.03 | 21.54 | — | — | —
VSGNet[43] | ResNet-152 | 19.80 | 16.05 | 20.91 | — | — | —
SAG[44] | ResNet-50-FPN | 18.26 | 13.40 | 19.71 | — | — | —
DRG[45] | ResNet-50-FPN | 19.26 | 17.74 | 19.71 | 23.40 | 21.75 | 23.89
TIN[47] | ResNet-50 | 17.03 | 13.42 | 18.11 | 19.17 | 15.51 | 20.26
PMFNet[48] | ResNet-50-FPN | 17.46 | 15.65 | 18.00 | 20.34 | 17.47 | 21.20
RPNN[49] | ResNet-50 | 17.35 | 12.78 | 18.71 | — | — | —
PFNet[50] | ResNet-50 | 20.05 | 16.66 | 21.07 | 24.01 | 21.09 | 24.89
MLCNet[51] | ResNet-50-FPN | 17.95 | 16.62 | 18.35 | 22.28 | 20.73 | 22.74
VS-GATs+PMN[52] | ResNet-50-FPN | 21.21 | 17.60 | 22.29 | — | — | —
PPDM[53] | Hourglass-104 | 21.73 | 13.78 | 24.10 | 24.58 | 16.65 | 26.84
IP-Net[54] | Hourglass-104 | 19.56 | 12.79 | 21.58 | 22.05 | 15.77 | 23.92
UnionDet[55] | ResNet-50-FPN | 17.58 | 11.72 | 19.33 | 19.76 | 14.68 | 21.27
AS-Net[59] | ResNet-50 | 28.87 | 24.25 | 30.25 | 31.74 | 27.07 | 33.14
Table 4 Results on HICO-DET dataset (mAP/%)
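The mAP values in Tables 3 and 4 rest on the standard HOI matching rule: a predicted triplet counts as a true positive only when its interaction class is correct and both its human box and its object box overlap the corresponding ground-truth boxes with IoU ≥ 0.5. A minimal sketch of that rule:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def hoi_match(pred, gt, thresh=0.5):
    """True positive only if the verb matches and BOTH the human and the
    object box reach IoU >= thresh with their ground-truth counterparts."""
    (ph, pv, po), (gh, gv, go) = pred, gt
    return pv == gv and iou(ph, gh) >= thresh and iou(po, go) >= thresh

pred = ((0, 0, 10, 10), "ride", (5, 5, 15, 15))
gt = ((0, 0, 10, 10), "ride", (5, 5, 15, 15))
print(hoi_match(pred, gt))  # True
```

Because both boxes must overlap, a prediction with a perfect human box but a badly localized object box is still a false positive, which is one reason HICO-DET scores are much lower than plain object-detection scores.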
[1] CHAO Y W, WANG Z, HE Y, et al. HICO: a benchmark for recognizing human-object interactions in images[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Dec 11-18, 2015. Washington: IEEE Computer Society, 2015: 1017-1025.
[2] ZHOU Y Z. Investigation and application of human-object interaction detection algorithm[D]. Quanzhou: Huaqiao University, 2019.
[3] HUI W S, LI H J, CHEN M, et al. Robotic tactile recognition and adaptive grasping control based on CNN-LSTM[J]. Chinese Journal of Scientific Instrument, 2019, 40(1): 211-218.
[4] DALAL N, TRIGGS B. Histograms of oriented gradients for human detection[C]//Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, Jun 20-26, 2005. Washington: IEEE Computer Society, 2005: 886-893.
[5] LOWE D G. Distinctive image features from scale-invariant keypoints[J]. International Journal of Computer Vision, 2004, 60(2): 91-110.
[6] GUPTA A, DAVIS L S. Objects in action: an approach for combining action understanding and object perception[C]//Proceedings of the 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Minneapolis, Jun 17-22, 2007. Washington: IEEE Computer Society, 2007: 1-8.
[7] GUPTA A, KEMBHAVI A, DAVIS L S. Observing human-object interactions: using spatial and functional compatibility for recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(10): 1775-1789.
[8] YAO B, LI F F. Grouplet: a structured image representation for recognizing human and object interactions[C]//Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, Jun 13-18, 2010. Washington: IEEE Computer Society, 2010: 9-16.
[9] YAO B, LI F F. Modeling mutual context of object and human pose in human-object interaction activities[C]//Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, Jun 13-18, 2010. Washington: IEEE Computer Society, 2010: 17-24.
[10] YAO B, JIANG X, KHOSLA A, et al. Human action recognition by learning bases of action attributes and parts[C]//Proceedings of the 2011 IEEE International Conference on Computer Vision, Barcelona, Nov 6-13, 2011. Washington: IEEE Computer Society, 2011: 1331-1338.
[11] DELAITRE V, SIVIC J, LAPTEV I. Learning person-object interactions for action recognition in still images[C]//Proceedings of the 25th Annual Conference on Neural Information Processing Systems, Granada, Dec 12-14, 2011. Red Hook: Curran Associates, 2011: 1503-1511.
[12] DESAI C, RAMANAN D. Detecting actions, poses, and objects with relational phraselets[C]//LNCS 7575: Proceedings of the 12th European Conference on Computer Vision, Oct 7-13, 2012. Berlin, Heidelberg: Springer, 2012: 158-172.
[13] HU J F, ZHENG W S, LAI J, et al. Recognising human-object interaction via exemplar based modelling[C]//Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, Dec 1-8, 2013. Washington: IEEE Computer Society, 2013: 3144-3151.
[14] GUPTA S, MALIK J. Visual semantic role labeling[J]. arXiv:1505.04474, 2015.
[15] CHAO Y W, LIU Y, LIU X, et al. Learning to detect human-object interactions[C]//Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision, Lake Tahoe, Mar 12-15, 2018. Washington: IEEE Computer Society, 2018: 381-389.
[16] REN S Q, HE K M, GIRSHICK R B, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
[17] GIRSHICK R B, DONAHUE J, DARRELL T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of the 27th IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Columbus, Jun 23-28, 2014. Washington: IEEE Computer Society, 2014: 580-587.
[18] GIRSHICK R B. Fast R-CNN[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Dec 7-13, 2015. Washington: IEEE Computer Society, 2015: 1440-1448.
[19] SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks[C]//Proceedings of the 28th Annual Conference on Neural Information Processing Systems, Montreal, Dec 8-13, 2014. Red Hook: Curran Associates, 2014: 3104-3112.
[20] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator[C]//Proceedings of the 2015 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Boston, Jun 7-12, 2015. Washington: IEEE Computer Society, 2015: 3156-3164.
[21] CHAN W, JAITLY N, LE Q, et al. Listen, attend and spell: a neural network for large vocabulary conversational speech recognition[C]//Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, Mar 20-25, 2016. Piscataway: IEEE, 2016: 4960-4964.
[22] GKIOXARI G, TOSHEV A, JAITLY N. Chained predictions using convolutional neural networks[C]//LNCS 9908: Proceedings of the 14th European Conference on Computer Vision, Amsterdam, Oct 11-14, 2016. Cham: Springer, 2016: 728-743.
[23] GKIOXARI G, GIRSHICK R B, DOLLÁR P, et al. Detecting and recognizing human-object interactions[C]//Proceedings of the 2018 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Washington: IEEE Computer Society, 2018: 8359-8367.
[24] KOLESNIKOV A, KUZNETSOVA A, LAMPERT C H, et al. Detecting visual relationships using box attention[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop, Seoul, Oct 27-28, 2019. Piscataway: IEEE, 2019: 1749-1753.
[25] GAO C, ZOU Y, HUANG J B. iCAN: instance-centric attention network for human-object interaction detection[J]. arXiv:1808.10437, 2018.
[26] CHERON G, LAPTEV I, SCHMID C. P-CNN: pose-based CNN features for action recognition[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Dec 7-13, 2015. Washington: IEEE Computer Society, 2015: 3218-3226.
[27] MALLYA A, LAZEBNIK S. Learning models for actions and person-object interactions with transfer to question answering[C]//LNCS 9905: Proceedings of the 14th European Conference on Computer Vision, Amsterdam, Oct 11-14, 2016. Cham: Springer, 2016: 414-428.
[28] GKIOXARI G, GIRSHICK R B, MALIK J. Contextual action recognition with R*CNN[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Dec 7-13, 2015. Washington: IEEE Computer Society, 2015: 1080-1088.
[29] WANG T C, ANWER R M, KHAN M H, et al. Deep contextual attention for human-object interaction detection[C]//Proceedings of the 2019 IEEE International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway: IEEE, 2019: 5694-5702.
[30] GILMER J, SCHOENHOLZ S S, RILEY P F, et al. Neural message passing for quantum chemistry[C]//Proceedings of the 34th International Conference on Machine Learning, Sydney, Aug 6-11, 2017. New York: ACM, 2017: 1263-1272.
[31] JAIN A, ZAMIR A R, SAVARESE S, et al. Structural-RNN: deep learning on spatio-temporal graphs[C]//Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun 27-30, 2016. Washington: IEEE Computer Society, 2016: 5308-5317.
[32] LI R Y, TAPASWI M, LIAO R J, et al. Situation recognition with graph neural networks[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Oct 22-29, 2017. Washington: IEEE Computer Society, 2017: 4183-4192.
[33] MARINO K, SALAKHUTDINOV R, GUPTA A. The more you know: using knowledge graphs for image classification[J]. arXiv:1612.04844, 2016.
[34] XU D F, ZHU Y K, CHOY C B, et al. Scene graph generation by iterative message passing[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Washington: IEEE Computer Society, 2017: 3097-3106.
[35] LIANG X D, SHEN X H, FENG J S, et al. Semantic object parsing with graph LSTM[C]//LNCS 9905: Proceedings of the 14th European Conference on Computer Vision, Amsterdam, Oct 11-14, 2016. Cham: Springer, 2016: 125-143.
[36] YUAN Y, LIANG X D, WANG X L, et al. Temporal dynamic graph LSTM for action-driven video object detection[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Oct 22-29, 2017. Washington: IEEE Computer Society, 2017: 1819-1828.
[37] TENEY D, LIU L Q, VAN DEN HENGEL A. Graph-structured representations for visual question answering[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Washington: IEEE Computer Society, 2017: 3233-3241.
[38] QI S Y, WANG W G, JIA B X, et al. Learning human-object interactions by graph parsing neural networks[C]//LNCS 11213: Proceedings of the 15th European Conference on Computer Vision, Munich, Sep 8-14, 2018. Cham: Springer, 2018: 407-423.
[39] KOPPULA H S, SAXENA A. Anticipating human activities using object affordances for reactive robotic response[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 38(1): 14-29.
[40] WANG H, ZHENG W S, LING Y B. Contextual heterogeneous graph network for human-object interaction detection[C]//LNCS 12362: Proceedings of the 16th European Conference on Computer Vision, Glasgow, Aug 23-28, 2020. Cham: Springer, 2020: 248-264.
[41] WU W, LIU Z Y. Graph-based human-object interactions recognition[J]. Computer Engineering and Applications, 2021, 57(3): 175-181.
[42] LIANG Z J, ROJAS J, LIU J F, et al. Visual-semantic graph attention networks for human-object interaction detection[J]. arXiv:2001.02302, 2020.
[43] ULUTAN O, IFTEKHAR A S M, MANJUNATH B S. VSGNet: spatial attention network for detecting human object interactions using graph convolutions[C]//Proceedings of the 2020 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, Jun 13-19, 2020. Piscataway: IEEE, 2020: 13614-13623.
[44] ZHANG F Z, CAMPBELL D, GOULD S. Spatio-attentive graphs for human-object interaction detection[J]. arXiv:2012.06060, 2020.
[45] GAO C, XU J R, ZOU Y L, et al. DRG: dual relation graph for human-object interaction detection[C]//LNCS 12357: Proceedings of the 16th European Conference on Computer Vision, Glasgow, Aug 23-28, 2020. Cham: Springer, 2020: 696-712.
[46] FANG H S, CAO J K, TAI Y W, et al. Pairwise body-part attention for recognizing human-object interactions[C]//LNCS 11214: Proceedings of the 15th European Conference on Computer Vision, Munich, Sep 8-14, 2018. Cham: Springer, 2018: 51-67.
[47] LI Y L, ZHOU S Y, HUANG X J, et al. Transferable interactiveness knowledge for human-object interaction detection[C]//Proceedings of the 2019 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, Jun 16-20, 2019. Piscataway: IEEE, 2019: 3585-3594.
[48] WAN B, ZHOU D S, LIU Y F, et al. Pose-aware multi-level feature network for human object interaction detection[C]//Proceedings of the 2019 IEEE International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway: IEEE, 2019: 9468-9477.
[49] ZHOU P H, CHI M M. Relation parsing neural network for human-object interaction detection[C]//Proceedings of the 2019 IEEE International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway: IEEE, 2019: 843-851.
[50] LIU H C, MU T J, HUANG X L. Detecting human-object interaction with multi-level pairwise feature network[J]. Computational Visual Media, 2021, 7(2): 229-239.
[51] SUN X, HU X W, REN T W, et al. Human object interaction detection via multi-level conditioned network[C]//Proceedings of the 2020 International Conference on Multimedia Retrieval, Dublin, Jun 8-11, 2020. New York: ACM, 2020: 26-34.
[52] LIANG Z J, LIU J F, GUAN Y S, et al. Pose-based modular network for human-object interaction detection[J]. arXiv:2008.02042, 2020.
[53] LIAO Y, LIU S, WANG F, et al. PPDM: parallel point detection and matching for real-time human-object interaction detection[C]//Proceedings of the 2020 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, Jun 14-19, 2020. Washington: IEEE Computer Society, 2020: 479-487.
[54] WANG T, YANG T, MARTIN D, et al. Learning human-object interaction detection using interaction points[C]//Proceedings of the 2020 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, Jun 14-19, 2020. Washington: IEEE Computer Society, 2020: 4116-4125.
[55] KIM B, CHOI T, KANG J, et al. UnionDet: union-level detector towards real-time human-object interaction detection[C]//LNCS 12360: Proceedings of the 16th European Conference on Computer Vision, Glasgow, Aug 23-28, 2020. Cham: Springer, 2020: 498-514.
[56] LIU W, ANGUELOV D, ERHAN D, et al. SSD: single shot multibox detector[C]//Proceedings of the 14th European Conference on Computer Vision, Amsterdam, Oct 11-14, 2016. Cham: Springer, 2016: 21-37.
[57] LIN T Y, GOYAL P, GIRSHICK R, et al. Focal loss for dense object detection[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Oct 22-29, 2017. Washington: IEEE Computer Society, 2017: 2999-3007.
[58] ZHOU P, NI B, GENG C, et al. Scale-transferrable object detection[C]//Proceedings of the 2018 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Washington: IEEE Computer Society, 2018: 528-537.
[59] CHEN M, LIAO Y, LIU S, et al. Reformulating HOI detection as adaptive set prediction[J]. arXiv:2103.05983, 2021.
[60] LIN T Y, MAIRE M, BELONGIE S J, et al. Microsoft COCO: common objects in context[C]//LNCS 8693: Proceedings of the 13th European Conference on Computer Vision, Zurich, Sep 6-12, 2014. Cham: Springer, 2014: 740-755.
[61] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun 27-30, 2016. Washington: IEEE Computer Society, 2016: 770-778.
[62] LIN T Y, DOLLÁR P, GIRSHICK R B, et al. Feature pyramid networks for object detection[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Washington: IEEE Computer Society, 2017: 936-944.
[63] DAI J F, QI H Z, XIONG Y W, et al. Deformable convolutional networks[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Oct 22-29, 2017. Washington: IEEE Computer Society, 2017: 764-773.
[64] JIA Y Q, SHELHAMER E, DONAHUE J, et al. Caffe: convolutional architecture for fast feature embedding[C]//Proceedings of the 2014 ACM Conference on Multimedia, Orlando, Nov 3-7, 2014. New York: ACM, 2014: 675-678.
[65] NEWELL A, YANG K Y, JIA D. Stacked hourglass networks for human pose estimation[C]//LNCS 9912: Proceedings of the 14th European Conference on Computer Vision, Amsterdam, Oct 11-14, 2016. Cham: Springer, 2016: 483-499.
[66] SHEN L Y, YEUNG S, HOFFMAN J, et al. Scaling human-object interaction recognition through zero-shot learning[C]//Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision, Lake Tahoe, Mar 12-15, 2018. Washington: IEEE Computer Society, 2018: 1568-1576.
[67] JI Z, LIU X Y, PANG Y W, et al. Few-shot human-object interaction recognition with semantic-guided attentive prototypes network[J]. IEEE Transactions on Image Processing, 2020, 30: 1648-1661.
[68] LIU X Y, JI Z, PANG Y W, et al. DGIG-Net: dynamic graph-in-graph networks for few-shot human-object interaction[J]. IEEE Transactions on Cybernetics, 2021: 1-13. DOI: 10.1109/TCYB.2021.3049537.