Journal of Frontiers of Computer Science and Technology ›› 2022, Vol. 16 ›› Issue (2): 323-336. DOI: 10.3778/j.issn.1673-9418.2106004
• Surveys and Frontiers •
Progress on Human-Object Interaction Detection of Deep Learning
RUAN Chenzhao, ZHANG Xiangsen, LIU Ke, ZHAO Zengshun+
Received: 2021-06-01
Revised: 2021-08-06
Online: 2022-02-01
Published: 2021-08-19
About author: RUAN Chenzhao, born in 1996 in Zibo, Shandong, M.S. candidate. His research interests include computer vision and image processing.
+ Corresponding author. E-mail: zhaozengshun@163.com
RUAN Chenzhao, ZHANG Xiangsen, LIU Ke, ZHAO Zengshun. Progress on Human-Object Interaction Detection of Deep Learning[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(2): 323-336.
URL: http://fcst.ceaj.org/EN/10.3778/j.issn.1673-9418.2106004
| Category | Sub-category | Representative works | Advantages | Limitations | Applicable scenarios |
| --- | --- | --- | --- | --- | --- |
| Two-stage methods | Attention-based | BAR-CNN, iCAN, Wang et al. | Effectively extract contextual information; accuracy greatly improved over HO-RCNN | No information beyond visual and spatial features is introduced; accuracy still limited | Suited to scenarios with sufficient training samples and ample computing power, where real-time performance is not critical |
| | Graph-model-based | GPNN, Wu et al., VS-GATs, VSGNet, SAG, DRG | Predict all interaction pairs in an image at once and resolve pairing ambiguity | Rarely introduce information beyond visual and spatial features to help build the graph; high hardware demands | |
| | Body-part- and pose-based | TIN, PMFNet, RPNN, PFNet, MLCNet, PMN | Effectively integrate human pose or body-part information; relatively high accuracy | Computationally heavy and time-consuming; high hardware demands | |
| One-stage methods | — | PPDM, IP-Net, UnionDet, AS-Net | Fast detection, high accuracy, easy to deploy | Model construction and training are relatively complex | Suited to scenarios demanding high real-time performance and accuracy |

Table 1 Comparison of different HOI detection methods
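To make the contrast in Table 1 concrete, the sketch below illustrates the generic control flow of the two families: two-stage methods run an object detector first and then score every human-object pair, while one-stage methods predict detections and interactions in a single forward pass and associate them afterwards. This is our own illustration rather than the pipeline of any single cited method, and every parameter (`detector`, `pair_features`, `interaction_head`, `joint_network`, `match_interactions`) is a hypothetical placeholder.

```python
from typing import Callable, List, Tuple

# Generic two-stage HOI pipeline: detect instances first, then classify each human-object pair.
def two_stage_hoi(image,
                  detector: Callable,          # returns (human_boxes, object_boxes)
                  pair_features: Callable,     # visual + spatial (+ pose/graph) features of one pair
                  interaction_head: Callable   # returns [(verb, score), ...] for one pair
                  ) -> List[Tuple]:
    humans, objects = detector(image)                     # stage 1: off-the-shelf object detector
    triplets = []
    for h in humans:                                      # stage 2: enumerate candidate pairs
        for o in objects:
            feats = pair_features(image, h, o)
            for verb, score in interaction_head(feats):
                triplets.append((h, verb, o, score))      # <human, verb, object> triplet
    return triplets

# Generic one-stage HOI pipeline: predict boxes and interactions jointly, then associate them,
# e.g. by matching interaction points or set predictions to detected boxes.
def one_stage_hoi(image,
                  joint_network: Callable,        # single forward pass -> (detections, interactions)
                  match_interactions: Callable    # grouping/matching step
                  ) -> List[Tuple]:
    detections, interactions = joint_network(image)
    return match_interactions(interactions, detections)
```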
| Ground truth | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| True | TP (true positive) | FN (false negative) |
| False | FP (false positive) | TN (true negative) |

Table 2 Confusion matrix
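The mean average precision (mAP) reported on V-COCO and HICO-DET below is built from the quantities in Table 2: a detected ⟨human, verb, object⟩ triplet counts as a true positive only if it matches a ground-truth pair. The following is a minimal sketch of the underlying per-class computation; the function name and the area-under-curve form without interpolation are our assumptions, and the benchmarks' official evaluation tools differ in details.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """Illustrative AP for one interaction class from score-ranked detections.

    scores           : confidence of each detected <human, verb, object> triplet
    is_true_positive : 1 if the detection matches an unmatched ground-truth pair
                       (e.g. IoU > 0.5 for both the human and the object box), else 0
    num_gt           : number of ground-truth pairs of this class (TP + FN)
    """
    order = np.argsort(-np.asarray(scores, dtype=float))     # rank detections by score
    tp = np.asarray(is_true_positive, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / max(num_gt, 1)                          # TP / (TP + FN)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-9)    # TP / (TP + FP)
    # area under the precision-recall curve (no interpolation)
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap

# mAP is the mean of the per-class AP values over all interaction classes.
```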
| Method | Backbone | mAP/% |
| --- | --- | --- |
| Gupta et al. | ResNet-50-FPN | 31.8 |
| InteractNet | ResNet-50-FPN | 40.0 |
| BAR-CNN | Inception-ResNet | 41.1 |
| iCAN | ResNet-50 | 45.3 |
| Wang et al. | ResNet-50 | 47.3 |
| GPNN | Res-DCN-152 | 44.0 |
| Wang et al. | ResNet-50-FPN | 52.7 |
| Wu et al. | VGG-16 | 44.6 |
| VS-GATs | ResNet-50-FPN | 50.6 |
| VSGNet | ResNet-152 | 51.8 |
| DRG | ResNet-50-FPN | 51.0 |
| TIN | ResNet-50 | 47.8 |
| PMFNet | ResNet-50-FPN | 52.0 |
| RPNN | ResNet-50 | 47.5 |
| PFNet | ResNet-50 | 52.8 |
| MLCNet | ResNet-50-FPN | 55.2 |
| VS-GATs+PMN | ResNet-50-FPN | 51.8 |
| IP-Net | Hourglass-104 | 51.0 |
| UnionDet | ResNet-50-FPN | 47.5 |
| AS-Net | ResNet-50 | 53.9 |

Table 3 Results on V-COCO data set
| Method | Backbone | Default (full) | Default (rare) | Default (non-rare) | Known Object (full) | Known Object (rare) | Known Object (non-rare) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| HO-RCNN | CaffeNet | 7.81 | 5.37 | 8.54 | 10.41 | 8.94 | 10.85 |
| InteractNet | ResNet-50-FPN | 9.94 | 7.16 | 10.77 | — | — | — |
| iCAN | ResNet-50 | 14.84 | 10.45 | 16.15 | 16.26 | 11.33 | 17.73 |
| Wang et al. | ResNet-50 | 16.24 | 11.16 | 17.75 | 17.33 | 12.78 | 19.21 |
| GPNN | Res-DCN-152 | 13.11 | 9.34 | 14.23 | — | — | — |
| Wang et al. | ResNet-50-FPN | 17.57 | 16.85 | 17.78 | 21.00 | 20.74 | 21.08 |
| Wu et al. | VGG-16 | 13.55 | 9.62 | 15.20 | — | — | — |
| VS-GATs | ResNet-50-FPN | 20.27 | 16.03 | 21.54 | — | — | — |
| VSGNet | ResNet-152 | 19.80 | 16.05 | 20.91 | — | — | — |
| SAG | ResNet-50-FPN | 18.26 | 13.40 | 19.71 | — | — | — |
| DRG | ResNet-50-FPN | 19.26 | 17.74 | 19.71 | 23.40 | 21.75 | 23.89 |
| TIN | ResNet-50 | 17.03 | 13.42 | 18.11 | 19.17 | 15.51 | 20.26 |
| PMFNet | ResNet-50-FPN | 17.46 | 15.65 | 18.00 | 20.34 | 17.47 | 21.20 |
| RPNN | ResNet-50 | 17.35 | 12.78 | 18.71 | — | — | — |
| PFNet | ResNet-50 | 20.05 | 16.66 | 21.07 | 24.01 | 21.09 | 24.89 |
| MLCNet | ResNet-50-FPN | 17.95 | 16.62 | 18.35 | 22.28 | 20.73 | 22.74 |
| VS-GATs+PMN | ResNet-50-FPN | 21.21 | 17.60 | 22.29 | — | — | — |
| PPDM | Hourglass-104 | 21.73 | 13.78 | 24.10 | 24.58 | 16.65 | 26.84 |
| IP-Net | Hourglass-104 | 19.56 | 12.79 | 21.58 | 22.05 | 15.77 | 23.92 |
| UnionDet | ResNet-50-FPN | 17.58 | 11.72 | 19.33 | 19.76 | 14.68 | 21.27 |
| AS-Net | ResNet-50 | 28.87 | 24.25 | 30.25 | 31.74 | 27.07 | 33.14 |

Table 4 Results on HICO-DET data set (mAP/%)
[1] | CHAO Y W, WANG Z, HE Y, et al. HICO: a benchmark for recognizing human-object interactions in images[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Dec 11-18, 2015. Washington: IEEE Computer Society, 2015: 1017-1025. |
[2] | ZHOU Y Z. Investigation and application of human-object interaction detection algorithm[D]. Quanzhou: Huaqiao University, 2019. |
[3] | HUI W S, LI H J, CHEN M, et al. Robotic tactile recognition and adaptive grasping control based on CNN-LSTM[J]. Chinese Journal of Scientific Instrument, 2019, 40(1): 211-218. |
[4] | DALAL N, TRIGGS B. Histograms of oriented gradients for human detection[C]//Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, Jun 20-26, 2005. Washington: IEEE Computer Society, 2005: 886-893. |
[5] | LOWE D G. Distinctive image features from scale-invariant keypoints[J]. International Journal of Computer Vision, 2004, 60(2): 91-110. |
[6] | GUPTA A, DAVIS L S. Objects in action: an approach for combining action understanding and object perception[C]//Proceedings of the 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Minneapolis, Jun 17-22, 2007. Washington: IEEE Computer Society, 2007: 1-8. |
[7] | GUPTA A, KEMBHAVI A, DAVIS L S. Observing human-object interactions: using spatial and functional compatibility for recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(10): 1775-1789. |
[8] | YAO B, LI F F. Grouplet: a structured image representation for recognizing human and object interactions[C]//Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, Jun 13-18, 2010. Washington: IEEE Computer Society, 2010: 9-16. |
[9] | YAO B, LI F F. Modeling mutual context of object and human pose in human-object interaction activities[C]//Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, Jun 13-18, 2010. Washington: IEEE Computer Society, 2010: 17-24. |
[10] | YAO B, JIANG X, KHOSLA A, et al. Human action recognition by learning bases of action attributes and parts[C]//Proceedings of the 2011 IEEE International Conference on Computer Vision, Barcelona, Nov 6-13, 2011. Washington: IEEE Computer Society, 2011: 1331-1338. |
[11] | DELAITRE V, SIVIC J, LAPTEV I. Learning person-object interactions for action recognition in still images[C]//Proceedings of the 25th Annual Conference on Neural Information Processing Systems, Granada, Dec 12-14, 2011. Red Hook: Curran Associates, 2011: 1503-1511. |
[12] | DESAI C, RAMANAN D. Detecting actions, poses, and objects with relational phraselets[C]//LNCS 7575: Proceedings of the 12th European Conference on Computer Vision, Oct 7-13, 2012. Berlin, Heidelberg: Springer, 2012: 158-172. |
[13] | HU J F, ZHENG W S, LAI J, et al. Recognising human-object interaction via exemplar based modelling[C]//Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, Dec 1-8, 2013. Washington: IEEE Computer Society, 2013: 3144-3151. |
[14] | GUPTA S, MALIK J. Visual semantic role labeling[J]. arXiv: 1505.04474, 2015. |
[15] | CHAO Y W, LIU Y, LIU X, et al. Learning to detect human-object interactions[C]//Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision, Lake Tahoe, Mar 12-15, 2018. Washington: IEEE Computer Society, 2018: 381-389. |
[16] | REN S Q, HE K M, GIRSHICK R B, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. |
[17] | GIRSHICK R B, DONAHUE J, DARRELL T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of the 27th IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Columbus, Jun 23-28, 2014. Washington: IEEE Computer Society, 2014: 580-587. |
[18] | GIRSHICK R B. Fast R-CNN[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Dec 7-13, 2015. Washington: IEEE Computer Society, 2015: 1440-1448. |
[19] | SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks[C]//Proceedings of the 28th Annual Conference on Neural Information Processing Systems, Montreal, Dec 8-13, 2014. Red Hook: Curran Associates, 2014: 3104-3112. |
[20] | VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator[C]//Proceedings of the 2015 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Boston, Jun 7-12, 2015. Washington: IEEE Computer Society, 2015: 3156-3164. |
[21] | CHAN W, JAITLY N, LE Q, et al. Listen, attend and spell: a neural network for large vocabulary conversational speech recognition[C]//Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, Mar 20-25, 2016. Piscataway: IEEE, 2016: 4960-4964. |
[22] | GKIOXARI G, TOSHEV A, JAITLY N. Chained predictions using convolutional neural networks[C]//LNCS 9908: Proceedings of the 14th European Conference on Computer Vision, Oct 11-14, 2016. Cham: Springer, 2016: 728-743. |
[23] | GKIOXARI G, GIRSHICK R B, DOLLÁR P, et al. Detecting and recognizing human-object interactions[C]//Proceedings of the 2018 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Washington: IEEE Computer Society, 2018: 8359-8367. |
[24] | KOLESNIKOV A, KUZNETSOVA A, LAMPERT C H, et al. Detecting visual relationships using box attention[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop, Seoul, Oct 27-28, 2019. Piscataway: IEEE, 2019: 1749-1753. |
[25] | GAO C, ZOU Y, HUANG J B. iCAN: instance-centric attention network for human-object interaction detection[J]. arXiv:1808.10437, 2018. |
[26] | CHERON G, LAPTEV I, SCHMID C. P-CNN: pose-based CNN features for action recognition[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Dec 7-13, 2015. Washington: IEEE Computer Society, 2015: 3218-3226. |
[27] | MALLYA A, LAZEBNIK S. Learning models for actions and person-object interactions with transfer to question answering[C]//LNCS 9905: Proceedings of the 14th European Conference on Computer Vision, Amsterdam, Oct 11-14, 2016. Cham: Springer, 2016: 414-428. |
[28] | GKIOXARI G, GIRSHICK R B, MALIK J. Contextual action recognition with R*CNN[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Dec 7-13, 2015. Washington: IEEE Computer Society, 2015: 1080-1088. |
[29] | WANG T C, ANWER R M, KHAN M H, et al. Deep contextual attention for human-object interaction detection[C]//Proceedings of the 2019 IEEE International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway: IEEE, 2019: 5694-5702. |
[30] | GILMER J, SCHOENHOLZ S S, RILEY P F, et al. Neural message passing for quantum chemistry[C]//Proceedings of the 34th International Conference on Machine Learning, Sydney, Aug 6-11, 2017. New York: ACM, 2017: 1263-1272. |
[31] | JAIN A, ZAMIR A R, SAVARESE S, et al. Structural-RNN: deep learning on spatio-temporal graphs[C]//Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun 27-30, 2016. Washington: IEEE Computer Society, 2016: 5308-5317. |
[32] | LI R Y, TAPASWI M, LIAO R J, et al. Situation recognition with graph neural networks[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Oct 22-29, 2017. Washington: IEEE Computer Society, 2017: 4183-4192. |
[33] | MARINO K, SALAKHUTDINOV R, GUPTA A. The more you know: using knowledge graphs for image classification[J]. arXiv:1612.04844, 2016. |
[34] | XU D F, ZHU Y K, CHOY C B, et al. Scene graph generation by iterative message passing[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Washington: IEEE Computer Society, 2017: 3097-3106. |
[35] | LIANG X D, SHEN X H, FENG J S, et al. Semantic object parsing with graph LSTM[C]//LNCS 9905: Proceedings of the 14th European Conference on Computer Vision, Amsterdam, Oct 11-14, 2016. Cham: Springer, 2016: 125-143. |
[36] | YUAN Y, LIANG X D, WANG X L, et al. Temporal dynamic graph LSTM for action-driven video object detection[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Oct 22-29, 2017. Washington: IEEE Computer Society, 2017: 1819-1828. |
[37] | TENEY D, LIU L Q, VAN DEN HENGEL A. Graph-structured representations for visual question answering[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Washington: IEEE Computer Society, 2017: 3233-3241. |
[38] | QI S Y, WANG W G, JIA B X, et al. Learning human-object interactions by graph parsing neural networks[C]//LNCS 11213: Proceedings of the 15th European Conference on Computer Vision, Munich, Sep 8-14, 2018. Cham: Springer, 2018: 407-423. |
[39] | KOPPULA H S, SAXENA A. Anticipating human activities using object affordances for reactive robotic response[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 38(1): 14-29. |
[40] | WANG H, ZHENG W S, LING Y B. Contextual heterogeneous graph network for human-object interaction detection[C]//LNCS 12362: Proceedings of the 16th European Conference on Computer Vision, Glasgow, Aug 23-28, 2020. Cham: Springer, 2020: 248-264. |
[41] | WU W, LIU Z Y. Graph-based human-object interactions recognition[J]. Computer Engineering and Applications, 2021, 57(3): 175-181. |
[42] | LIANG Z J, ROJAS J, LIU J F, et al. Visual-semantic graph attention networks for human-object interaction detection[J]. arXiv:2001.02302, 2020. |
[43] | ULUTAN O, IFTEKHAR A S M, MANJUNATH B S. VSGNet: spatial attention network for detecting human object interactions using graph convolutions[C]//Proceedings of the 2020 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, Jun 13-19, 2020. Piscataway: IEEE, 2020: 13614-13623. |
[44] | ZHANG F Z, CAMPBELL D, GOULD S. Spatio-attentive graphs for human-object interaction detection[J]. arXiv: 2012.06060, 2020. |
[45] | GAO C, XU J R, ZOU Y L, et al. DRG: dual relation graph for human-object interaction detection[C]//LNCS 12357: Proceedings of the 16th European Conference on Computer Vision, Glasgow, Aug 23-28, 2020. Cham: Springer, 2020: 696-712. |
[46] | FANG H S, CAO J K, TAI Y W, et al. Pairwise body-part attention for recognizing human-object interactions[C]//LNCS 11214: Proceedings of the 15th European Conference on Computer Vision, Munich, Sep 8-14, 2018. Cham: Springer, 2018: 51-67. |
[47] | LI Y L, ZHOU S Y, HUANG X J, et al. Transferable interactiveness knowledge for human-object interaction detection[C]//Proceedings of the 2019 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, Jun 16-20, 2019. Piscataway: IEEE, 2019: 3585-3594. |
[48] | WAN B, ZHOU D S, LIU Y F, et al. Pose-aware multi-level feature network for human object interaction detection[C]//Proceedings of the 2019 IEEE International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway: IEEE, 2019: 9468-9477. |
[49] | ZHOU P H, CHI M M. Relation parsing neural network for human-object interaction detection[C]//Proceedings of the 2019 IEEE International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway: IEEE, 2019: 843-851. |
[50] | LIU H C, MU T J, HUANG X L. Detecting human-object interaction with multi-level pairwise feature network[J]. Computational Visual Media, 2021, 7(2): 229-239. |
[51] | SUN X, HU X W, REN T W, et al. Human object interaction detection via multi-level conditioned network[C]//Proceedings of the 2020 International Conference on Multimedia Retrieval, Dublin, Jun 8-11, 2020. New York: ACM, 2020: 26-34. |
[52] | LIANG Z J, LIU J F, GUAN Y S, et al. Pose-based modular network for human-object interaction detection[J]. arXiv: 2008.02042, 2020. |
[53] | LIAO Y, LIU S, WANG F, et al. PPDM: parallel point detection and matching for real-time human-object interaction detection[C]//Proceedings of the 2020 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, Jun 14-19, 2020. Washington: IEEE Computer Society, 2020: 479-487. |
[54] | WANG T, YANG T, MARTIN D, et al. Learning human-object interaction detection using interaction points[C]//Proceedings of the 2020 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, Jun 14-19, 2020. Washington: IEEE Computer Society, 2020: 4116-4125. |
[55] | KIM B, CHOI T, KANG J, et al. UnionDet: union-level detector towards real-time human-object interaction detection[C]//LNCS 12360: Proceedings of the 16th European Conference on Computer Vision, Glasgow, Aug 23-28, 2020. Cham: Springer, 2020: 498-514. |
[56] | LIU W, ANGUELOV D, ERHAN D, et al. SSD: single shot multibox detector[C]//Proceedings of the 14th European Conference on Computer Vision, Amsterdam, Oct 11-14, 2016. Cham: Springer, 2016: 21-37. |
[57] | LIN T Y, GOYAL P, GIRSHICK R, et al. Focal loss for dense object detection[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Oct 22-29, 2017. Washington: IEEE Computer Society, 2017: 2999-3007. |
[58] | ZHOU P, NI B, GENG C, et al. Scale-transferrable object detection[C]//Proceedings of the 2018 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Washington: IEEE Computer Society, 2018: 528-537. |
[59] | CHEN M, LIAO Y, LIU S, et al. Reformulating HOI detection as adaptive set prediction[J]. arXiv:2103.05983, 2021. |
[60] | LIN T Y, MAIRE M, BELONGIE S J, et al. Microsoft COCO: common objects in context[C]//LNCS 8693: Proceedings of the 13th European Conference on Computer Vision, Zurich, Sep 6-12, 2014. Cham: Springer, 2014: 740-755. |
[61] | HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun 27-30, 2016. Washington: IEEE Computer Society, 2016: 770-778. |
[62] | LIN T Y, DOLLÁR P, GIRSHICK R B, et al. Feature pyramid networks for object detection[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Washington: IEEE Computer Society, 2017: 936-944. |
[63] | DAI J F, QI H Z, XIONG Y W, et al. Deformable convolutional networks[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Oct 22-29, 2017. Washington: IEEE Computer Society, 2017: 764-773. |
[64] | JIA Y Q, SHELHAMER E, DONAHUE J, et al. Caffe: convolutional architecture for fast feature embedding[C]//Proceedings of the 2014 ACM Conference on Multimedia, Orlando, Nov 3-7, 2014. New York: ACM, 2014: 675-678. |
[65] | NEWELL A, YANG K Y, JIA D. Stacked hourglass networks for human pose estimation[C]//LNCS 9912: Proceedings of the 14th European Conference on Computer Vision, Amsterdam, Oct 11-14, 2016. Cham: Springer, 2016: 483-499. |
[66] | SHEN L Y, YEUNG S, HOFFMAN J, et al. Scaling human-object interaction recognition through zero-shot learning[C]//Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision, Lake Tahoe, Mar 12-15, 2018. Washington: IEEE Computer Society, 2018: 1568-1576. |
[67] | JI Z, LIU X Y, PANG Y W, et al. Few-shot human-object interaction recognition with semantic-guided attentive prototypes network[J]. IEEE Transactions on Image Processing, 2020, 30: 1648-1661. |
[68] | LIU X Y, JI Z, PANG Y W, et al. DGIG-Net: dynamic graph-in-graph networks for few-shot human-object interaction[J]. IEEE Transactions on Cybernetics, 2021: 1-13. DOI: 10.1109/TCYB.2021.3049537. |