Journal of Frontiers of Computer Science and Technology ›› 2022, Vol. 16 ›› Issue (2): 323-336. DOI: 10.3778/j.issn.1673-9418.2106004
RUAN Chenzhao, ZHANG Xiangsen, LIU Ke, ZHAO Zengshun+
+ Corresponding author, E-mail: zhaozengshun@163.com
Received: 2021-06-01
Revised: 2021-08-06
Online: 2022-02-01
Published: 2021-08-19
About author: RUAN Chenzhao, born in 1996 in Zibo, Shandong, M.S. candidate. His research interests include computer vision and image processing.
Abstract: Human-object interaction (HOI) detection takes an image as input and detects the humans and objects in it that interact, together with the interaction verbs between them. Following object detection, image segmentation and object tracking, it is a further task in computer vision, aimed at a deeper understanding of images. To fill the current gap in survey articles on deep-learning-based HOI detection, this paper classifies and analyzes deep-learning-based HOI detection methods along the main line of their development. It first briefly summarizes early techniques, then divides existing algorithms by model structure into two-stage and one-stage methods and analyzes representative algorithms of each kind. Two-stage methods are further discussed in three categories: those incorporating attention, graph models, and pose and body parts, summarizing the basic idea, advantages and drawbacks of each category. In addition, the evaluation metrics, benchmark datasets and experimental results of most existing methods are presented in detail, and the results achieved by each category of methods are explained. Finally, the main challenges facing the technique are summarized and future development trends are discussed.
RUAN Chenzhao, ZHANG Xiangsen, LIU Ke, ZHAO Zengshun. Progress on Human-Object Interaction Detection of Deep Learning[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(2): 323-336.
Category | Subcategory | Representative work | Advantages | Limitations | Applicable scenarios
---|---|---|---|---|---
Two-stage | Attention-based | BAR-CNN[24], iCAN[25], Wang[29] | Extracts contextual information effectively; accuracy greatly improved over HO-RCNN | No information beyond visual and spatial cues is introduced; accuracy still limited | Scenarios with ample training samples and high compute power, where real-time performance is less critical
 | Graph-model-based | GPNN[38], Wu[41], VS-GATs[42], VSGNet[43], SAG[44], DRG[45] | Predicts all interaction pairs in an image at once; resolves pairing ambiguity | Rarely introduces information beyond visual and spatial cues to help build the graph; demanding on hardware |
 | Pose/body-part-based | TIN[47], PMFNet[48], RPNN[49], PFNet[50], MLCNet[51], PMN[52] | Integrates human pose or body-part information effectively; relatively high accuracy | Computationally heavy and time-consuming; demanding on hardware |
One-stage | — | PPDM[53], IP-Net[54], UnionDet[55], AS-Net[59] | Fast detection, high accuracy, easy to deploy | Model construction and training are relatively complex | Scenarios demanding real-time performance and high accuracy
Table 1 Comparison of different HOI detection methods
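The two-stage paradigm summarized in Table 1 can be illustrated with a minimal sketch: stage one detects instances, stage two enumerates human-object pairs and scores an interaction verb for each pair. All function names and the toy outputs below are hypothetical, a conceptual illustration rather than any cited paper's implementation.

```python
from itertools import product

def detect_instances(image):
    # Stage 1: an off-the-shelf detector (e.g. Faster R-CNN) would run here.
    # Dummy output for illustration: (label, box, score) triples.
    return [("person", (10, 10, 50, 120), 0.98),
            ("bicycle", (40, 60, 110, 130), 0.91)]

def score_interaction(human_box, object_box):
    # Stage 2: a real model scores verbs from visual/spatial features;
    # here we just return a fixed toy distribution over two verbs.
    return {"ride": 0.8, "hold": 0.1}

def detect_hoi(image, verb_threshold=0.5):
    dets = detect_instances(image)
    humans = [d for d in dets if d[0] == "person"]
    objects = [d for d in dets if d[0] != "person"]
    triplets = []
    # Exhaustive human-object pairing is what makes two-stage methods
    # accurate but relatively slow compared with one-stage methods.
    for (_, hbox, hscore), (olabel, obox, oscore) in product(humans, objects):
        for verb, vscore in score_interaction(hbox, obox).items():
            if vscore >= verb_threshold:
                # The triplet score is typically the product of the
                # human, object and verb confidences.
                triplets.append((hbox, verb, olabel, obox,
                                 hscore * oscore * vscore))
    return triplets

print(detect_hoi(None))
```

With the dummy detector above, only the "ride" verb clears the threshold, so a single (human, ride, bicycle) triplet is returned.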
分类 | 子类 | 代表性工作 | 优点 | 局限 | 适用场景 |
---|---|---|---|---|---|
两阶段方法 | 融入注意力 | BAR-CNN[ ICAN[ Wang[ | 能够有效提取上下文信息,准确率相较HO-RCNN有很大提高 | 除视觉信息和空间信息外并没有额外信息的引入,准确率有待提高 | 适用于训练样本充足,硬件算力高,对实时性要求较低的场景 |
融入图模型 | GPNN[ Wu[ VS-GATs[ VSGNet [ SAG [ DRG [ | 可同时预测图像中的所有交互对,能够消除配对歧义 | 鲜有视觉和空间信息外的额外信息的引入来帮助构建图模型,对硬件要求高 | ||
融入身体部位和姿态 | TIN[ PMFNet[ RPNN[ PFNet[ MLCNet[ PMN[ | 有效整合人的身体姿势或身体部分信息,准确率相对较高 | 计算量大且费时,对硬件要求高 | ||
一阶段方法 | — | PPDM[ IP-Net[ UnionDet[ AS-Net[ | 检测速度快,准确率高,易于部署 | 模型的构建与训练较为复杂 | 适用于对实时性、准确率要求较高的场景 |
Ground truth | Predicted Positive | Predicted Negative
---|---|---
True | TP (true positive) | FN (false negative)
False | FP (false positive) | TN (true negative)
Table 2 Confusion matrix
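Precision and recall, the quantities from which the mAP metric used throughout the result tables is built, follow directly from the confusion-matrix counts above. A minimal sketch:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy counts: 8 correct detections, 2 false alarms, 8 missed instances.
p, r = precision_recall(tp=8, fp=2, fn=8)
print(p, r)  # 0.8 0.5
```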
Method | Backbone | mAP/%
---|---|---
Gupta[14] | ResNet-50-FPN | 31.8
InteractNet[23] | ResNet-50-FPN | 40.0
BAR-CNN[24] | Inception-ResNet | 41.1
iCAN[25] | ResNet-50 | 45.3
Wang[29] | ResNet-50 | 47.3
GPNN[38] | Res-DCN-152 | 44.0
Wang[40] | ResNet-50-FPN | 52.7
Wu[41] | VGG-16 | 44.6
VS-GATs[42] | ResNet-50-FPN | 50.6
VSGNet[43] | ResNet-152 | 51.8
DRG[45] | ResNet-50-FPN | 51.0
TIN[47] | ResNet-50 | 47.8
PMFNet[48] | ResNet-50-FPN | 52.0
RPNN[49] | ResNet-50 | 47.5
PFNet[50] | ResNet-50 | 52.8
MLCNet[51] | ResNet-50-FPN | 55.2
VS-GATs+PMN[52] | ResNet-50-FPN | 51.8
IP-Net[54] | Hourglass-104 | 51.0
UnionDet[55] | ResNet-50-FPN | 47.5
AS-Net[59] | ResNet-50 | 53.9
Table 3 Results on V-COCO dataset
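The mAP figures above average a per-class average precision (AP) over interaction classes. As a sketch of how one class's AP arises from a score-ranked detection list, the simplified, non-interpolated variant below integrates precision over recall (benchmark toolkits typically add precision interpolation on top of this):

```python
def average_precision(ranked_hits, num_gt):
    """AP for one class. ranked_hits: detections sorted by descending
    confidence, True where a detection matched a ground-truth instance.
    num_gt: number of ground-truth instances of this class."""
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for hit in ranked_hits:
        if hit:
            tp += 1
            recall = tp / num_gt
            precision = tp / (tp + fp)
            # Accumulate area under the precision-recall curve.
            ap += (recall - prev_recall) * precision
            prev_recall = recall
        else:
            fp += 1
    return ap

# Three detections, two ground truths; hits at ranks 1 and 3.
print(average_precision([True, False, True], num_gt=2))
```

For the toy input, precision is 1.0 at recall 0.5 and 2/3 at recall 1.0, giving AP = 0.5 + 1/3 ≈ 0.833.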
Method | Backbone | Default full | Default rare | Default non-rare | Known Object full | Known Object rare | Known Object non-rare
---|---|---|---|---|---|---|---
HO-RCNN[15] | CaffeNet | 7.81 | 5.37 | 8.54 | 10.41 | 8.94 | 10.85
InteractNet[23] | ResNet-50-FPN | 9.94 | 7.16 | 10.77 | — | — | —
iCAN[25] | ResNet-50 | 14.84 | 10.45 | 16.15 | 16.26 | 11.33 | 17.73
Wang[29] | ResNet-50 | 16.24 | 11.16 | 17.75 | 17.33 | 12.78 | 19.21
GPNN[38] | Res-DCN-152 | 13.11 | 9.34 | 14.23 | — | — | —
Wang[40] | ResNet-50-FPN | 17.57 | 16.85 | 17.78 | 21.00 | 20.74 | 21.08
Wu[41] | VGG-16 | 13.55 | 9.62 | 15.20 | — | — | —
VS-GATs[42] | ResNet-50-FPN | 20.27 | 16.03 | 21.54 | — | — | —
VSGNet[43] | ResNet-152 | 19.80 | 16.05 | 20.91 | — | — | —
SAG[44] | ResNet-50-FPN | 18.26 | 13.40 | 19.71 | — | — | —
DRG[45] | ResNet-50-FPN | 19.26 | 17.74 | 19.71 | 23.40 | 21.75 | 23.89
TIN[47] | ResNet-50 | 17.03 | 13.42 | 18.11 | 19.17 | 15.51 | 20.26
PMFNet[48] | ResNet-50-FPN | 17.46 | 15.65 | 18.00 | 20.34 | 17.47 | 21.20
RPNN[49] | ResNet-50 | 17.35 | 12.78 | 18.71 | — | — | —
PFNet[50] | ResNet-50 | 20.05 | 16.66 | 21.07 | 24.01 | 21.09 | 24.89
MLCNet[51] | ResNet-50-FPN | 17.95 | 16.62 | 18.35 | 22.28 | 20.73 | 22.74
VS-GATs+PMN[52] | ResNet-50-FPN | 21.21 | 17.60 | 22.29 | — | — | —
PPDM[53] | Hourglass-104 | 21.73 | 13.78 | 24.10 | 24.58 | 16.65 | 26.84
IP-Net[54] | Hourglass-104 | 19.56 | 12.79 | 21.58 | 22.05 | 15.77 | 23.92
UnionDet[55] | ResNet-50-FPN | 17.58 | 11.72 | 19.33 | 19.76 | 14.68 | 21.27
AS-Net[59] | ResNet-50 | 28.87 | 24.25 | 30.25 | 31.74 | 27.07 | 33.14
Table 4 Results on HICO-DET dataset (mAP/%)
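The mAP values in Tables 3 and 4 rest on the standard HOI matching rule: a predicted triplet counts as a true positive only when its interaction class is correct and both its human box and its object box overlap the corresponding ground-truth boxes with IoU ≥ 0.5. A minimal sketch of that rule:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def hoi_match(pred, gt, thresh=0.5):
    """True positive only if the verb matches and BOTH the human and the
    object box reach IoU >= thresh with their ground-truth counterparts."""
    (ph, pv, po), (gh, gv, go) = pred, gt
    return pv == gv and iou(ph, gh) >= thresh and iou(po, go) >= thresh

pred = ((0, 0, 10, 10), "ride", (5, 5, 15, 15))
gt = ((0, 0, 10, 10), "ride", (5, 5, 15, 15))
print(hoi_match(pred, gt))  # True
```

Because both boxes must overlap, a prediction with a perfect human box but a badly localized object box is still a false positive, which is one reason HICO-DET scores are much lower than plain object-detection scores.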
[1] CHAO Y W, WANG Z, HE Y, et al. HICO: a benchmark for recognizing human-object interactions in images[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Dec 11-18, 2015. Washington: IEEE Computer Society, 2015: 1017-1025.
[2] ZHOU Y Z. Investigation and application of human-object interaction detection algorithm[D]. Quanzhou: Huaqiao University, 2019.
[3] HUI W S, LI H J, CHEN M, et al. Robotic tactile recognition and adaptive grasping control based on CNN-LSTM[J]. Chinese Journal of Scientific Instrument, 2019, 40(1): 211-218.
[4] DALAL N, TRIGGS B. Histograms of oriented gradients for human detection[C]//Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, Jun 20-26, 2005. Washington: IEEE Computer Society, 2005: 886-893.
[5] LOWE D G. Distinctive image features from scale-invariant keypoints[J]. International Journal of Computer Vision, 2004, 60(2): 91-110.
[6] GUPTA A, DAVIS L S. Objects in action: an approach for combining action understanding and object perception[C]//Proceedings of the 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Minneapolis, Jun 17-22, 2007. Washington: IEEE Computer Society, 2007: 1-8.
[7] GUPTA A, KEMBHAVI A, DAVIS L S. Observing human-object interactions: using spatial and functional compatibility for recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(10): 1775-1789.
[8] YAO B, LI F F. Grouplet: a structured image representation for recognizing human and object interactions[C]//Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, Jun 13-18, 2010. Washington: IEEE Computer Society, 2010: 9-16.
[9] YAO B, LI F F. Modeling mutual context of object and human pose in human-object interaction activities[C]//Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, Jun 13-18, 2010. Washington: IEEE Computer Society, 2010: 17-24.
[10] YAO B, JIANG X, KHOSLA A, et al. Human action recognition by learning bases of action attributes and parts[C]//Proceedings of the 2011 IEEE International Conference on Computer Vision, Barcelona, Nov 6-13, 2011. Washington: IEEE Computer Society, 2011: 1331-1338.
[11] DELAITRE V, SIVIC J, LAPTEV I. Learning person-object interactions for action recognition in still images[C]//Proceedings of the 25th Annual Conference on Neural Information Processing Systems, Granada, Dec 12-14, 2011. Red Hook: Curran Associates, 2011: 1503-1511.
[12] DESAI C, RAMANAN D. Detecting actions, poses, and objects with relational phraselets[C]//LNCS 7575: Proceedings of the 12th European Conference on Computer Vision, Oct 7-13, 2012. Berlin, Heidelberg: Springer, 2012: 158-172.
[13] HU J F, ZHENG W S, LAI J, et al. Recognising human-object interaction via exemplar based modelling[C]//Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, Dec 1-8, 2013. Washington: IEEE Computer Society, 2013: 3144-3151.
[14] GUPTA S, MALIK J. Visual semantic role labeling[J]. arXiv:1505.04474, 2015.
[15] CHAO Y W, LIU Y, LIU X, et al. Learning to detect human-object interactions[C]//Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision, Lake Tahoe, Mar 12-15, 2018. Washington: IEEE Computer Society, 2018: 381-389.
[16] REN S Q, HE K M, GIRSHICK R B, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
[17] GIRSHICK R B, DONAHUE J, DARRELL T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of the 27th IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Columbus, Jun 23-28, 2014. Washington: IEEE Computer Society, 2014: 580-587.
[18] GIRSHICK R B. Fast R-CNN[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Dec 7-13, 2015. Washington: IEEE Computer Society, 2015: 1440-1448.
[19] SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks[C]//Proceedings of the 28th Annual Conference on Neural Information Processing Systems, Montreal, Dec 8-13, 2014. Red Hook: Curran Associates, 2014: 3104-3112.
[20] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator[C]//Proceedings of the 2015 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Boston, Jun 7-12, 2015. Washington: IEEE Computer Society, 2015: 3156-3164.
[21] CHAN W, JAITLY N, LE Q, et al. Listen, attend and spell: a neural network for large vocabulary conversational speech recognition[C]//Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, Mar 20-25, 2016. Piscataway: IEEE, 2016: 4960-4964.
[22] GKIOXARI G, TOSHEV A, JAITLY N. Chained predictions using convolutional neural networks[C]//LNCS 9908: Proceedings of the 14th European Conference on Computer Vision, Amsterdam, Oct 11-14, 2016. Cham: Springer, 2016: 728-743.
[23] GKIOXARI G, GIRSHICK R B, DOLLÁR P, et al. Detecting and recognizing human-object interactions[C]//Proceedings of the 2018 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Washington: IEEE Computer Society, 2018: 8359-8367.
[24] KOLESNIKOV A, KUZNETSOVA A, LAMPERT C H, et al. Detecting visual relationships using box attention[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop, Seoul, Oct 27-28, 2019. Piscataway: IEEE, 2019: 1749-1753.
[25] GAO C, ZOU Y, HUANG J B. iCAN: instance-centric attention network for human-object interaction detection[J]. arXiv:1808.10437, 2018.
[26] CHERON G, LAPTEV I, SCHMID C. P-CNN: pose-based CNN features for action recognition[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Dec 7-13, 2015. Washington: IEEE Computer Society, 2015: 3218-3226.
[27] MALLYA A, LAZEBNIK S. Learning models for actions and person-object interactions with transfer to question answering[C]//LNCS 9905: Proceedings of the 14th European Conference on Computer Vision, Amsterdam, Oct 11-14, 2016. Cham: Springer, 2016: 414-428.
[28] GKIOXARI G, GIRSHICK R B, MALIK J. Contextual action recognition with R*CNN[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Dec 7-13, 2015. Washington: IEEE Computer Society, 2015: 1080-1088.
[29] WANG T C, ANWER R M, KHAN M H, et al. Deep contextual attention for human-object interaction detection[C]//Proceedings of the 2019 IEEE International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway: IEEE, 2019: 5694-5702.
[30] GILMER J, SCHOENHOLZ S S, RILEY P F, et al. Neural message passing for quantum chemistry[C]//Proceedings of the 34th International Conference on Machine Learning, Sydney, Aug 6-11, 2017. New York: ACM, 2017: 1263-1272.
[31] JAIN A, ZAMIR A R, SAVARESE S, et al. Structural-RNN: deep learning on spatio-temporal graphs[C]//Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun 27-30, 2016. Washington: IEEE Computer Society, 2016: 5308-5317.
[32] LI R Y, TAPASWI M, LIAO R J, et al. Situation recognition with graph neural networks[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Oct 22-29, 2017. Washington: IEEE Computer Society, 2017: 4183-4192.
[33] MARINO K, SALAKHUTDINOV R, GUPTA A. The more you know: using knowledge graphs for image classification[J]. arXiv:1612.04844, 2016.
[34] XU D F, ZHU Y K, CHOY C B, et al. Scene graph generation by iterative message passing[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Washington: IEEE Computer Society, 2017: 3097-3106.
[35] LIANG X D, SHEN X H, FENG J S, et al. Semantic object parsing with graph LSTM[C]//LNCS 9905: Proceedings of the 14th European Conference on Computer Vision, Amsterdam, Oct 11-14, 2016. Cham: Springer, 2016: 125-143.
[36] YUAN Y, LIANG X D, WANG X L, et al. Temporal dynamic graph LSTM for action-driven video object detection[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Oct 22-29, 2017. Washington: IEEE Computer Society, 2017: 1819-1828.
[37] TENEY D, LIU L Q, VAN DEN HENGEL A. Graph-structured representations for visual question answering[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Washington: IEEE Computer Society, 2017: 3233-3241.
[38] QI S Y, WANG W G, JIA B X, et al. Learning human-object interactions by graph parsing neural networks[C]//LNCS 11213: Proceedings of the 15th European Conference on Computer Vision, Munich, Sep 8-14, 2018. Cham: Springer, 2018: 407-423.
[39] KOPPULA H S, SAXENA A. Anticipating human activities using object affordances for reactive robotic response[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 38(1): 14-29.
[40] WANG H, ZHENG W S, LING Y B. Contextual heterogeneous graph network for human-object interaction detection[C]//LNCS 12362: Proceedings of the 16th European Conference on Computer Vision, Glasgow, Aug 23-28, 2020. Cham: Springer, 2020: 248-264.
[41] WU W, LIU Z Y. Graph-based human-object interactions recognition[J]. Computer Engineering and Applications, 2021, 57(3): 175-181.
[42] LIANG Z J, ROJAS J, LIU J F, et al. Visual-semantic graph attention networks for human-object interaction detection[J]. arXiv:2001.02302, 2020.
[43] ULUTAN O, IFTEKHAR A S M, MANJUNATH B S. VSGNet: spatial attention network for detecting human object interactions using graph convolutions[C]//Proceedings of the 2020 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, Jun 13-19, 2020. Piscataway: IEEE, 2020: 13614-13623.
[44] ZHANG F Z, CAMPBELL D, GOULD S. Spatio-attentive graphs for human-object interaction detection[J]. arXiv:2012.06060, 2020.
[45] GAO C, XU J R, ZOU Y L, et al. DRG: dual relation graph for human-object interaction detection[C]//LNCS 12357: Proceedings of the 16th European Conference on Computer Vision, Glasgow, Aug 23-28, 2020. Cham: Springer, 2020: 696-712.
[46] FANG H S, CAO J K, TAI Y W, et al. Pairwise body-part attention for recognizing human-object interactions[C]//LNCS 11214: Proceedings of the 15th European Conference on Computer Vision, Munich, Sep 8-14, 2018. Cham: Springer, 2018: 51-67.
[47] LI Y L, ZHOU S Y, HUANG X J, et al. Transferable interactiveness knowledge for human-object interaction detection[C]//Proceedings of the 2019 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, Jun 16-20, 2019. Piscataway: IEEE, 2019: 3585-3594.
[48] WAN B, ZHOU D S, LIU Y F, et al. Pose-aware multi-level feature network for human object interaction detection[C]//Proceedings of the 2019 IEEE International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway: IEEE, 2019: 9468-9477.
[49] ZHOU P H, CHI M M. Relation parsing neural network for human-object interaction detection[C]//Proceedings of the 2019 IEEE International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway: IEEE, 2019: 843-851.
[50] LIU H C, MU T J, HUANG X L. Detecting human-object interaction with multi-level pairwise feature network[J]. Computational Visual Media, 2021, 7(2): 229-239.
[51] SUN X, HU X W, REN T W, et al. Human object interaction detection via multi-level conditioned network[C]//Proceedings of the 2020 International Conference on Multimedia Retrieval, Dublin, Jun 8-11, 2020. New York: ACM, 2020: 26-34.
[52] LIANG Z J, LIU J F, GUAN Y S, et al. Pose-based modular network for human-object interaction detection[J]. arXiv:2008.02042, 2020.
[53] LIAO Y, LIU S, WANG F, et al. PPDM: parallel point detection and matching for real-time human-object interaction detection[C]//Proceedings of the 2020 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, Jun 14-19, 2020. Washington: IEEE Computer Society, 2020: 479-487.
[54] WANG T, YANG T, MARTIN D, et al. Learning human-object interaction detection using interaction points[C]//Proceedings of the 2020 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, Jun 14-19, 2020. Washington: IEEE Computer Society, 2020: 4116-4125.
[55] KIM B, CHOI T, KANG J, et al. UnionDet: union-level detector towards real-time human-object interaction detection[C]//LNCS 12360: Proceedings of the 16th European Conference on Computer Vision, Glasgow, Aug 23-28, 2020. Cham: Springer, 2020: 498-514.
[56] LIU W, ANGUELOV D, ERHAN D, et al. SSD: single shot multibox detector[C]//Proceedings of the 14th European Conference on Computer Vision, Amsterdam, Oct 11-14, 2016. Cham: Springer, 2016: 21-37.
[57] LIN T Y, GOYAL P, GIRSHICK R, et al. Focal loss for dense object detection[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Oct 22-29, 2017. Washington: IEEE Computer Society, 2017: 2999-3007.
[58] ZHOU P, NI B, GENG C, et al. Scale-transferrable object detection[C]//Proceedings of the 2018 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Washington: IEEE Computer Society, 2018: 528-537.
[59] CHEN M, LIAO Y, LIU S, et al. Reformulating HOI detection as adaptive set prediction[J]. arXiv:2103.05983, 2021.
[60] LIN T Y, MAIRE M, BELONGIE S J, et al. Microsoft COCO: common objects in context[C]//LNCS 8693: Proceedings of the 13th European Conference on Computer Vision, Zurich, Sep 6-12, 2014. Cham: Springer, 2014: 740-755.
[61] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun 27-30, 2016. Washington: IEEE Computer Society, 2016: 770-778.
[62] LIN T Y, DOLLÁR P, GIRSHICK R B, et al. Feature pyramid networks for object detection[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Washington: IEEE Computer Society, 2017: 936-944.
[63] DAI J F, QI H Z, XIONG Y W, et al. Deformable convolutional networks[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Oct 22-29, 2017. Washington: IEEE Computer Society, 2017: 764-773.
[64] JIA Y Q, SHELHAMER E, DONAHUE J, et al. Caffe: convolutional architecture for fast feature embedding[C]//Proceedings of the 2014 ACM Conference on Multimedia, Orlando, Nov 3-7, 2014. New York: ACM, 2014: 675-678.
[65] NEWELL A, YANG K Y, JIA D. Stacked hourglass networks for human pose estimation[C]//LNCS 9912: Proceedings of the 14th European Conference on Computer Vision, Amsterdam, Oct 11-14, 2016. Cham: Springer, 2016: 483-499.
[66] SHEN L Y, YEUNG S, HOFFMAN J, et al. Scaling human-object interaction recognition through zero-shot learning[C]//Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision, Lake Tahoe, Mar 12-15, 2018. Washington: IEEE Computer Society, 2018: 1568-1576.
[67] JI Z, LIU X Y, PANG Y W, et al. Few-shot human-object interaction recognition with semantic-guided attentive prototypes network[J]. IEEE Transactions on Image Processing, 2020, 30: 1648-1661.
[68] LIU X Y, JI Z, PANG Y W, et al. DGIG-Net: dynamic graph-in-graph networks for few-shot human-object interaction[J]. IEEE Transactions on Cybernetics, 2021: 1-13. DOI: 10.1109/TCYB.2021.3049537.