Journal of Frontiers of Computer Science and Technology (计算机科学与探索) ›› 2021, Vol. 15 ›› Issue (11): 2184-2192. DOI: 10.3778/j.issn.1673-9418.2008027

• Artificial Intelligence •

Task-Aware Dual Prototypical Network for Few-Shot Human-Object Interaction Recognition

AN Ping, JI Zhong, LIU Xiyao   

  1. School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China
  • Online: 2021-11-01  Published: 2021-11-09

Abstract:

Recognizing human-object interaction (HOI) is an important research topic in computer vision. With the great success of deep learning in image classification, HOI recognition has also made great progress, but sample imbalance and combinatorial explosion remain key challenges that restrict the performance of current HOI recognition methods. Therefore, this paper formulates HOI recognition as a few-shot task and proposes a task-aware dual prototypical network (TDP-Net) to address it. Specifically, a graph-based method first generates a semantic-aware task representation for each task as its prior knowledge, and a semantic graph attention module (SGA-Module) then produces attention weights that emphasize different regions of the feature map to different degrees. This adapts the mapping to the conditions of each task and enables automatic reasoning on novel tasks. In addition, a dual prototype module (DP-Module) is designed to generate action class prototypes and object class prototypes, which classify the verb and noun labels respectively. Building separate class prototypes for actions and objects effectively disentangles their complex visual relationships, and because related interaction categories are similar, knowledge can be transferred to new interactions by recombining action and object categories. Experimental results show that the average accuracy of the model on the few-shot HOI task exceeds the baseline by 3.2 percentage points and 15.7 percentage points under two experimental settings, verifying the effectiveness of TDP-Net on the few-shot HOI task.
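To make the dual-prototype idea described above concrete, the following is a minimal sketch of episodic classification with separate action and object prototypes, in the spirit of standard prototypical networks. It is not the authors' implementation; all names (build_prototypes, dual_prototype_classify, embed_dim, the 64-dimensional toy features) are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): in each few-shot episode, action and object
# prototypes are the mean support embedding of each class, and a query is scored by
# negative Euclidean distance to each prototype, with verb and noun classified separately.
import torch
import torch.nn.functional as F


def build_prototypes(support_feats: torch.Tensor, support_labels: torch.Tensor,
                     num_classes: int) -> torch.Tensor:
    """Average the support embeddings of each class into one prototype.

    support_feats:  [num_support, embed_dim]
    support_labels: [num_support] integer class ids in [0, num_classes)
    returns:        [num_classes, embed_dim]
    """
    protos = torch.zeros(num_classes, support_feats.size(1))
    for c in range(num_classes):
        protos[c] = support_feats[support_labels == c].mean(dim=0)
    return protos


def dual_prototype_classify(query_feats, action_protos, object_protos):
    """Score queries against action and object prototypes independently."""
    verb_logits = -torch.cdist(query_feats, action_protos)   # [num_query, num_actions]
    noun_logits = -torch.cdist(query_feats, object_protos)   # [num_query, num_objects]
    return F.log_softmax(verb_logits, dim=-1), F.log_softmax(noun_logits, dim=-1)


# Toy usage: a 2-way action / 3-way object episode with 5 support images.
feats = torch.randn(5, 64)
action_labels = torch.tensor([0, 0, 1, 1, 1])
object_labels = torch.tensor([0, 1, 1, 2, 2])
a_protos = build_prototypes(feats, action_labels, num_classes=2)
o_protos = build_prototypes(feats, object_labels, num_classes=3)
verb_lp, noun_lp = dual_prototype_classify(torch.randn(4, 64), a_protos, o_protos)
```

Keeping the verb and noun branches separate is what allows an unseen verb-noun combination to be scored by recombining prototypes learned from other interactions, as the abstract notes.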

Key words: computer vision, image classification, human-object interaction (HOI), few-shot learning (FSL), attention mechanism