Journal of Frontiers of Computer Science and Technology ›› 2023, Vol. 17 ›› Issue (3): 719-732. DOI: 10.3778/j.issn.1673-9418.2106102

• Artificial Intelligence · Pattern Recognition •

Spatial Multiple-Temporal Graph Convolutional Neural Network for Human Action Recognition

ZHAO Dengge, ZHI Min   

  1. College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot 010000, China
  • Online: 2023-03-01  Published: 2023-03-01

Abstract: The spatial-temporal graph convolutional neural network (ST-GCN) for skeleton-based human action recognition uses a single, fixed temporal convolution structure, which makes it difficult to fully extract all the important stage features required by each action class. To address this issue, a temporal graph convolution layer containing convolution kernels of multiple scales and multiple architectures is proposed, and a spatial multiple-temporal graph convolutional neural network (SMT-GCN) is built, in which different temporal graph convolution operations extract and fuse temporal trajectory features of different scales. Furthermore, to strengthen long-range dependencies between human joints and spatial structural features, a Transformer-Resnet (Tran-Res) module and a lightweight attention module (convolutional block attention module, CBAM) are embedded into SMT-GCN, yielding the spatial-attentive multiple-temporal graph convolutional neural network (SAMT-GCN). Experiments are performed on the NTU RGB+D and HDM05 datasets, and both the proposed SMT-GCN and SAMT-GCN improve recognition accuracy. In addition, the proposed multiple-temporal graph convolution module can be integrated into other baseline networks and improves their performance. Ablation experiments are further designed to explore the influence of convolution kernel scale and structure on the algorithm. The results show that SAMT-GCN with convolution kernel sizes of 1, 5 and 9 performs best, and that the network with a dense structure achieves higher recognition accuracy than networks with serial or parallel structures.

Key words: human action recognition, spatial-temporal graph convolutional neural network (ST-GCN), multiple-temporal graph convolution, Transformer-Resnet (Tran-Res), lightweight attention module
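
The following is a minimal PyTorch sketch, written from the abstract alone, of a multi-scale temporal graph convolution layer with kernel sizes 1, 5 and 9 whose branch outputs are fused in a dense fashion. The class name MultiScaleTemporalConv, the dense concatenation scheme, and the residual connection are illustrative assumptions and do not reproduce the authors' SAMT-GCN implementation.

# A minimal sketch of a multi-scale temporal convolution layer operating on
# skeleton feature maps shaped (N, C, T, V): batch, channels, frames, joints.
# Branches use temporal kernel sizes 1, 5 and 9; each branch also receives the
# outputs of the previous branches (a dense-style connection), and a 1x1
# convolution fuses all branches. These design details are assumptions made
# for illustration, not the authors' released code.
import torch
import torch.nn as nn


class MultiScaleTemporalConv(nn.Module):
    def __init__(self, channels, kernel_sizes=(1, 5, 9)):
        super().__init__()
        self.branches = nn.ModuleList()
        for i, k in enumerate(kernel_sizes):
            in_ch = channels * (i + 1)  # dense connectivity: input plus all previous branch outputs
            self.branches.append(nn.Sequential(
                nn.Conv2d(in_ch, channels, kernel_size=(k, 1),
                          padding=(k // 2, 0)),   # convolve along the time axis only
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            ))
        # 1x1 convolution to fuse the concatenated branch outputs
        self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, kernel_size=1)

    def forward(self, x):
        feats = []
        inp = x
        for branch in self.branches:
            out = branch(inp)
            feats.append(out)
            inp = torch.cat([x] + feats, dim=1)  # dense input for the next branch
        return self.fuse(torch.cat(feats, dim=1)) + x  # residual connection


if __name__ == "__main__":
    x = torch.randn(2, 64, 300, 25)       # e.g. NTU RGB+D skeletons: 25 joints, 300 frames
    y = MultiScaleTemporalConv(64)(x)
    print(y.shape)                         # torch.Size([2, 64, 300, 25])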