Skeleton Based Action Recognition Algorithm on Multi-modal Lightweight Graph Convolutional Network

doi:10.3778/j.issn.1673-9418.2008051

Abstract

Compared with traditional RGB-based methods, skeleton-based action recognition has become one of the main research directions in computer vision in recent years because it is far less affected by factors such as illumination, viewing angle, and background complexity. However, current skeleton-based methods still suffer from large parameter counts, long running times, and high computational complexity, which makes it difficult to meet the requirements of efficiency and accuracy simultaneously. To address these issues, a lightweight graph convolutional network using multi-modal data fusion is proposed. First, the data from multiple information streams are fused by a multi-modal data fusion method. Second, the spatial and temporal information of the human joints is extracted by a spatial module and a temporal module, respectively. Finally, the classification results are obtained through a fully connected layer. Experimental results on two commonly used datasets, NTU60 RGB+D and NTU120 RGB+D, show that the proposed network outperforms several mainstream methods from the last two years in both recognition accuracy and efficiency, verifying that the network achieves excellent accuracy while keeping running time and computational cost low.

Key words: action recognition, human skeleton, lightweight, graph convolutional network
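
The three stages named in the abstract (multi-modal fusion, spatial and temporal feature extraction, and a fully connected classifier) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the choice of joint, bone, and motion streams as the modalities, fusion by channel concatenation, the 5-joint toy skeleton in `EDGES`, the single graph convolution and fixed temporal filter, the layer sizes, and all function names. The weights are random, so the output only shows how data flows through the stages.

```python
import numpy as np

# Hypothetical 5-joint toy skeleton as (child, parent) pairs; joint 0 is the root.
EDGES = [(1, 0), (2, 1), (3, 1), (4, 1)]
NUM_JOINTS = 5


def fuse_modalities(joints):
    """Assumed fusion: concatenate joint, bone, and motion streams on channels.

    joints: (T, V, C) array with T frames, V joints, C coordinates.
    Returns a (T, V, 3 * C) array.
    """
    bones = np.zeros_like(joints)
    for child, parent in EDGES:
        bones[:, child] = joints[:, child] - joints[:, parent]
    motion = np.zeros_like(joints)
    motion[1:] = joints[1:] - joints[:-1]  # frame-to-frame displacement
    return np.concatenate([joints, bones, motion], axis=-1)


def normalized_adjacency(num_joints, edges):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} of the skeleton graph."""
    a = np.eye(num_joints)
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def spatial_module(x, adj, weight):
    """One graph convolution over the joints: ReLU(A_hat X W)."""
    # x: (T, V, C_in), adj: (V, V), weight: (C_in, C_out)
    out = np.einsum("uv,tvc->tuc", adj, x) @ weight
    return np.maximum(out, 0.0)


def temporal_module(x, kernel):
    """Per-channel 1-D filter along the time axis with zero padding."""
    # x: (T, V, C), kernel: (K,) with K odd
    pad = len(kernel) // 2
    padded = np.pad(x, ((pad, pad), (0, 0), (0, 0)))
    out = np.zeros_like(x)
    for k, w in enumerate(kernel):
        out += w * padded[k:k + x.shape[0]]
    return out


def init_params(in_channels=3, hidden=16, num_classes=60, seed=0):
    """Random, untrained parameters; sizes are illustrative only."""
    rng = np.random.default_rng(seed)
    return {
        "adj": normalized_adjacency(NUM_JOINTS, EDGES),
        "w_spatial": rng.normal(scale=0.1, size=(3 * in_channels, hidden)),
        "k_temporal": np.array([0.25, 0.5, 0.25]),
        "w_fc": rng.normal(scale=0.1, size=(hidden, num_classes)),
        "b_fc": np.zeros(num_classes),
    }


def classify(joints, params):
    """Fusion -> spatial module -> temporal module -> pooling -> FC logits."""
    x = fuse_modalities(joints)
    x = spatial_module(x, params["adj"], params["w_spatial"])
    x = temporal_module(x, params["k_temporal"])
    pooled = x.mean(axis=(0, 1))  # average over frames and joints
    return pooled @ params["w_fc"] + params["b_fc"]
```

Calling `classify(joints, init_params())` on a `(T, 5, 3)` array returns one logit per class. The default of 60 classes matches the number of action classes in NTU60 RGB+D; a real model would use the dataset's 25-joint skeleton and learned weights.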

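Abstract (Chinese version):

Compared with traditional action recognition based on RGB video, action recognition based on the human skeleton is only slightly affected by factors such as illumination, viewing angle, and background complexity, which has made it one of the main research directions in computer vision in recent years. However, the current mainstream skeleton-based action recognition methods all suffer, to a greater or lesser degree, from excessive parameter counts, long running times, and high computational complexity, so they struggle to satisfy the requirements of timeliness and accuracy at the same time. To address these problems, a lightweight graph convolutional neural network that fuses multi-modal data is proposed. First, the data from multiple information streams are fused by a multi-modal data fusion method; second, a spatial stream module and a temporal stream module respectively extract the spatial and temporal information of the fused data; finally, a fully connected layer produces the final classification result. Test results on the action recognition datasets NTU60 RGB+D and NTU120 RGB+D show that the network not only surpasses several mainstream methods from the last two years in recognition accuracy but also has far fewer parameters than other mainstream methods, verifying that the network achieves excellent accuracy while keeping running time and computational cost in check.

Key words (Chinese version): action recognition, human skeleton, lightweight, graph convolution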

SU Jiangyi, SONG Xiaoning, WU Xiaojun, YU Dongjun. Skeleton Based Action Recognition Algorithm on Multi-modal Lightweight Graph Convolutional Network[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(4): 733-742.

苏江毅, 宋晓宁, 吴小俊, 於东军. 多模态轻量级图卷积人体骨架行为识别方法[J]. 计算机科学与探索, 2021, 15(4): 733-742.
