Domain Adaptation Algorithm for 3D Human Pose Estimation with Spatial Attention and Position Optimization

doi:10.3778/j.issn.1673-9418.2307016

Abstract

Abstract: Existing 3D human pose estimators perform well on a single dataset, but the limited pose structure of their training data leaves them with insufficient generalization in cross-domain experiments. Existing methods compensate by increasing pose diversity, yet the new poses they generate often lack realism and validity, and the distribution of their global positions still differs significantly from that of the target dataset. To address these issues, a domain adaptation algorithm for 3D human pose estimation with spatial attention and global position optimization, based on a generative adversarial network (GAN), is proposed. The algorithm introduces a spatial node attention module that constrains the generator to produce more natural human poses, and combines it with a pose position correction module that drives the generated poses to align with the target data domain, thereby addressing both problems. In addition, to improve the stability of estimator training, an end-to-end stochastic hybrid training strategy is proposed so that the pose estimator learns from both new and old data. As a generative domain adaptation method, the algorithm can be applied efficiently to various two-stage 3D human pose estimators. Cross-scene and cross-dataset experiments show that the proposed algorithm achieves state-of-the-art performance on several benchmark datasets. On the 3DHP dataset, it improves the MPJPE and AUC metrics by 1.7% and 1.4%, respectively, over the best previous work, which verifies that the proposed algorithm effectively improves the generalization of 3D human pose estimators.

Key words: 3D human pose estimation, unsupervised domain adaptation, generative adversarial network (GAN), attention mechanism
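
For readers unfamiliar with the two metrics named in the abstract, the following is a minimal sketch of how MPJPE (mean per-joint position error) and AUC (area under the PCK curve) are commonly computed for a single pose. It is not the authors' evaluation code. The function names, the toy joint coordinates, and the threshold range (0 to 150 mm in 5 mm steps) are assumptions made for illustration; the abstract does not specify them, and any root alignment applied before evaluation is omitted here.

```python
import math


def joint_errors(pred, gt):
    """Euclidean distance between each predicted joint and its ground truth."""
    return [math.dist(p, g) for p, g in zip(pred, gt)]


def mpjpe(pred, gt):
    """Mean per-joint position error: the average of the per-joint distances."""
    errors = joint_errors(pred, gt)
    return sum(errors) / len(errors)


def pck_auc(pred, gt, max_threshold=150.0, step=5.0):
    """Area under the PCK curve, approximated as the mean PCK over thresholds.

    ASSUMPTION: thresholds run from 0 to max_threshold (in mm) in equal steps.
    PCK at a threshold is the fraction of joints whose error is within it.
    """
    errors = joint_errors(pred, gt)
    n_steps = int(max_threshold / step)
    thresholds = [step * i for i in range(n_steps + 1)]
    pck = [sum(e <= t for e in errors) / len(errors) for t in thresholds]
    return sum(pck) / len(pck)


# Hypothetical three-joint pose, coordinates in millimetres.
gt_pose = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (0.0, 200.0, 0.0)]
pred_pose = [(0.0, 0.0, 0.0), (100.0, 0.0, 30.0), (0.0, 160.0, 0.0)]

print("MPJPE (mm):", mpjpe(pred_pose, gt_pose))
print("AUC:", pck_auc(pred_pose, gt_pose))
```

Lower MPJPE and higher AUC both indicate better predictions, so an improvement means MPJPE goes down while AUC goes up.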

JIANG Youpeng, HUA Yang, SONG Xiaoning. Domain Adaptation Algorithm for 3D Human Pose Estimation with Spatial Attention and Position Optimization[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(9): 2384-2394.
