计算机科学与探索 (Journal of Frontiers of Computer Science and Technology), 2023, Vol. 17, Issue (9): 2137-2147. DOI: 10.3778/j.issn.1673-9418.2206060

• Graphics and Image •

3D Human Animation Synthesis with Transformer-CVAE

FENG Wenke, SHI Min, ZHU Dengming, LI Zhaoxin   

  1. School of Control and Computer Engineering, North China Electric Power University, Beijing 102206, China
    2. Prospective Research Laboratory, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  • Online:2023-09-01 Published:2023-09-01

Abstract: 3D human animation synthesis is a core technology in the domain of 3D animation. Traditional workflows that rely on motion capture cannot produce human animation quickly because of their complicated procedures and long authoring periods, while existing data-driven methods have limited learning capability, so the generated animations lack realism and cover relatively few motion categories. To that end, this paper presents a 3D human animation synthesis method based on a Transformer-based conditional variational auto-encoder (Transformer-CVAE). Firstly, a human motion dataset is constructed from real motion data and divided by motion category. Then, the temporal dependencies between frames in a motion sequence are modeled with the Transformer architecture, and a variational auto-encoder is further combined with the Transformer to learn the probability distribution of human motions in the latent space. Next, constraints are imposed on the latent space to control the generated body motion. Finally, a series of experiments are conducted on the AMASS, HumanAct12 and UESTC datasets, and the method is evaluated qualitatively and quantitatively in terms of visual quality and network performance. Experimental results demonstrate that, compared with the state of the art, the proposed method achieves clear improvements on metrics such as STED and RMSE while synthesizing diverse and realistic human animations.
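The abstract only outlines the architecture at a high level. The PyTorch sketch below illustrates one plausible way to realize an action-conditioned Transformer-CVAE of the kind described: a Transformer encoder maps a pose sequence plus learnable per-action tokens to the mean and log-variance of a latent Gaussian, and a Transformer decoder reconstructs the sequence from the sampled, action-conditioned latent code. The class name TransformerCVAE, the pose dimensionality, the token-based conditioning scheme and all hyperparameters are assumptions made for illustration, not the authors' published implementation.

```python
# Minimal sketch of an action-conditioned Transformer-CVAE for pose sequences.
# Assumption-heavy illustration: pose_dim, the token-based conditioning and all
# hyperparameters are hypothetical, not the method published in the paper.
import torch
import torch.nn as nn


class TransformerCVAE(nn.Module):
    def __init__(self, pose_dim=135, latent_dim=256, num_actions=12,
                 num_layers=4, num_heads=4, ff_dim=1024, max_len=128):
        super().__init__()
        self.pose_embed = nn.Linear(pose_dim, latent_dim)
        # Positional embeddings cover the sequence plus the two learnable tokens.
        self.pos_embed = nn.Parameter(torch.randn(max_len + 2, 1, latent_dim) * 0.02)
        # Per-action tokens whose encoder outputs give the latent mean / log-variance.
        self.mu_token = nn.Parameter(torch.randn(num_actions, latent_dim))
        self.logvar_token = nn.Parameter(torch.randn(num_actions, latent_dim))
        enc_layer = nn.TransformerEncoderLayer(latent_dim, num_heads, ff_dim)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        dec_layer = nn.TransformerDecoderLayer(latent_dim, num_heads, ff_dim)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        # Action-specific shift applied to the latent code before decoding.
        self.action_bias = nn.Parameter(torch.randn(num_actions, latent_dim))
        self.pose_out = nn.Linear(latent_dim, pose_dim)

    def encode(self, poses, action):
        # poses: (T, B, pose_dim); action: (B,) integer class labels.
        x = self.pose_embed(poses)                                  # (T, B, D)
        tokens = torch.stack([self.mu_token[action],
                              self.logvar_token[action]])           # (2, B, D)
        x = torch.cat([tokens, x], dim=0)                           # (T+2, B, D)
        h = self.encoder(x + self.pos_embed[: x.shape[0]])
        return h[0], h[1]                                           # mu, logvar

    def decode(self, z, action, num_frames):
        # The conditioned latent code serves as the decoder memory; time-indexed
        # positional embeddings act as the queries that produce every output frame.
        memory = (z + self.action_bias[action]).unsqueeze(0)        # (1, B, D)
        queries = self.pos_embed[:num_frames].expand(-1, z.shape[0], -1)
        return self.pose_out(self.decoder(queries, memory))         # (T, B, pose_dim)

    def forward(self, poses, action):
        mu, logvar = self.encode(poses, action)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)     # reparameterization
        recon = self.decode(z, action, poses.shape[0])
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, kl
```

Under these assumptions, training would combine the returned KL term with a pose reconstruction loss, and generation would sample z from the standard normal prior, pick an action label, and call decode for the desired number of frames. The sketch omits details such as padding masks for variable-length sequences and the SMPL-style pose representation commonly used with the datasets listed above.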

Key words: Transformer, conditional variational auto-encoder, 3D human animation, computer graphics