Journal of Frontiers of Computer Science and Technology (计算机科学与探索)

• Academic Research •

Recent Advances in Speech-Driven Gesture Generation

ZHANG Yayu (张亚宇), WEN Yuhui (温玉辉), ZHANG Xinyu (张欣雨), JING Liping (景丽萍)

  1. School of Computer Science and Technology, Beijing Jiaotong University, Beijing 100044, China
  2. Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence (Beijing Jiaotong University), Beijing 100044, China

Abstract: In interpersonal communication, gestures enrich verbal information and facilitate its delivery. Speech-driven gesture generation aims to automatically synthesize natural, realistic, and contextually appropriate gesture sequences conditioned on speech input. This research direction has attracted widespread attention in computer graphics and computer vision, and it has significant application value in film and animation production, human-computer interaction, and virtual reality. Early rule-based methods are inefficient, while regression methods, although they improve generation efficiency, tend to produce repetitive motion patterns with limited expressiveness. In recent years, generative models have further advanced the field, effectively improving the quality and diversity of generated gestures. Focusing on speech-driven gesture generation methods built on generative models, this paper summarizes and categorizes research based on generative adversarial networks, variational autoencoders, and diffusion models, and analyzes how each is applied to gesture generation along with its advantages and disadvantages. It further examines the controllability of speech-driven gesture generation with respect to emotion expression, semantic consistency, and style transfer. It then discusses research on the joint generation of facial expressions and gestures. In addition, commonly used datasets and evaluation metrics are introduced, followed by an experimental comparison of representative methods. Finally, the paper summarizes the challenges currently facing speech-driven gesture generation and outlines future research trends.

Key words: gesture generation, speech-driven, generative models, style control