Recent Advances in Speech-Driven Gesture Generation

doi:10.3778/j.issn.1673-9418.2505081

Abstract

In interpersonal communication, gestures enrich verbal information and help convey it. Speech-driven gesture generation aims to automatically synthesize natural, realistic, and contextually appropriate gesture sequences conditioned on speech input. The topic has attracted wide attention in computer graphics and computer vision, and it has significant application value in film and animation production, human-computer interaction, and virtual reality. Early rule-based methods are inefficient, while regression methods, although they improve generation efficiency, tend to produce repetitive motion patterns with limited expressiveness. In recent years, generative models have further advanced the field, improving both the quality and the diversity of generated gestures. Focusing on speech-driven gesture generation methods built on generative models, this survey summarizes and categorizes research based on generative adversarial networks, variational autoencoders, and diffusion models, and analyzes how each type of model is applied to gesture generation along with its advantages and disadvantages. It then examines the controllability of speech-driven gesture generation in terms of emotion expression, semantic consistency, and style transfer, and discusses research on the joint generation of facial expressions and gestures. In addition, it introduces commonly used datasets and evaluation metrics and presents an experimental comparison of representative methods. Finally, it summarizes the challenges currently facing speech-driven gesture generation and outlines future research trends.

Key words: gesture generation, speech-driven, generative models, style control

ZHANG Yayu, WEN Yuhui, ZHANG Xinyu, JING Liping. Recent Advances in Speech-Driven Gesture Generation[J]. Journal of Frontiers of Computer Science and Technology, DOI: 10.3778/j.issn.1673-9418.2505081.

