Journal of Frontiers of Computer Science and Technology ›› 2025, Vol. 19 ›› Issue (11): 2994-3006. DOI: 10.3778/j.issn.1673-9418.2502058

• Graphics·Image •

Few-Shot Motion Pattern Learning and Video Generation Control Strategy

FENG Sicong, PENG Li   

  1. Ministry of Education Engineering Research Center for the Application of Internet of Things Technology, School of Internet of Things Engineering, Jiangnan University, Wuxi, Jiangsu 214122, China
  • Online: 2025-11-01  Published: 2025-10-30


Abstract: With the rapid development of deep learning, diffusion models have achieved remarkable success in both image and video generation, making text-driven video generation a growing research focus. However, existing text-to-video generation models face significant challenges: high computational costs and extensive data requirements substantially limit their practical applications, and the lack of interactive editing capabilities makes it difficult to meet diverse generation needs. To address these issues, a dynamic mask-guided video generation network (DyMask-Vid) is proposed; it leverages few-shot learning and can be trained on a consumer-grade GPU with only a small amount of video data. Specifically, a text-aware masked cross-attention (TAMCA) mechanism is introduced to strengthen attention to foreground regions during training, thereby achieving precise alignment between text prompts and video content. To accommodate the first-frame generation strategy, the model also incorporates a temporal-spatial self-attention (TSSA) layer. In the generation phase, dynamic masks combined with text prompts provide fine-grained control over the output. Additionally, a temporal noise sharing strategy (TNSS) is designed for the inference stage to enhance the stability of generated videos. Extensive qualitative and quantitative experiments demonstrate that this method delivers superior performance in both video generation and editing tasks, significantly outperforming existing approaches in terms of consistency and generation quality.
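For readers who want a concrete picture of the two mechanisms named above, the PyTorch sketch below gives one plausible reading of a mask-guided text cross-attention layer in the spirit of TAMCA, and of a shared-noise initializer in the spirit of TNSS. It is a minimal illustration under stated assumptions, not the authors' implementation: the module name MaskedCrossAttention, the mask-gating formulation, the helper shared_noise, and the share_ratio and mask_weight parameters are illustrative choices not taken from the paper.

```python
# Minimal sketch, not the paper's released code. A mask-gated cross-attention
# block (TAMCA-style) and a shared-noise initializer (TNSS-style); shapes,
# names, and the gating formulation are assumptions for illustration.
import torch
import torch.nn as nn


class MaskedCrossAttention(nn.Module):
    """Cross-attention from video latent tokens to text tokens, where a
    foreground mask gates how strongly the prompt is injected per token."""

    def __init__(self, dim: int, text_dim: int, heads: int = 8, mask_weight: float = 0.5):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.mask_weight = mask_weight
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(text_dim, dim, bias=False)
        self.to_v = nn.Linear(text_dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x, text, fg_mask):
        # x:       (B, N, dim)      video latent tokens (N = H*W patches)
        # text:    (B, L, text_dim) text-encoder embeddings
        # fg_mask: (B, N)           per-token foreground probability in [0, 1]
        b, n, _ = x.shape
        q = self.to_q(x).reshape(b, n, self.heads, -1).transpose(1, 2)
        k = self.to_k(text).reshape(b, text.shape[1], self.heads, -1).transpose(1, 2)
        v = self.to_v(text).reshape(b, text.shape[1], self.heads, -1).transpose(1, 2)

        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        out = self.to_out(out)

        # Gate the text-conditioned update by the foreground mask: foreground
        # tokens receive the full prompt signal, background tokens a damped one.
        gate = (1.0 - self.mask_weight) + self.mask_weight * fg_mask[..., None]
        return gate * out


def shared_noise(batch, frames, channels, h, w, share_ratio=0.5, device="cpu"):
    """Initialise per-frame diffusion noise as a variance-preserving blend of
    one tensor shared by all frames and independent per-frame noise, so that
    consecutive frames start denoising from correlated latents."""
    base = torch.randn(batch, 1, channels, h, w, device=device)
    per_frame = torch.randn(batch, frames, channels, h, w, device=device)
    # share_ratio + (1 - share_ratio) = 1 keeps the blend at unit variance.
    return share_ratio ** 0.5 * base + (1.0 - share_ratio) ** 0.5 * per_frame
```

In a diffusion U-Net, a layer of this kind would stand in for the standard text cross-attention, with the foreground mask downsampled to the latent resolution; the shared-noise helper would replace independent per-frame noise at the start of sampling. The square-root blend keeps the initial latents at unit variance, which standard diffusion samplers assume.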

Key words: video generation, diffusion model, dynamic mask, few-shot learning
