Journal of Frontiers of Computer Science and Technology, 2025, Vol. 19, Issue (1): 79-96. DOI: 10.3778/j.issn.1673-9418.2404009

• Constructions and Applications of Large Language Models •

Survey on Applications of AIGC in Multimodal Scenarios

YUE Qi, ZHANG Chenkang   

  1. School of Automation, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
  • Online: 2025-01-01 Published: 2024-12-31

多模态场景下AIGC的应用综述

岳颀,张晨康   

  1. 西安邮电大学 自动化学院,西安 710121

Abstract: Artificial intelligence generated content (AIGC) already achieves excellent results in single-modality applications, generating text, images, video and other content, but a single-modality feature representation rarely captures all the information about a phenomenon. To improve the learning performance and generation capability of models, researchers have proposed incorporating multimodal information into AIGC. By processing and fusing multiple modalities, AIGC obtains richer contextual information, which helps models better understand and generate content. This paper discusses in detail the basic architecture, working principles and challenges of AIGC in handling multimodal problems, and classifies and summarizes the AIGC models of recent years that incorporate multimodal information. It summarizes the applications, challenges and development directions of AIGC in multimodal image generation, video generation and 3D shape generation. For image generation, the applications and limitations of generative adversarial network (GAN) models and diffusion models are discussed. For video generation, diffusion-based video generation is analyzed and methods for joint audio-video generation are discussed. For 3D shape generation, methods guided by diffusion models and neural networks are discussed. Finally, the challenges facing AIGC in multimodal applications are presented and directions for future research are outlined.

Key words: artificial intelligence generated content (AIGC), multimodal, large language model

摘要: 虽然生成式人工智能(AIGC)已经能够在单一模态应用领域取得优异成果,可以利用人工智能技术生成文字、图像、视频等内容,但单一模态的特征表示很难完整包含某个现象的完整信息。为了提高模型的学习性能和生成能力,学者们提出将多模态信息应用在AIGC中。AIGC能够对输入的多模态信息进行融合,获取更丰富的上下文信息,帮助模型更好地理解和生成内容。深入探讨了AIGC处理多模态问题的基本架构、工作原理和挑战,并对近年来与多模态信息结合的AIGC模型进行了分类和归纳。总结了AIGC在多模态图像生成、视频生成、三维形状生成等方面的应用、挑战和发展方向。在图像生成方面,讨论了生成对抗网络(GAN)模型、扩散模型等技术的应用和局限性。在视频生成方面,分析了基于扩散模型的视频生成技术,并探讨了音视频联合生成的方法。在三维形状生成方面,探讨了扩散模型和神经网络指导下的三维形状生成方法。最后提出了AIGC面临的挑战与未来潜在的研究方法。

关键词: 生成式人工智能(AIGC), 多模态, 大语言模型