Journal of Frontiers of Computer Science and Technology

• Science Researches •

Application of AIGC in multimodal scenarios: A survey

YUE Qi (岳颀), ZHANG Chenkang (张晨康)

  1. School of Automation, Xi'an University of Posts and Telecommunications, Xi'an 710121, China

Abstract: Artificial Intelligence Generated Content (AIGC) has achieved excellent results in single-modality applications, where artificial intelligence generates text, images, video, and other content. However, research has found that a single-modality feature representation can rarely capture all the information about a phenomenon. To strengthen the generative capability of AIGC, researchers have proposed incorporating multimodal information to improve the learning performance and generation capability of models. By processing and fusing inputs from multiple modalities, AIGC obtains richer contextual information, which helps models better understand and generate content. This survey discusses in detail the basic architecture, working principles, and challenges of AIGC in handling multimodal problems, and classifies and summarizes recent AIGC models that incorporate multimodal information. It then summarizes the applications, challenges, and development directions of generative artificial intelligence in multimodal image generation, video generation, and 3D shape generation. For image generation, it discusses the applications and limitations of techniques such as GAN models and diffusion models. For video generation, it analyzes diffusion-based video generation techniques and discusses methods for joint audio-video generation. For 3D shape generation, it discusses methods guided by diffusion models and neural networks. Finally, it discusses the challenges AIGC faces in multimodal applications and outlines potential directions for future research.

Key words: Artificial Intelligence Generated Content (AIGC); multimodal; model
