Survey on Applications of AIGC in Multimodal Scenarios

doi:10.3778/j.issn.1673-9418.2404009

Abstract

Artificial intelligence generated content (AIGC) already achieves excellent results in single-modality applications, generating text, images, video, and other content, but a single-modality feature representation can rarely capture all the information about a phenomenon. To strengthen generation capability, researchers have proposed incorporating multimodal information into AIGC to improve the learning performance and generation ability of models. By processing and fusing multiple modalities, AIGC obtains richer contextual information, which helps models better understand and generate content. This survey discusses in detail the basic architecture, working principles, and challenges of AIGC in handling multimodal problems, and classifies and summarizes recent AIGC models that incorporate multimodal information. It then summarizes the applications, challenges, and development directions of AIGC in multimodal image generation, video generation, and 3D shape generation. For image generation, it discusses the applications and limitations of generative adversarial network (GAN) models and diffusion models. For video generation, it analyzes diffusion-based video generation and discusses joint audio-video generation methods. For 3D shape generation, it discusses generation methods guided by diffusion models and neural networks. Finally, it outlines the challenges AIGC faces in multimodal applications and directions for future research.

Keywords: artificial intelligence generated content (AIGC), multimodal, large language model

YUE Qi, ZHANG Chenkang. Survey on Applications of AIGC in Multimodal Scenarios[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(1): 79-96.
