[1] HO J, JAIN A, ABBEEL P. Denoising diffusion probabilistic models[EB/OL]. [2024-12-04]. https://arxiv.org/abs/2006.11239.
[2] SONG J M, MENG C L, ERMON S. Denoising diffusion implicit models[EB/OL]. [2024-12-04]. https://arxiv.org/abs/2010.02502.
[3] SONG Y, SOHL-DICKSTEIN J, KINGMA D P, et al. Score-based generative modeling through stochastic differential equations[EB/OL]. [2024-12-04]. https://arxiv.org/abs/2011.13456.
[4] RAMESH A, DHARIWAL P, NICHOL A, et al. Hierarchical text-conditional image generation with CLIP latents[EB/OL]. [2024-12-04]. https://arxiv.org/abs/2204.06125.
[5] ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 10674-10685.
[6] SAHARIA C, CHAN W, SAXENA S, et al. Photorealistic text-to-image diffusion models with deep language understanding[EB/OL]. [2024-12-04]. https://arxiv.org/abs/2205.11487.
[7] HUANG W X, REN Y J, LU T L, et al. Multimodal face generation method for diffusion large models[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(10): 2815-2830.
[8] RUIZ N, LI Y Z, JAMPANI V, et al. DreamBooth: fine tuning text-to-image diffusion models for subject-driven generation[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 22500-22510.
[9] MENG C L, HE Y T, SONG Y, et al. SDEdit: guided image synthesis and editing with stochastic differential equations[EB/OL]. [2024-12-04]. https://arxiv.org/abs/2108.01073.
[10] AVRAHAMI O, FRIED O, LISCHINSKI D. Blended latent diffusion[J]. ACM Transactions on Graphics, 2023, 42(4): 1-11.
[11] HERTZ A, MOKADY R, TENENBAUM J, et al. Prompt-to-prompt image editing with cross attention control[EB/OL]. [2024-12-05]. https://arxiv.org/abs/2208.01626.
[12] BLATTMANN A, ROMBACH R, LING H, et al. Align your latents: high-resolution video synthesis with latent diffusion models[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 22563-22575.
[13] WU J Z, GE Y X, WANG X T, et al. Tune-A-Video: one-shot tuning of image diffusion models for text-to-video generation[C]//Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2023: 7589-7599.
[14] WU R Q, CHEN L Y, YANG T, et al. LAMP: learn a motion pattern for few-shot video generation[C]//Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2024: 7089-7098.
[15] ZHANG L M, RAO A Y, AGRAWALA M. Adding conditional control to text-to-image diffusion models[C]//Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2023: 3813-3824.
[16] GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks[J]. Communications of the ACM, 2020, 63(11): 139-144.
[17] ZHANG H, XU T, LI H S, et al. StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 5908-5916.
[18] XU T, ZHANG P C, HUANG Q Y, et al. AttnGAN: fine-grained text to image generation with attentional generative adversarial networks[C]//Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 1316-1324.
[19] VAN DEN OORD A, VINYALS O, KAVUKCUOGLU K. Neural discrete representation learning[C]//Advances in Neural Information Processing Systems 30, 2017: 6306-6315.
[20] SOHN K, LEE H, YAN X. Learning structured output representation using deep conditional generative models[C]//Advances in Neural Information Processing Systems 28, 2015: 3483-3491.
[21] NICHOL A, DHARIWAL P, RAMESH A, et al. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models[EB/OL]. [2024-12-05]. https://arxiv.org/abs/2112.10741.
[22] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]//Proceedings of the 38th International Conference on Machine Learning, 2021: 8748-8763.
[23] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems 30, 2017: 5998-6008.
[24] KIM Y, LEE J, KIM J H, et al. Dense text-to-image generation with attention modulation[C]//Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2023: 7667-7677.
[25] GAL R, ALALUF Y, ATZMON Y, et al. An image is worth one word: personalizing text-to-image generation using textual inversion[EB/OL]. [2024-12-06]. https://arxiv.org/abs/2208.01618.
[26] SINGER U, POLYAK A, HAYES T, et al. Make-A-Video: text-to-video generation without text-video data[EB/OL]. [2024-12-06]. https://arxiv.org/abs/2209.14792.
[27] ZHOU D Q, WANG W M, YAN H S, et al. MagicVideo: efficient video generation with latent diffusion models[EB/OL]. [2024-12-06]. https://arxiv.org/abs/2211.11018.
[28] WANG X, YUAN H J, ZHANG S W, et al. VideoComposer: compositional video synthesis with motion controllability[EB/OL]. [2024-12-06]. https://arxiv.org/abs/2306.02018.
[29] HONG W Y, DING M, ZHENG W D, et al. CogVideo: large-scale pretraining for text-to-video generation via transformers[EB/OL]. [2024-12-06]. https://arxiv.org/abs/2205.15868.
[30] GUO Y W, YANG C Y, RAO A Y, et al. AnimateDiff: animate your personalized text-to-image diffusion models without specific tuning[EB/OL]. [2024-12-06]. https://arxiv.org/abs/2307.04725.
[31] BAIN M, NAGRANI A, VAROL G, et al. Frozen in time: a joint video and image encoder for end-to-end retrieval[C]//Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 1708-1718.
[32] XUE H W, HANG T K, ZENG Y H, et al. Advancing high-resolution video-language representation with large-scale video transcriptions[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 5026-5035.
[33] HUANG N S, ZHANG Y X, DONG W M. Style-A-Video: agile diffusion for arbitrary text-based video style transfer[J]. IEEE Signal Processing Letters, 2024, 31: 1494-1498.
[34] ZHANG Y B, WEI Y X, JIANG D S, et al. ControlVideo: training-free controllable text-to-video generation[EB/OL]. [2024-12-06]. https://arxiv.org/abs/2305.13077.
[35] KHACHATRYAN L, MOVSISYAN A, TADEVOSYAN V, et al. Text2Video-Zero: text-to-image diffusion models are zero-shot video generators[C]//Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2023: 15908-15918.
[36] CHEN W F, WU J, XIE P, et al. Control-A-Video: controllable text-to-video generation with diffusion models[EB/OL]. [2024-12-06]. https://arxiv.org/abs/2305.13840.
[37] ZHANG D J, LI D X, LE H, et al. MoonShot: towards controllable video generation and editing with multimodal conditions[EB/OL]. [2024-12-06]. https://arxiv.org/abs/2401.01827.
[38] ESSER P, CHIU J, ATIGHEHCHIAN P, et al. Structure and content-guided video synthesis with diffusion models[C]//Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2023: 7312-7322.
[39] KIRILLOV A, MINTUN E, RAVI N, et al. Segment anything[C]//Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2023: 3992-4003.
[40] ZHANG R, ISOLA P, EFROS A A, et al. The unreasonable effectiveness of deep features as a perceptual metric[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 586-595.