Survey of AI Painting

doi:10.3778/j.issn.1673-9418.2401075

Abstract

Abstract: AI painting, a popular research direction in computer vision, is expanding its application boundaries in art creation, film and media, industrial design, and art education through natural language processing techniques, large image-text pre-trained models, and emerging diffusion models. This survey takes two AI painting tasks, image-to-image and text-to-image, as its main lines and analyzes the representative models and their key technologies and methods in depth. For image-to-image generation, it examines two classes of models, those based on autoencoders (AE) and those based on generative adversarial networks (GAN), tracing the development lineage, generation principles, advantages, and disadvantages of each, and summarizes their results on public datasets. For text-to-image generation, it summarizes the structural differences among three classes of models, including those based on diffusion models, along with their generation results on three datasets. It notes that text-to-image generation with diffusion models has become a current research focus, which foreshadows more diverse approaches to image generation in the future. Current mainstream AI painting platforms are then compared and summarized in terms of usage and generation speed. Finally, after summarizing the problems and controversies that AI painting faces at the technical and social levels, the survey envisions future trends such as the complementary development of AI painting and human artists, increased interactivity in the painting process, and the emergence of new professions and industries.

Key words: AI painting, image-to-image, text-to-image, image generation, artificial intelligence generated content (AIGC)

ZHANG Zeyu, WANG Tiejun, GUO Xiaoran, LONG Zhilei, XU Kui. Survey of AI Painting[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(6): 1404-1420.
