
Journal of Frontiers of Computer Science and Technology ›› 2025, Vol. 19 ›› Issue (12): 3224-3242.DOI: 10.3778/j.issn.1673-9418.2511039
• Special Issue on Theory and Technology of Multimodal Large Language Model •
ZHANG Rui, BIAN Zhipeng
Online:2025-12-01
Published:2025-12-01
ZHANG Rui, BIAN Zhipeng. Overview of Multimodal Generation for Recommender Systems[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(12): 3224-3242.
URL: http://fcst.ceaj.org/EN/10.3778/j.issn.1673-9418.2511039
| [1] | Anggeluma, WANG Siriguleng, SI Qintu. Overview of Research on Knowledge Graph Completion [J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(9): 2302-2318. |
| [2] | TIAN Chongteng, LIU Jing, WANG Xiaoyan, LI Ming. Review of Application of Large Language Models GPT in Medical Text [J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(8): 2043-2056. |
| [3] | XIA Jianglan, LI Yanling, GE Fengpei. Survey of Entity Relation Extraction Based on Large Language Models [J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(7): 1681-1698. |
| [4] | ZHANG Xin, SUN Jingchao. Review of False Information Detection Frameworks Based on Large Language Models [J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(6): 1414-1436. |
| [5] | XU Delong, LIN Min, WANG Yurong, ZHANG Shujun. Survey of NLP Data Augmentation Methods Based on Large Language Models [J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(6): 1395-1413. |
| [6] | LI Juhao, SHI Lei, DING Meng, LEI Yongsheng, ZHAO Dongyue, CHEN Long. Social Media Text Stance Detection Based on Large Language Models [J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(5): 1302-1312. |
| [7] | JIANG Hang, CAI Guoyong, LI Sihui. Sequence-to-Sequence Text Generation with Discrete Diffusion Models [J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(3): 764-773. |
| [8] | CHANG Baofa, CHE Chao, LIANG Yan. Research on Recommendation Model Based on Multi-round Dialogue of Large Language Model [J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(2): 385-395. |
| [9] | XUE Di, LI Xin, JIANG Zhangtao, WANG Xiaoyu, LIU Mingshuai. Technical Framework for Visual Question Answering System of Business Knowledge for Case-Related Property [J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(12): 3267-3278. |
| [10] | BA Zezhi, ZHANG Hui, XIE Zhenghan, ZUO Xiaodong, HOU Jianwei. Automatic Prompt Engineering Technology for Large Language Models: a Survey [J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(12): 3131-3152. |
| [11] | HUANG Jiajia, ZHU Haoran, JIANG Maowei, CHEN Yong, XU Chao. Imbalanced Instruction Filtering Strategy for Fine-Tuning Audit Large Language Models [J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(12): 3353-3367. |
| [12] | XU Guoyu, ZHANG Yidan, WEI Xiao, MAO Yangmin. Retrieval-Augmented Perception in Multimodal Large Language Models via Adaptive Routing and Dual-Threshold Pruning [J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(12): 3257-3266. |
| [13] | ZHANG Jing, HUANG Wenfeng, WU Chunjiang, TAN Hao. Overview of Knowledge Graph Construction and Reasoning Enhanced by Large Language Models [J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(11): 2855-2872. |
| [14] | FENG Sicong, PENG Li. Few-Shot Motion Pattern Learning and Video Generation Control Strategy [J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(11): 2994-3006. |
| [15] | SHI Dongyan, MA Lerong, DING Cangfeng, NING Qinwei, CAO Jiangjiang. Advances in Text Clustering Models Based on Deep Learning Approaches [J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(11): 2873-2894. |