
Journal of Frontiers of Computer Science and Technology ›› 2025, Vol. 19 ›› Issue (12): 3224-3242. DOI: 10.3778/j.issn.1673-9418.2511039
ZHANG Rui, BIAN Zhipeng
Online: 2025-12-01
Published: 2025-12-01
Abstract: With the rapid development of large language models and multimodal generative models, recommender systems are undergoing a paradigm shift from "matching existing content" to "generating personalized content", giving rise to personalized multimodal generation as an emerging research direction. Personalized multimodal generation emphasizes producing text, image, audio, or video content that matches user preferences and can be used directly in the recommendation pipeline, conditioned on users' historical behavior and generation-target instructions, thereby improving user experience and recommendation effectiveness. Although the related techniques have advanced rapidly in recent years, and existing studies have shown promising initial results in generating images, text, and other modalities, a systematic summary and unified perspective are still lacking with respect to method definitions, key techniques, task commonalities, and research paradigms. To this end, this paper presents a systematic survey of personalized multimodal generation in recommendation scenarios. It first defines the ternary modeling relationship of "preference capture, target content, and personalized generation", strictly restricting personalized multimodal generation to the following setting: within a recommender system, generating multimodal outputs (e.g., cover images, news headlines, audio or video clips) that serve directly as recommendation candidates or display content, based on personal preferences captured from users' historical behavior and profiles, rather than general open-ended text-to-image or dialogue generation tasks. It then constructs a unified technical framework organized around three core modules, "preference and target modeling", "preference injection and generator architecture", and "optimization strategies and personalized output", and summarizes typical technical approaches and application scenarios for image, text, audio, and cross-modal tasks. In addition, it critically analyzes existing evaluation metrics and their limitations in measuring personalization and recommendation effectiveness, and discusses the challenges that large multimodal models face in recommender systems regarding adaptability, inference efficiency, and safety. Finally, it outlines future research directions, aiming to provide a systematic reference for research on personalized multimodal generation.
ZHANG Rui, BIAN Zhipeng. Overview of Multimodal Generation for Recommender Systems[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(12): 3224-3242.
[1] 景丽, 郑公浩, 李晓涵, 蔚梦媛. Recommendation method based on cross-modal graph masking and feature enhancement[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(9): 2470-2478.
[2] 昂格鲁玛, 王斯日古楞, 斯琴图. Survey of knowledge graph completion research[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(9): 2302-2318.
[3] 王劲滔, 孟琪翔, 高志霖, 卜凡亮. Case information element extraction method based on instruction fine-tuning of large language models[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(8): 2161-2173.
[4] 田崇腾, 刘静, 王晓燕, 李明. Survey of applications of the GPT large language model to medical texts[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(8): 2043-2056.
[5] 夏江镧, 李艳玲, 葛凤培. Survey of entity-relation extraction based on large language models[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(7): 1681-1698.
[6] 时振普, 吕潇, 董彦如, 刘静, 王晓燕. Research on the current development of multimodal knowledge graph fusion technology in the medical domain[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(7): 1729-1746.
[7] 韩竹轩, 卜凡亮, 侯智文, 齐彬廷, 曹恩奇. Improved KGAT method for predicting the spatial behavior of terrorist organizations[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(7): 1918-1930.
[8] 崔健, 汪永伟, 李飞扬, 李强, 苏北荣, 张小健. Chinese text summarization method combined with knowledge distillation[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(7): 1899-1908.
[9] 沙潇, 王建文, 丁建川, 徐笑然. Recommendation method fusing hierarchical knowledge graph embedding and attention mechanisms[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(6): 1508-1521.
[10] 张欣, 孙靖超. Survey of misinformation detection frameworks based on large language models[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(6): 1414-1436.
[11] 许德龙, 林民, 王玉荣, 张树钧. Survey of NLP data augmentation methods based on large language models[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(6): 1395-1413.
[12] 周家旋, 柳先辉, 赵晓东, 侯文龙, 赵卫东. Self-supervised knowledge-aware recommendation model incorporating adaptive hypergraphs[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(5): 1217-1229.
[13] 何静, 沈阳, 谢润锋. Research on classifying, identifying, and mitigating hallucination in large language models[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(5): 1295-1301.
[14] 李居昊, 石磊, 丁锰, 雷永升, 赵东越, 陈泷. Stance detection in social media texts based on large language models[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(5): 1302-1312.
[15] 刘华玲, 张子龙, 彭宏帅. Survey of enhancement research for closed-source large language models[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(5): 1141-1156.