计算机科学与探索 (Journal of Frontiers of Computer Science and Technology), 2025, Vol. 19, Issue (12): 3224-3242. DOI: 10.3778/j.issn.1673-9418.2511039

• Special Topic on Theory and Technology of Multimodal Large Models •

Overview of Multimodal Generation for Recommender Systems

ZHANG Rui, BIAN Zhipeng   

  1. School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
  • Online: 2025-12-01    Published: 2025-12-01

Abstract: With the rapid development of large language models and multimodal generative models, recommender systems are shifting from matching existing content to generating personalized content, which has given rise to the emerging research direction of personalized multimodal generation. This task focuses on producing user-aligned text, image, audio, or video content from a user's historical behaviors and target instructions, so that the generated outputs can be used directly as recommendation candidates or display materials, thereby improving user experience and recommendation effectiveness. Although recent studies have shown promising results in several modalities, existing research remains fragmented and lacks a unified definition, technical formulation, and methodological structure. To address this gap, this survey provides a systematic overview of personalized multimodal generation in recommender systems. It formalizes a three-component modeling relationship consisting of preference capture, target content, and personalized generation, and strictly defines the task scope as generating multimodal outputs that serve recommendation purposes (such as cover images, news headlines, and audio or video clips) based on preferences captured from user histories and profiles, rather than general text-to-image or open-domain dialogue generation. It then constructs a unified framework covering preference and target modeling, preference injection and generator architectures, and optimization strategies for personalized outputs, and summarizes representative techniques and typical application scenarios across image, text, audio, and cross-modal generation. In addition, it analyzes the limitations of existing evaluation metrics, highlighting their insufficiency in measuring personalization and recommendation effectiveness, and discusses the challenges that large multimodal models bring to recommender systems in terms of model adaptation, inference efficiency, and system safety. Finally, the survey outlines future research directions to support the continued development of personalized multimodal generation in recommender systems.

Key words: multimodal generation, recommender systems, personalized generation, diffusion model, large language models
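
To make the three-module framework summarized in the abstract more concrete, the following minimal sketch illustrates one plausible instantiation of preference and target modeling, preference injection into a generator, and an optimization objective for personalized outputs. It is an illustrative assumption written in PyTorch, not the formulation proposed in the surveyed work; all module names, tensor shapes, the additive fusion scheme, and the preference-alignment scorer are hypothetical.

# Hypothetical sketch of the three modules described in the abstract:
# (1) preference and target modeling, (2) preference injection into a generator,
# (3) an optimization objective for personalized outputs.
import torch
import torch.nn as nn


class PreferenceEncoder(nn.Module):
    """Module 1: capture a user preference vector from the interaction history."""
    def __init__(self, num_items: int, dim: int = 64):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, seq_len) item ids -> mean-pooled preference vector
        return self.proj(self.item_emb(history).mean(dim=1))


class ConditionedGenerator(nn.Module):
    """Module 2: inject the preference vector into the generator's latent state
    via simple concatenation-based fusion (a stand-in for the conditioning used
    by real diffusion- or LLM-based generators)."""
    def __init__(self, dim: int = 64, out_dim: int = 128):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)
        self.decoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, out_dim))

    def forward(self, target: torch.Tensor, pref: torch.Tensor) -> torch.Tensor:
        # target: (batch, dim) embedding of the generation target/instruction
        latent = self.fuse(torch.cat([target, pref], dim=-1))
        return self.decoder(latent)  # stands in for pixels / tokens / audio frames


def personalization_loss(output, reference, pref, scorer, alpha: float = 0.1):
    """Module 3: reconstruction toward a reference item plus a preference-alignment
    term from a (hypothetical) scorer, mirroring 'optimization strategies for
    personalized outputs'."""
    recon = nn.functional.mse_loss(output, reference)
    align = -scorer(torch.cat([output, pref], dim=-1)).mean()
    return recon + alpha * align


if __name__ == "__main__":
    enc, gen = PreferenceEncoder(num_items=1000), ConditionedGenerator()
    scorer = nn.Sequential(nn.Linear(128 + 64, 1))      # toy preference-alignment scorer
    history = torch.randint(0, 1000, (4, 10))           # 4 users, 10 past items each
    target = torch.randn(4, 64)                         # embedded generation target
    reference = torch.randn(4, 128)                     # ground-truth content embedding
    out = gen(target, enc(history))
    print(personalization_loss(out, reference, enc(history), scorer).item())

In practice the additive fusion above would be replaced by the conditioning mechanism of the chosen generator, for example cross-attention conditioning in a diffusion model or preference-derived soft prompts for a large language model, as discussed in the survey.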