Journal of Frontiers of Computer Science and Technology (计算机科学与探索)

• Academic Research •

Multimodal Face Generation Method for Diffusion Large Models

HUANG Wanxin,  REN Yingjie,  LU Tianliang,  YANG Gang,  YUAN Mengjiao,  ZENG Gaojun   

  1. College of Information and Cyber Security, People's Public Security University of China, Beijing 100038, China
    2. Public Security Large Model Research and Application Laboratory, People's Public Security University of China, Beijing 100038, China
    3. Network Security Bureau, Ministry of Public Security of the People's Republic of China, Beijing 100741, China

Abstract: Face generation is a cutting-edge topic in computer vision with broad application prospects in areas such as criminal investigation and virtual reality. In recent years, diffusion models have exhibited outstanding generative capability and can produce images with high semantic consistency under given conditions, so applying them to face generation has become a new trend. However, among existing approaches, methods based on conventional diffusion models understand the details of the conditional information insufficiently and fail to fully exploit it for precise face generation, while methods based on large diffusion models typically either consume substantial computational resources to fine-tune the model or attach additional complex networks without achieving a balanced fusion of multimodal conditional information. To address these challenges, this paper proposes MA-adapter, a multimodal face generation method for large diffusion models: a small, streamlined network is added to extract visual structural information and fuse it with semantic guidance so that the large diffusion model generates faces precisely, exploiting its generative capability while avoiding the heavy computational cost of fine-tuning. The model first enhances the image-modality prompt with a multi-head attention module (MAM), making the model focus more on key information; it then extracts multi-scale feature information with a multi-scale feature module (MFM), providing the basis for precise generation guidance; finally, an adaptive adjustment mechanism (AAM) is designed to adaptively tune the generation guidance coefficients of different feature layers for better performance. Experimental results on the MM-CelebA-HQ (Multi-Modal-CelebA-HQ) dataset show that, compared with the mainstream method T2I-adapter, MA-adapter reduces the perceptual similarity metric LPIPS by approximately 18.4%, improves the image-text matching metric CLIP-Score by about 13.6%, and increases the feature similarity metric CLIP-I by approximately 14.8%. Extensive experimental results fully validate the effectiveness and superiority of MA-adapter.
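
The abstract describes the MA-adapter pipeline (MAM, MFM, and AAM attached to a frozen diffusion backbone) but not its implementation. Below is a minimal PyTorch-style sketch, written for illustration only, of how such an adapter could be structured; every module shape, the downsampling scheme, the channel widths, and the way the per-level guidance coefficients are realized are assumptions of this sketch, not the authors' released code.

```python
# Illustrative sketch only: a compact adapter that turns an image-modality prompt
# (e.g., a sketch or segmentation map) into multi-scale residual features that are
# added to a frozen diffusion U-Net's encoder features. Module names (MAM/MFM/AAM)
# follow the abstract; every architectural detail below is an assumption.
import torch
import torch.nn as nn


class MAM(nn.Module):
    """Multi-head self-attention over flattened prompt features to emphasize key regions."""
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attn_out)            # residual self-attention
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class MFM(nn.Module):
    """Multi-scale feature extractor: one guidance map per U-Net resolution level."""
    def __init__(self, in_channels: int, level_channels: list[int]):
        super().__init__()
        self.blocks = nn.ModuleList()
        prev = in_channels
        for ch in level_channels:
            self.blocks.append(nn.Sequential(
                nn.Conv2d(prev, ch, 3, stride=2, padding=1),  # downsample one level
                nn.SiLU(),
                nn.Conv2d(ch, ch, 3, padding=1),
            ))
            prev = ch

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        return feats


class MAAdapterSketch(nn.Module):
    """MAM -> MFM -> AAM: per-level guidance features with learnable scale coefficients."""
    def __init__(self, in_channels: int = 3, base: int = 64,
                 level_channels: tuple[int, ...] = (320, 640, 1280, 1280)):
        # level_channels mimic a Stable Diffusion-style U-Net encoder (an assumption).
        super().__init__()
        self.stem = nn.Conv2d(in_channels, base, 3, padding=1)
        self.mam = MAM(base)
        self.mfm = MFM(base, list(level_channels))
        # AAM simplified here as one learnable guidance coefficient per feature level.
        self.aam = nn.Parameter(torch.ones(len(level_channels)))

    def forward(self, prompt_image: torch.Tensor) -> list[torch.Tensor]:
        x = self.mam(self.stem(prompt_image))
        feats = self.mfm(x)
        return [w * f for w, f in zip(self.aam, feats)]
```

In use, each returned feature map would typically be added to the matching encoder feature of the frozen diffusion U-Net at every denoising step, so only the small adapter is trained while the large backbone stays fixed; the backbone itself is omitted from this sketch.
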

Key words: face generation, multimodal, diffusion model, intelligent generation, attention mechanism
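
For context on the reported numbers, LPIPS, CLIP-Score, and CLIP-I are usually computed with the lpips package and a pretrained CLIP model. The sketch below shows one conventional way to compute them; it is not the paper's evaluation code, and the CLIP checkpoint, preprocessing, and score scaling are assumptions that vary across papers.

```python
# Illustrative sketch: conventional computation of LPIPS, CLIP-Score, and CLIP-I.
# Not the paper's evaluation code; checkpoint choice and scaling conventions vary.
import torch
import lpips                                   # pip install lpips
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

lpips_fn = lpips.LPIPS(net="alex")             # perceptual similarity network
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def lpips_distance(img_a: torch.Tensor, img_b: torch.Tensor) -> float:
    """Inputs are NCHW tensors scaled to [-1, 1]; lower means more similar."""
    with torch.no_grad():
        return lpips_fn(img_a, img_b).item()


def clip_score(image: Image.Image, text: str) -> float:
    """Cosine similarity between CLIP image and text embeddings (image-text match)."""
    inputs = proc(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                         attention_mask=inputs["attention_mask"])
    return torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()


def clip_i(generated: Image.Image, reference: Image.Image) -> float:
    """Cosine similarity between CLIP embeddings of two images (feature similarity)."""
    inputs = proc(images=[generated, reference], return_tensors="pt")
    with torch.no_grad():
        emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
    return torch.nn.functional.cosine_similarity(emb[0:1], emb[1:2]).item()
```
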