Journal of Frontiers of Computer Science and Technology, 2024, Vol. 18, Issue (12): 3335-3347. DOI: 10.3778/j.issn.1673-9418.2311053

• Network · Security •

Deepfake Detection Method Integrating Multiple Parameter-Efficient Fine-Tuning Techniques

ZHANG Yiwen, CAI Manchun, CHEN Yonghao, ZHU Yi, YAO Lifeng   

  1. College of Information and Cyber Security, People’s Public Security University of China, Beijing 100038, China
  • Online: 2024-12-01   Published: 2024-11-29

Abstract: In recent years, as deepfake technology has matured, face-swapping applications and synthesized videos have become ubiquitous. While these techniques offer entertainment, they also give malicious actors opportunities for abuse; consequently, deepfake detection technology has grown markedly in importance. Existing deepfake detection methods commonly suffer from poor cross-compression robustness, weak cross-dataset generalization, and high training overhead. To address these challenges, this paper proposes a deepfake detection approach that integrates multiple parameter-efficient fine-tuning techniques. The method uses a vision Transformer pretrained with masked image modeling (MIM) self-supervision as its backbone. First, it applies the low-rank adaptation (LoRA) method to fine-tune the parameters of the pretrained model's self-attention modules, while a convolutional adapter added in a parallel structure captures local texture information, enhancing the model's adaptability to the deepfake detection task. Next, classical adapters introduced in a serial structure fine-tune the pretrained model's feed-forward network, making full use of the knowledge acquired during pretraining. Finally, a multi-layer perceptron replaces the original classification head to perform deepfake detection. Experimental results on six mainstream datasets show that, with only 2×10⁷ trainable parameters, the model achieves an average frame-level AUC of approximately 0.996. In cross-compression experiments, the average frame-level AUC drop is 0.135; in cross-dataset generalization experiments, the frame-level AUC averages approximately 0.765.
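The abstract describes three parameter-efficient fine-tuning (PEFT) components attached to a frozen MIM-pretrained vision Transformer: LoRA on the self-attention projections, a parallel convolutional adapter for local texture, and a serial bottleneck adapter after the feed-forward network. The following is a minimal PyTorch sketch of one common way to build such components; the module names, initializations, and hyperparameters (rank r=8, bottleneck width 64, 14×14 token grid) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Frozen pretrained projection plus a trainable low-rank update:
    # y = W0 x + (alpha / r) * B(A(x)), with A: dim -> r and B: r -> dim.
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # backbone weights stay frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)        # low-rank update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

class ConvAdapter(nn.Module):
    # Parallel convolutional adapter: fold the patch-token sequence back into
    # a 2-D grid and apply a depthwise 3x3 convolution to capture local texture.
    def __init__(self, dim: int, grid: int = 14):
        super().__init__()
        self.grid = grid
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, tokens):                    # tokens: (B, N, C), N == grid**2
        b, n, c = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, c, self.grid, self.grid)
        return self.conv(x).flatten(2).transpose(1, 2)

class BottleneckAdapter(nn.Module):
    # Classical serial adapter placed after the feed-forward network:
    # down-projection, nonlinearity, up-projection, residual connection.
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)              # adapter branch starts as identity

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

# Wiring inside one Transformer block (patch tokens only, schematically):
#   x = x + attn_with_lora(norm1(x)) + conv_adapter(norm1(x))   # parallel branch
#   x = bottleneck_adapter(x + ffn(norm2(x)))                   # serial adapter
```

In a setup like this, only the LoRA matrices, the two kinds of adapters, and the replacement MLP head are trainable while the backbone stays frozen, which is consistent with the reported budget of roughly 2×10⁷ trainable parameters.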

Key words: deepfakes, vision Transformer, self-supervised pretrained models, low-rank adaptation (LoRA), parameter-efficient fine-tuning

Abstract (Chinese version): In recent years, as deepfake technology has matured, face-swapping applications and synthesized videos can be found everywhere. Although deepfake technology brings entertainment, it also offers malicious actors opportunities for abuse, so the importance of deepfake detection technology has become increasingly prominent. Existing deepfake detection methods commonly suffer from poor cross-compression robustness, poor cross-dataset generalization, and high training overhead. To address these problems, a deepfake detection method integrating multiple parameter-efficient fine-tuning techniques is proposed. A vision Transformer (visual self-attention model) pretrained with the masked image modeling (MIM) self-supervised method serves as the backbone. A low-rank adaptation method improved with the Kronecker product fine-tunes the parameters of the pretrained model's self-attention modules, while a convolutional adapter is added in a parallel structure to learn local image texture information, strengthening the pretrained model's adaptability to the deepfake detection task. A classical adapter is introduced in a serial structure to fine-tune the pretrained model's feed-forward network so as to fully exploit the knowledge learned during pretraining, and a multi-layer perceptron replaces the original classification head to perform deepfake detection. Experimental results show that, with only 2×10⁷ trainable parameters, the model achieves an average frame-level AUC of about 0.996 on six mainstream datasets. In cross-compression experiments, the average drop in frame-level AUC is 0.135; in cross-dataset generalization experiments, the frame-level AUC reaches an average of 0.765.
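The Chinese abstract additionally specifies that the low-rank adaptation is improved with a Kronecker product. One common reading (in the spirit of KronA/LoKr-style methods) replaces LoRA's update ΔW = BA with ΔW = A ⊗ B, which can represent higher-rank updates at a comparable parameter cost. The sketch below is a hedged illustration of that idea only; the class name KronLinear and the factor shapes are chosen for exposition, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KronLinear(nn.Module):
    # Frozen linear layer with a Kronecker-product weight update:
    # y = W0 x + s * (A kron B) x, where A is (p, q), B is (m, n),
    # and p*m == out_features, q*n == in_features.
    def __init__(self, base: nn.Linear, p: int = 16, q: int = 16, scale: float = 1.0):
        super().__init__()
        assert base.out_features % p == 0 and base.in_features % q == 0
        self.base = base
        for w in self.base.parameters():
            w.requires_grad = False               # pretrained weights stay frozen
        m, n = base.out_features // p, base.in_features // q
        self.a = nn.Parameter(torch.zeros(p, q))  # zero init: no update at start
        self.b = nn.Parameter(torch.randn(m, n) * 0.02)
        self.scale = scale

    def forward(self, x):
        delta = torch.kron(self.a, self.b)        # (out_features, in_features)
        return self.base(x) + self.scale * F.linear(x, delta)

# Example: wrapping a hypothetical 768-dim projection, A and B together hold
# 16*16 + 48*48 = 2560 parameters versus 768*768 for a full update matrix.
# proj = KronLinear(nn.Linear(768, 768), p=16, q=16)
```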
