Journal of Frontiers of Computer Science and Technology, 2024, Vol. 18, Issue (12): 3335-3347. DOI: 10.3778/j.issn.1673-9418.2311053

• Network · Security •

Deepfake Detection Method Integrating Multiple Parameter-Efficient Fine-Tuning Techniques

ZHANG Yiwen, CAI Manchun, CHEN Yonghao, ZHU Yi, YAO Lifeng   

  1. College of Information and Cyber Security, People’s Public Security University of China, Beijing 100038, China
  • Online: 2024-12-01   Published: 2024-11-29

Abstract: In recent years, as deepfake technology has matured, face-swapping applications and synthesized videos have become ubiquitous. While these techniques offer entertainment, they also give malicious actors opportunities for abuse; consequently, deepfake detection technology has grown markedly in importance. Existing deepfake detection methods commonly suffer from poor cross-compression robustness, weak cross-dataset generalization, and high training overhead. To address these challenges, this paper proposes a deepfake detection approach that integrates multiple parameter-efficient fine-tuning techniques. The method uses a vision Transformer pretrained with masked image modeling (MIM) self-supervision as its backbone. First, it applies the low-rank adaptation (LoRA) method to fine-tune the parameters of the pretrained model's self-attention modules, while a convolutional adapter added in a parallel structure captures local texture information, enhancing the model's adaptability to the deepfake detection task. Next, classical adapters introduced in a serial structure fine-tune the pretrained model's feed-forward network, making full use of the knowledge acquired during pretraining. Finally, a multi-layer perceptron replaces the original classification head to perform deepfake detection. Experimental results on six mainstream datasets show that, with only 2×10⁷ trainable parameters, the model achieves an average frame-level AUC of approximately 0.996. In cross-compression experiments, the average frame-level AUC drop is 0.135; in cross-dataset generalization experiments, the frame-level AUC averages approximately 0.765.
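The abstract describes three parameter-efficient fine-tuning (PEFT) components attached to a frozen MIM-pretrained vision Transformer: LoRA on the self-attention projections, a parallel convolutional adapter for local texture, and a serial bottleneck adapter after the feed-forward network. The following is a minimal PyTorch sketch of one common way to build such components; the module names, initializations, and hyperparameters (rank r=8, bottleneck width 64, 14×14 token grid) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Frozen pretrained projection plus a trainable low-rank update:
    # y = W0 x + (alpha / r) * B(A(x)), with A: dim -> r and B: r -> dim.
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # backbone weights stay frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)        # low-rank update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

class ConvAdapter(nn.Module):
    # Parallel convolutional adapter: fold the patch-token sequence back into
    # a 2-D grid and apply a depthwise 3x3 convolution to capture local texture.
    def __init__(self, dim: int, grid: int = 14):
        super().__init__()
        self.grid = grid
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, tokens):                    # tokens: (B, N, C), N == grid**2
        b, n, c = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, c, self.grid, self.grid)
        return self.conv(x).flatten(2).transpose(1, 2)

class BottleneckAdapter(nn.Module):
    # Classical serial adapter placed after the feed-forward network:
    # down-projection, nonlinearity, up-projection, residual connection.
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)              # adapter branch starts as identity

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

# Wiring inside one Transformer block (patch tokens only, schematically):
#   x = x + attn_with_lora(norm1(x)) + conv_adapter(norm1(x))   # parallel branch
#   x = bottleneck_adapter(x + ffn(norm2(x)))                   # serial adapter
```

In a setup like this, only the LoRA matrices, the two kinds of adapters, and the replacement MLP head are trainable while the backbone stays frozen, which is consistent with the reported budget of roughly 2×10⁷ trainable parameters.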

Key words: deepfakes, vision Transformer, self-supervised pretrained models, low-rank adaptation (LoRA), parameter-efficient fine-tuning

Abstract (Chinese version): In recent years, as deepfake technology has matured, face-swapping applications and synthesized videos can be found everywhere. Although deepfake technology brings entertainment, it also offers malicious actors opportunities for abuse, so the importance of deepfake detection technology has become increasingly prominent. Existing deepfake detection methods commonly suffer from poor cross-compression robustness, poor cross-dataset generalization, and high training overhead. To address these problems, a deepfake detection method integrating multiple parameter-efficient fine-tuning techniques is proposed. A vision Transformer (visual self-attention model) pretrained with the masked image modeling (MIM) self-supervised method serves as the backbone. A low-rank adaptation method improved with the Kronecker product fine-tunes the parameters of the pretrained model's self-attention modules, while a convolutional adapter is added in a parallel structure to learn local image texture information, strengthening the pretrained model's adaptability to the deepfake detection task. A classical adapter is introduced in a serial structure to fine-tune the pretrained model's feed-forward network so as to fully exploit the knowledge learned during pretraining, and a multi-layer perceptron replaces the original classification head to perform deepfake detection. Experimental results show that, with only 2×10⁷ trainable parameters, the model achieves an average frame-level AUC of about 0.996 on six mainstream datasets. In cross-compression experiments, the average drop in frame-level AUC is 0.135; in cross-dataset generalization experiments, the frame-level AUC reaches an average of 0.765.
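The Chinese abstract additionally specifies that the low-rank adaptation is improved with a Kronecker product. One common reading (in the spirit of KronA/LoKr-style methods) replaces LoRA's update ΔW = BA with ΔW = A ⊗ B, which can represent higher-rank updates at a comparable parameter cost. The sketch below is a hedged illustration of that idea only; the class name KronLinear and the factor shapes are chosen for exposition, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KronLinear(nn.Module):
    # Frozen linear layer with a Kronecker-product weight update:
    # y = W0 x + s * (A kron B) x, where A is (p, q), B is (m, n),
    # and p*m == out_features, q*n == in_features.
    def __init__(self, base: nn.Linear, p: int = 16, q: int = 16, scale: float = 1.0):
        super().__init__()
        assert base.out_features % p == 0 and base.in_features % q == 0
        self.base = base
        for w in self.base.parameters():
            w.requires_grad = False               # pretrained weights stay frozen
        m, n = base.out_features // p, base.in_features // q
        self.a = nn.Parameter(torch.zeros(p, q))  # zero init: no update at start
        self.b = nn.Parameter(torch.randn(m, n) * 0.02)
        self.scale = scale

    def forward(self, x):
        delta = torch.kron(self.a, self.b)        # (out_features, in_features)
        return self.base(x) + self.scale * F.linear(x, delta)

# Example: wrapping a hypothetical 768-dim projection, A and B together hold
# 16*16 + 48*48 = 2560 parameters versus 768*768 for a full update matrix.
# proj = KronLinear(nn.Linear(768, 768), p=16, q=16)
```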
