Journal of Frontiers of Computer Science and Technology ›› 2025, Vol. 19 ›› Issue (4): 1076-1086. DOI: 10.3778/j.issn.1673-9418.2404088

• Artificial Intelligence · Pattern Recognition •

End-to-End Synthetic Speech Detection Based on Improved Deep Residual Shrinkage Networks

ZENG Gaojun, LU Tianliang, REN Yingjie, LI Yujin, PENG Shufan   

  1. School of Information and Cyber Security, People’s Public Security University of China, Beijing 100038, China
    2. Network Security Bureau, Ministry of Public Security of the People’s Republic of China, Beijing 100741, China
  • Online: 2025-04-01  Published: 2025-03-28

Abstract: The misuse of synthetic speech has led to numerous real-world problems, and research on the corresponding detection (anti-spoofing) techniques is of great significance for protecting citizens’ personal and property safety and for safeguarding social and national security. Traditional synthetic speech detection typically combines handcrafted features with a back-end classifier: designing the front-end handcrafted features requires complex prior knowledge, a model built on a single handcrafted feature yields unsatisfactory detection results, and fusing multiple features leads to a large number of model parameters. Moreover, most existing detection methods generalize poorly across datasets. To address these issues, an end-to-end synthetic speech detection method based on an improved deep residual shrinkage network is proposed. Firstly, a channel attention mechanism is fused into a redesigned adaptive threshold learning module, improving the accuracy of threshold learning. Secondly, a frame attention module is designed and introduced to assign different levels of attention to different frames, enhancing the model’s feature selection capability. Then, an improved wavelet threshold function with two hyperparameters is designed and introduced to strengthen the thresholding module’s ability to suppress irrelevant features. Finally, an end-to-end synthetic speech detection network based on the improved deep residual shrinkage network is designed, which takes raw speech as input and determines whether it is synthetic. Comparative experiments on the ASVspoof2019 LA dataset show that the proposed method reduces the equal error rate (EER) and the minimum tandem detection cost function (min t-DCF) of the baseline model by 85% and 84%, respectively. Cross-database testing on the ASVspoof2015 LA dataset verifies the generalization capability of the proposed method.
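The paper’s implementation is not reproduced on this abstract page, so the following PyTorch sketch is only an illustration of the kind of architecture the abstract describes: a residual shrinkage block whose per-channel threshold is learned by an SE-style channel attention branch, a frame attention module that weights time frames before pooling, a two-hyperparameter threshold function, and an end-to-end model that takes the raw waveform as input. Every module name, layer size, and the exact form of improved_threshold (including its alpha and beta parameters) is an assumption made for demonstration, not the authors’ implementation.

```python
# Illustrative sketch only: all module names, layer sizes, and the exact form of
# improved_threshold below are assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


def improved_threshold(x, tau, alpha=0.05, beta=3.0):
    # Assumed two-hyperparameter compromise between hard and soft thresholding:
    # values with |x| <= tau are zeroed; larger values are shrunk by an amount
    # that decays with |x|, reducing the constant bias of plain soft thresholding.
    mag = x.abs()
    keep = (mag > tau).float()
    shrink = tau * (1 - alpha) * torch.exp(-beta * F.relu(mag - tau))
    return keep * torch.sign(x) * (mag - shrink)


class ChannelThreshold(nn.Module):
    """Adaptive per-channel threshold learned by an SE-style channel attention
    branch, in the spirit of a deep residual shrinkage network."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                          # x: (B, C, T)
        abs_mean = x.abs().mean(dim=-1)            # (B, C) mean magnitude per channel
        scale = self.fc(abs_mean)                  # (B, C) attention weights in (0, 1)
        tau = (abs_mean * scale).unsqueeze(-1)     # (B, C, 1) per-channel thresholds
        return improved_threshold(x, tau)


class FrameAttention(nn.Module):
    """Scores every time frame and pools the sequence with those weights."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, x):                          # x: (B, C, T)
        w = torch.softmax(self.score(x), dim=-1)   # (B, 1, T) frame weights
        return (x * w).sum(dim=-1)                 # (B, C) attention-pooled embedding


class ShrinkageBlock(nn.Module):
    """Residual block whose residual branch is denoised by adaptive thresholding
    before being added back to the identity path."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm1d(channels)
        self.threshold = ChannelThreshold(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.threshold(out)                  # suppress irrelevant features
        return F.relu(out + x)


class E2EDetector(nn.Module):
    """Minimal end-to-end detector: raw waveform in, bona fide/spoof logits out."""
    def __init__(self, channels=64, blocks=4):
        super().__init__()
        self.frontend = nn.Conv1d(1, channels, kernel_size=128, stride=64)
        self.body = nn.Sequential(*[ShrinkageBlock(channels) for _ in range(blocks)])
        self.pool = FrameAttention(channels)
        self.head = nn.Linear(channels, 2)

    def forward(self, wav):                        # wav: (B, 1, num_samples)
        x = F.relu(self.frontend(wav))             # learned frame-level features
        x = self.body(x)
        return self.head(self.pool(x))


if __name__ == "__main__":
    model = E2EDetector()
    logits = model(torch.randn(2, 1, 64000))       # two 4-second clips at 16 kHz
    print(logits.shape)                            # torch.Size([2, 2])
```

Replacing improved_threshold with plain soft thresholding, sign(x)·max(|x|−τ, 0), would recover the thresholding of a standard deep residual shrinkage network; the two extra hyperparameters in the sketch control how quickly the shrinkage bias decays for large activations.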

Key words: synthetic speech detection, deep residual shrinkage networks, frame attention, wavelet threshold function