Journal of Frontiers of Computer Science and Technology ›› 2025, Vol. 19 ›› Issue (4): 1076-1086. DOI: 10.3778/j.issn.1673-9418.2404088

• Artificial Intelligence · Pattern Recognition •

End-to-End Synthetic Speech Detection Based on Improved Deep Residual Shrinkage Networks

ZENG Gaojun, LU Tianliang, REN Yingjie, LI Yujin, PENG Shufan   

  1. School of Information and Cyber Security, People’s Public Security University of China, Beijing 100038, China
    2. Network Security Bureau, Ministry of Public Security of the People’s Republic of China, Beijing 100741, China
  • Online: 2025-04-01  Published: 2025-03-28

Abstract: The misuse of synthetic speech has led to numerous real-world problems, and research on the corresponding detection (anti-spoofing) techniques is of great significance for protecting citizens’ personal and property safety and for safeguarding social and national security. Traditional synthetic speech detection typically combines handcrafted features with a back-end classifier: designing the front-end handcrafted features requires complex prior knowledge, a model built on a single handcrafted feature yields unsatisfactory detection results, and fusing multiple features leads to a large number of model parameters. Moreover, most existing detection methods generalize poorly across datasets. To address these issues, an end-to-end synthetic speech detection method based on an improved deep residual shrinkage network is proposed. Firstly, a channel attention mechanism is fused into a redesigned adaptive threshold learning module, improving the accuracy of threshold learning. Secondly, a frame attention module is designed and introduced to assign different levels of attention to different frames, enhancing the model’s feature selection capability. Then, an improved wavelet threshold function with two hyperparameters is designed and introduced to strengthen the thresholding module’s ability to suppress irrelevant features. Finally, an end-to-end synthetic speech detection network based on the improved deep residual shrinkage network is designed, which takes raw speech as input and determines whether it is synthetic. Comparative experiments on the ASVspoof2019 LA dataset show that the proposed method reduces the equal error rate (EER) and the minimum tandem detection cost function (min t-DCF) of the baseline model by 85% and 84%, respectively. Cross-database testing on the ASVspoof2015 LA dataset verifies the generalization capability of the proposed method.
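The paper’s implementation is not reproduced on this abstract page, so the following PyTorch sketch is only an illustration of the kind of architecture the abstract describes: a residual shrinkage block whose per-channel threshold is learned by an SE-style channel attention branch, a frame attention module that weights time frames before pooling, a two-hyperparameter threshold function, and an end-to-end model that takes the raw waveform as input. Every module name, layer size, and the exact form of improved_threshold (including its alpha and beta parameters) is an assumption made for demonstration, not the authors’ implementation.

```python
# Illustrative sketch only: all module names, layer sizes, and the exact form of
# improved_threshold below are assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


def improved_threshold(x, tau, alpha=0.05, beta=3.0):
    # Assumed two-hyperparameter compromise between hard and soft thresholding:
    # values with |x| <= tau are zeroed; larger values are shrunk by an amount
    # that decays with |x|, reducing the constant bias of plain soft thresholding.
    mag = x.abs()
    keep = (mag > tau).float()
    shrink = tau * (1 - alpha) * torch.exp(-beta * F.relu(mag - tau))
    return keep * torch.sign(x) * (mag - shrink)


class ChannelThreshold(nn.Module):
    """Adaptive per-channel threshold learned by an SE-style channel attention
    branch, in the spirit of a deep residual shrinkage network."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                          # x: (B, C, T)
        abs_mean = x.abs().mean(dim=-1)            # (B, C) mean magnitude per channel
        scale = self.fc(abs_mean)                  # (B, C) attention weights in (0, 1)
        tau = (abs_mean * scale).unsqueeze(-1)     # (B, C, 1) per-channel thresholds
        return improved_threshold(x, tau)


class FrameAttention(nn.Module):
    """Scores every time frame and pools the sequence with those weights."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, x):                          # x: (B, C, T)
        w = torch.softmax(self.score(x), dim=-1)   # (B, 1, T) frame weights
        return (x * w).sum(dim=-1)                 # (B, C) attention-pooled embedding


class ShrinkageBlock(nn.Module):
    """Residual block whose residual branch is denoised by adaptive thresholding
    before being added back to the identity path."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm1d(channels)
        self.threshold = ChannelThreshold(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.threshold(out)                  # suppress irrelevant features
        return F.relu(out + x)


class E2EDetector(nn.Module):
    """Minimal end-to-end detector: raw waveform in, bona fide/spoof logits out."""
    def __init__(self, channels=64, blocks=4):
        super().__init__()
        self.frontend = nn.Conv1d(1, channels, kernel_size=128, stride=64)
        self.body = nn.Sequential(*[ShrinkageBlock(channels) for _ in range(blocks)])
        self.pool = FrameAttention(channels)
        self.head = nn.Linear(channels, 2)

    def forward(self, wav):                        # wav: (B, 1, num_samples)
        x = F.relu(self.frontend(wav))             # learned frame-level features
        x = self.body(x)
        return self.head(self.pool(x))


if __name__ == "__main__":
    model = E2EDetector()
    logits = model(torch.randn(2, 1, 64000))       # two 4-second clips at 16 kHz
    print(logits.shape)                            # torch.Size([2, 2])
```

Replacing improved_threshold with plain soft thresholding, sign(x)·max(|x|−τ, 0), would recover the thresholding of a standard deep residual shrinkage network; the two extra hyperparameters in the sketch control how quickly the shrinkage bias decays for large activations.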

Key words: synthetic speech detection, deep residual shrinkage networks, frame attention, wavelet threshold function