Speaker Verification Combining Total Variability Space and Time Delay Neural Network

doi:10.3778/j.issn.1673-9418.2006057

Abstract

Under short-utterance conditions, the total variability space estimates the probability distribution of the speech poorly, which degrades speaker verification performance. To address this problem, a method is proposed that enhances speaker identity vectors by combining the total variability space with a time delay neural network (TDNN). The aim is to learn the linear correlation between the total variability space and the TDNN, extract speaker embeddings from both, project them into a new space, and combine them into a new speaker supervector that carries more speaker information. In the training phase, the total variability space and the TDNN are trained separately. A separate set of unrelated speakers is then assembled, i-vectors and x-vectors are extracted from it, and the projection matrix is obtained by canonical correlation analysis (CCA). In the enrollment and test phases, the embeddings of the enrollment and test speakers are extracted, mapped into the new space through the projection matrix, and combined to enhance the speaker identity information. Experiments show that, with short enrollment utterances and short test utterances, the fused vector achieves a clearly lower equal error rate than both the baseline i-vector and the baseline x-vector.

Key words: total variability space, time delay neural network (TDNN), canonical correlation analysis (CCA), short utterance
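The fusion step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`fit_cca`, `fuse`), the regularization constant, the synthetic stand-in data, and the choice to combine the two projected vectors by concatenation are all assumptions made for illustration. The abstract states only that the projected embeddings are combined into a new vector.

```python
import numpy as np


def inv_sqrt(mat):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T


def fit_cca(x, y, dim, reg=1e-6):
    """Estimate CCA projections from paired, row-wise embeddings.

    x: (n, dx) array standing in for i-vectors.
    y: (n, dy) array standing in for x-vectors of the same utterances.
    reg is a small ridge term (an assumption) that keeps the covariances invertible.
    """
    mean_x, mean_y = x.mean(axis=0), y.mean(axis=0)
    xc, yc = x - mean_x, y - mean_y
    n = x.shape[0]
    cxx = xc.T @ xc / (n - 1) + reg * np.eye(x.shape[1])
    cyy = yc.T @ yc / (n - 1) + reg * np.eye(y.shape[1])
    cxy = xc.T @ yc / (n - 1)
    wx, wy = inv_sqrt(cxx), inv_sqrt(cyy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    u, s, vt = np.linalg.svd(wx @ cxy @ wy)
    a = wx @ u[:, :dim]      # projection for the first view
    b = wy @ vt.T[:, :dim]   # projection for the second view
    return mean_x, mean_y, a, b, s[:dim]


def fuse(ivec, xvec, model):
    """Project both embeddings and concatenate them (concatenation is assumed)."""
    mean_x, mean_y, a, b, _ = model
    return np.concatenate([(ivec - mean_x) @ a, (xvec - mean_y) @ b], axis=-1)


# Synthetic stand-in data: two noisy linear views of one shared latent factor.
rng = np.random.default_rng(0)
latent = rng.standard_normal((500, 4))
ivecs = latent @ rng.standard_normal((4, 10)) + 0.1 * rng.standard_normal((500, 10))
xvecs = latent @ rng.standard_normal((4, 8)) + 0.1 * rng.standard_normal((500, 8))

model = fit_cca(ivecs, xvecs, dim=4)
fused = fuse(ivecs[0], xvecs[0], model)
print(fused.shape, model[4])
```

In a real system the two input arrays would hold i-vectors and x-vectors extracted from the same utterances of the held-out speaker set, and the fitted projections would then be applied unchanged to the enrollment and test embeddings.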

瞿于荃, 龙华, 段荧, 邵玉斌, 杜庆治. 联合总变率空间和时延神经网络的说话人识别[J]. 计算机科学与探索, 2021, 15(7): 1255-1264.

QU Yuquan, LONG Hua, DUAN Ying, SHAO Yubin, DU Qingzhi. Speaker Verification Combining Total Variability Space and Time Delay Neural Network[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(7): 1255-1264.
