Journal of Frontiers of Computer Science and Technology ›› 2021, Vol. 15 ›› Issue (7): 1255-1264. DOI: 10.3778/j.issn.1673-9418.2006057

• Artificial Intelligence •

Speaker Verification Combining Total Variability Space and Time Delay Neural Network

QU Yuquan, LONG Hua, DUAN Ying, SHAO Yubin, DU Qingzhi

  1. College of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650000, China
  2. National Key Laboratory of Computer Science of Yunnan Province, Kunming University of Science and Technology, Kunming 650000, China
  • Online: 2021-07-01   Published: 2021-07-09

Abstract:

With short utterances, the total variability space cannot adequately estimate the probability distribution of the speech, which degrades speaker verification performance. To address this problem, a method for enhancing speaker identity vectors based on the total variability space and a time delay neural network (TDNN) is proposed. The aim is to learn the linear correlation between the total variability space and the TDNN, extract the speaker embeddings from both, project them onto a new space, and combine them into a new speaker supervector that carries enhanced speaker information. In the training phase, the total variability space and the TDNN are trained separately. A new set of speakers, unrelated to the enrollment and test speakers, is then assembled; i-vectors and x-vectors are extracted from it, and the projection matrices are obtained by canonical correlation analysis (CCA). In the enrollment and test phases, the embeddings of the enrollment and test speakers are extracted and mapped into the new space through the projection matrices, and the projected vectors are combined to enhance the speaker identity information. Experiments show that, with short enrollment and short test utterances, the fused vector yields a markedly lower equal error rate than both the baseline i-vector and the baseline x-vector.

Key words: total variability space, time delay neural network (TDNN), canonical correlation analysis (CCA), short utterance
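
The fusion step outlined in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`fit_cca`, `fuse`, `cosine_score`), the ridge term `reg`, the vector dimensions, and the synthetic stand-ins for i-vectors and x-vectors are all assumptions made for illustration. The abstract says only that the projected vectors are "combined"; simple concatenation and cosine scoring are assumed here.

```python
import numpy as np


def _inv_sqrt(cov):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(cov)
    return (vecs / np.sqrt(vals)) @ vecs.T


def fit_cca(ivecs, xvecs, n_components, reg=1e-3):
    """Estimate CCA projections from paired i-vectors and x-vectors.

    ivecs: (n_utts, d_i), xvecs: (n_utts, d_x); row k of each array must
    come from the same utterance of the held-out speaker set.
    """
    mu_i, mu_x = ivecs.mean(axis=0), xvecs.mean(axis=0)
    a, b = ivecs - mu_i, xvecs - mu_x
    n = a.shape[0]
    # Covariances with a small ridge term (an assumption, for stability).
    c_ii = a.T @ a / (n - 1) + reg * np.eye(a.shape[1])
    c_xx = b.T @ b / (n - 1) + reg * np.eye(b.shape[1])
    c_ix = a.T @ b / (n - 1)
    w_i, w_x = _inv_sqrt(c_ii), _inv_sqrt(c_xx)
    # SVD of the whitened cross-covariance gives the canonical directions;
    # the singular values are the (regularised) canonical correlations.
    u, corr, vt = np.linalg.svd(w_i @ c_ix @ w_x)
    return {
        "mu_i": mu_i,
        "mu_x": mu_x,
        "proj_i": w_i @ u[:, :n_components],
        "proj_x": w_x @ vt.T[:, :n_components],
        "corr": corr[:n_components],
    }


def fuse(model, ivec, xvec):
    """Project both embeddings into the CCA space and concatenate them."""
    z_i = (ivec - model["mu_i"]) @ model["proj_i"]
    z_x = (xvec - model["mu_x"]) @ model["proj_x"]
    return np.concatenate([z_i, z_x], axis=-1)


def cosine_score(enroll, test):
    """Cosine similarity between two fused vectors (assumed back end)."""
    return float(enroll @ test / (np.linalg.norm(enroll) * np.linalg.norm(test)))


# Synthetic demo: both views share a low-dimensional latent factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 4))
ivecs = latent @ rng.normal(size=(4, 10)) + 0.1 * rng.normal(size=(500, 10))
xvecs = latent @ rng.normal(size=(4, 12)) + 0.1 * rng.normal(size=(500, 12))

model = fit_cca(ivecs, xvecs, n_components=4)
enroll = fuse(model, ivecs[0], xvecs[0])
test = fuse(model, ivecs[1], xvecs[1])
print("canonical correlations:", np.round(model["corr"], 3))
print("trial score:", cosine_score(enroll, test))
```

In a real system, `ivecs` and `xvecs` would come from the separately trained total variability model and TDNN, and the projections would be fitted once on the held-out speaker set and then reused unchanged for every enrollment and test utterance.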