Journal of Frontiers of Computer Science and Technology

• Science Researches •

A Continual Disentanglement Generation for Speech Emotion Recognition

NING Meiling, QI Jiayin, LIANG Kuai, ZHANG Xun, CHEN Kaifan

  1. School of Statistics and Information Science, Shanghai University of International Business and Economics, Shanghai 201620, China
  2. College of Cyberspace Security (Huangpu), Guangzhou University, Guangzhou 510006, China
  3. Huangpu Research Institute, Guangzhou University, Guangzhou 510006, China
  4. Key Laboratory of Trustworthy Distributed Computing and Service, Beijing University of Posts and Telecommunications, Beijing 100084, China

Abstract: A continual disentanglement generation method for speech emotion recognition is proposed to address two problems in this field: speech models lack large amounts of labeled training data, and they cannot learn incrementally. The method effectively extracts the emotion information in the data and performs continual learning classification well. Firstly, a parallel selection disentangler is constructed, in which a spectrum selection module and a content selection module establish a connection between the spectral features and the content features of speech; emotion correlation coefficients are then computed to assign correlation weights to the disentangled data and generate fused feature data. Secondly, a second-order knowledge flow emotion classifier is constructed to prevent catastrophic forgetting and to fully exploit the generated speech emotion data: a custom L2 normalization layer is introduced, a custom continual speech emotion recognition network (CL-SER) is built, and a multilayer convolutional structure is used to process the speech emotion data and reduce model error. Finally, a task distillation loss and a task smoothing loss are used to optimize CL-SER, achieving cross-task knowledge transfer and improving the model's continual classification accuracy. On the IEMOCAP dataset, the quality of the generated data, the model's resistance to catastrophic forgetting, and its emotion classification performance are evaluated. The experimental results show that the proposed method performs well in terms of accuracy, forgetting rate, and unweighted average recall, and is more advantageous than other classical continual learning methods and speech emotion recognition methods.

Key words: speech emotion recognition, continual learning, disentanglement learning, variational autoencoder, knowledge transfer
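
To make two of the components mentioned in the abstract more concrete, the sketch below shows (1) a custom L2 normalization layer of the kind described for CL-SER, and (2) a task distillation loss in the common LwF-style form (KL divergence against a frozen previous-task model's soft targets), which is one standard way to realize cross-task knowledge transfer in continual learning. This is a minimal illustrative sketch, not the authors' released code: the class and function names, the learnable scale, and the temperature are assumptions, and the task smoothing loss is omitted because its exact definition is not given here.

```python
# Illustrative sketch only -- not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class L2Norm(nn.Module):
    """Project feature vectors onto the unit L2 sphere, with a learnable scale."""

    def __init__(self, scale: float = 1.0, eps: float = 1e-12):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(scale))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * F.normalize(x, p=2, dim=-1, eps=self.eps)


def task_distillation_loss(new_logits: torch.Tensor,
                           old_logits: torch.Tensor,
                           temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between the frozen previous-task model's softened outputs
    and the current model's predictions on the old classes (LwF-style)."""
    t = temperature
    old_prob = F.softmax(old_logits / t, dim=-1)
    new_log_prob = F.log_softmax(new_logits / t, dim=-1)
    return F.kl_div(new_log_prob, old_prob, reduction="batchmean") * (t * t)


# Usage example on random features standing in for fused speech emotion
# embeddings (batch of 8, 256-dim, 4 old emotion classes).
if __name__ == "__main__":
    feats = torch.randn(8, 256)
    head_old = nn.Linear(256, 4)   # stands in for the frozen previous-task head
    head_new = nn.Linear(256, 4)   # stands in for the current head
    z = L2Norm(scale=10.0)(feats)
    loss = task_distillation_loss(head_new(z), head_old(z).detach())
    print(float(loss))
```

In such a setup the distillation term would typically be weighted against the new-task classification loss; the paper's actual objective additionally includes the task smoothing loss described in the abstract.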
