Journal of Frontiers of Computer Science and Technology ›› 2025, Vol. 19 ›› Issue (8): 2219-2228. DOI: 10.3778/j.issn.1673-9418.2502050

• Artificial Intelligence · Pattern Recognition •


Multi-level Channel Fusion Method for Speech Emotion Recognition

ZHANG Limin, LI Yang, CAI Hao, YAN Hao   

  1. Key Laboratory for Artificial Intelligence and Cognitive Neuroscience of Language, Xi'an International Studies University, Xi'an 710128, China
  • Online:2025-08-01 Published:2025-07-31



Abstract: Speech emotion recognition is key to machines' emotional cognition and crucial for improving the quality of human-machine interaction. However, most existing studies focus on analyzing shallow features and overlook the advantages of multi-feature fusion; moreover, limited data samples hurt models' generalization capability, resulting in suboptimal recognition accuracy. To further improve the accuracy of speech emotion recognition, a method based on data augmentation and multi-level channel fusion is proposed. Firstly, the original speech is augmented in three ways, adding Gaussian white noise, pitch shifting, and mixed processing, to improve the model's robustness. Secondly, a multi-level parallel channel network based on the wav2vec 2.0 model and the CNN model is proposed. The first channel uses the wav2vec 2.0 model as a backbone network to learn deep representations of the speech data, followed by a two-layer convolutional CNN. The second channel takes shallow speech emotion features as input and uses a five-layer convolutional CNN to learn shallow representations, enabling a more comprehensive analysis of both the deep and the shallow representations of the speech data. Finally, the representations output by the two channels are fused, forming a multi-level speech emotion feature system that combines deep and shallow features. The proposed model is tested on the RAVDESS and CASIA datasets, achieving accuracies of 94.38% and 98.75%, respectively; the experimental results validate the effectiveness of the proposed method.
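
To make the described pipeline concrete, the following is a minimal PyTorch sketch of the two-channel design outlined in the abstract. It is an illustration under stated assumptions, not the authors' published configuration: the pretrained checkpoint name, the choice of MFCCs as the shallow features, the interpretation of "mixed processing" as noise plus pitch shift applied together, and all layer widths and the class name DualChannelSER are assumptions introduced here.

# Minimal sketch of the augmentation step and the two-channel network.
# All hyperparameters below are illustrative assumptions, not the paper's settings.
import torch
import torch.nn as nn
import torchaudio
from transformers import Wav2Vec2Model

SAMPLE_RATE = 16000

# --- Data augmentation: Gaussian white noise, pitch shift, and a "mixed"
# variant (assumed here to mean both transforms applied together) ---
def add_gaussian_noise(wave, noise_level=0.005):
    return wave + noise_level * torch.randn_like(wave)

def pitch_shift(wave, n_steps=2):
    return torchaudio.transforms.PitchShift(SAMPLE_RATE, n_steps=n_steps)(wave)

def mixed(wave):
    return add_gaussian_noise(pitch_shift(wave))

# --- Two-channel network: deep wav2vec 2.0 channel + shallow-feature CNN channel ---
class DualChannelSER(nn.Module):
    def __init__(self, n_classes=8, n_mfcc=40):  # RAVDESS has 8 emotion classes
        super().__init__()
        # Channel 1: pretrained wav2vec 2.0 backbone (768-dim hidden states for
        # the base model), followed by a two-layer 1-D CNN.
        self.backbone = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        self.deep_cnn = nn.Sequential(
            nn.Conv1d(768, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(256, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # Channel 2: shallow acoustic features (MFCCs, as an assumption)
        # processed by a five-layer 1-D CNN.
        self.mfcc = torchaudio.transforms.MFCC(sample_rate=SAMPLE_RATE, n_mfcc=n_mfcc)
        layers, ch = [], n_mfcc
        for out_ch in (64, 64, 128, 128, 128):  # five convolutional layers
            layers += [nn.Conv1d(ch, out_ch, kernel_size=3, padding=1), nn.ReLU()]
            ch = out_ch
        layers += [nn.AdaptiveAvgPool1d(1)]
        self.shallow_cnn = nn.Sequential(*layers)
        # Fusion: concatenate the two channel embeddings, then classify.
        self.classifier = nn.Linear(128 + 128, n_classes)

    def forward(self, wave):                                      # wave: (B, samples)
        deep = self.backbone(wave).last_hidden_state              # (B, T, 768)
        deep = self.deep_cnn(deep.transpose(1, 2)).squeeze(-1)    # (B, 128)
        shallow = self.shallow_cnn(self.mfcc(wave)).squeeze(-1)   # (B, 128)
        return self.classifier(torch.cat([deep, shallow], dim=-1))

# Usage with dummy input: two 3-second clips -> logits of shape (2, 8).
model = DualChannelSER()
logits = model(torch.randn(2, SAMPLE_RATE * 3))

A real training run would additionally add the augmented copies to the training set and optimize a cross-entropy loss over the emotion labels; those details are omitted from this sketch.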

Key words: speech emotion recognition, multi-level channel fusion, wav2vec 2.0, convolutional neural network (CNN)