Multi-level Channel Fusion Method for Speech Emotion Recognition

doi:10.3778/j.issn.1673-9418.2502050

Abstract

Abstract: Speech emotion recognition is key to the emotional cognitive ability of machines and is crucial for improving the quality of human-machine interaction. However, most existing studies focus on the analysis of shallow features and ignore the advantages of multi-feature fusion. In addition, the limited number of data samples affects the generalization capability of models, resulting in suboptimal accuracy in speech emotion recognition. To further improve this accuracy, a speech emotion recognition method based on data augmentation and multi-level channel fusion is proposed. Firstly, the original speech is augmented with three methods (Gaussian white noise, pitch shift, and mixed processing) to improve the model's robustness. Secondly, a multi-level parallel channel network based on the wav2vec 2.0 model and a CNN model is proposed. The first channel uses the wav2vec 2.0 model as the backbone network to learn deep representations of the speech data, which are then processed by a CNN with two convolutional layers. The second channel takes extracted shallow speech emotion features as input and uses a CNN with five convolutional layers to learn shallow representations, so that both the deep and the shallow representations of the speech data are analyzed more comprehensively. Finally, the representations output by the two channels are fused, forming a multi-level speech emotion feature system that combines deep and shallow features. The proposed model is tested on the RAVDESS and CASIA datasets, achieving accuracies of 94.38% and 98.75%, respectively. The experimental results validate the effectiveness of the proposed method.

Key words: speech emotion recognition, multi-level channel fusion, wav2vec 2.0, convolutional neural network (CNN)
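The three augmentation methods named in the abstract (Gaussian white noise, pitch shift, and mixed processing) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the 20 dB signal-to-noise ratio, and the 2-semitone shift are hypothetical choices. The pitch shift here is a naive resampling that also changes the signal duration, and "mixed processing" is assumed to mean applying the pitch shift followed by the noise.

```python
import numpy as np


def add_white_noise(signal, snr_db, rng):
    """Add Gaussian white noise scaled to an approximate target SNR (in dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def naive_pitch_shift(signal, semitones):
    """Shift pitch by resampling with linear interpolation.

    Assumption for illustration only: this simple approach also shortens or
    lengthens the signal; a duration-preserving pitch shift would need a
    time-stretching step as well.
    """
    factor = 2 ** (semitones / 12)          # frequency ratio for the shift
    n_out = int(round(len(signal) / factor))
    src_positions = np.arange(n_out) * factor
    return np.interp(src_positions, np.arange(len(signal)), signal)


def augment(signal, rng, snr_db=20.0, semitones=2.0):
    """Return the original signal plus three augmented variants."""
    shifted = naive_pitch_shift(signal, semitones)
    return {
        "original": signal,
        "noise": add_white_noise(signal, snr_db, rng),
        "pitch": shifted,
        # Assumed meaning of "mixed processing": pitch shift, then noise.
        "mixed": add_white_noise(shifted, snr_db, rng),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(16000) / 16000            # one second at 16 kHz
    tone = np.sin(2 * np.pi * 220 * t)      # synthetic stand-in for speech
    for name, wave in augment(tone, rng).items():
        print(name, wave.shape)
```

Under these assumptions each utterance yields four training samples, one original and three augmented, which is one way a small corpus could be enlarged before training.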

ZHANG Limin, LI Yang, CAI Hao, YAN Hao. Multi-level Channel Fusion Method for Speech Emotion Recognition[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(8): 2219-2228.
