Incorporating Dynamic Convolution and Attention Mechanism in Multilayer Perceptron for Speech Emotion Recognition

doi:10.3778/j.issn.1673-9418.2406008

Abstract

Abstract: Speech emotion recognition infers a speaker's emotional state by analyzing the speech signal, making human-computer interaction more natural and intelligent. However, existing models often overlook time-frequency semantic information, which limits recognition accuracy. To address this problem, a multi-layer perceptron model that integrates dynamic convolution and attention mechanisms is proposed, significantly improving both recognition accuracy and the efficiency of information utilization. First, the input speech signal is transformed into a Mel-spectrogram, which captures fine-grained signal variations and more closely reflects human perception of sound, laying a foundation for subsequent feature extraction. The Mel-spectrogram is then tokenized to reduce data complexity. Next, dynamic convolution and a split attention mechanism are used to extract key time-frequency features efficiently. Dynamic convolution adapts to scale changes along the time and frequency axes, making feature capture more efficient, while the split attention mechanism strengthens the model's focus on crucial information and improves its feature representation capability. By combining the advantages of the two, the proposed model extracts key acoustic features more fully and achieves more efficient and accurate emotion recognition. Experiments on the RAVDESS, EmoDB, and CASIA speech emotion databases show that the proposed model significantly outperforms existing methods, reaching recognition accuracies of 86.11%, 95.33%, and 82.92%, respectively. These results verify the effectiveness of the proposed model on complex emotion recognition tasks, as well as the contribution of dynamic convolution and the attention mechanism.

Key words: speech emotion recognition, Mel-spectrogram, multi-layer perceptron, dynamic convolution, attention mechanism
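
The first two stages named in the abstract, Mel-spectrogram conversion and tokenization, can be illustrated with a minimal pure-Python sketch. It assumes the widely used HTK-style Mel formula, triangular filters, a Hann window, and non-overlapping rectangular patches as tokens. The frame length, hop size, number of Mel bands, patch sizes, and all function names are illustrative assumptions, not the authors' settings, and the dynamic convolution and split attention stages are omitted.

```python
import math

def hz_to_mel(f_hz):
    """HTK-style Mel scale: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters evenly spaced on the Mel scale (rows: filters, cols: FFT bins)."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 edge frequencies from 0 Hz to the Nyquist frequency
    edges = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank

def power_spectrum(frame):
    """Squared DFT magnitude scaled by 1/n, via a naive O(n^2) DFT (fine for a sketch)."""
    n = len(frame)
    out = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        out.append((re * re + im * im) / n)
    return out

def log_mel_spectrogram(signal, sample_rate, n_fft=64, hop=32, n_mels=8):
    """Return a list of frames, each a list of n_mels log Mel-band energies."""
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / n_fft) for t in range(n_fft)]  # Hann
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + t] * window[t] for t in range(n_fft)]
        spec = power_spectrum(frame)
        mel = [sum(w * p for w, p in zip(row, spec)) for row in bank]
        frames.append([math.log(e + 1e-10) for e in mel])  # small floor avoids log(0)
    return frames

def tokenize(spec, patch_t, patch_f):
    """Split a (time, Mel) grid into non-overlapping patches; each flattened patch is one token."""
    n_t = len(spec) // patch_t
    n_f = len(spec[0]) // patch_f
    tokens = []
    for i in range(n_t):
        for j in range(n_f):
            tokens.append([spec[i * patch_t + a][j * patch_f + b]
                           for a in range(patch_t) for b in range(patch_f)])
    return tokens

# Toy input: a 440 Hz tone, 512 samples at 16 kHz (hypothetical values)
sample_rate = 16000
signal = [math.sin(2 * math.pi * 440.0 * t / sample_rate) for t in range(512)]
spec = log_mel_spectrogram(signal, sample_rate)
tokens = tokenize(spec, patch_t=3, patch_f=4)
print(len(spec), len(spec[0]), len(tokens), len(tokens[0]))
```

With these toy settings the spectrogram has 15 frames of 8 Mel bands, and tokenization yields 10 tokens of 12 values each, so a later model stage would operate on 10 tokens rather than 120 individual time-frequency cells. A practical implementation would normally use an FFT library instead of the naive DFT shown here.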

ZHANG Yumeng, ZHANG Xin, GAO Mou, ZHAO Hulin. Incorporating Dynamic Convolution and Attention Mechanism in Multilayer Perceptron for Speech Emotion Recognition[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(4): 1065-1075.
