Journal of Frontiers of Computer Science and Technology

Incorporating Dynamic Convolution and Attention Mechanism in Multilayer Perceptron for Speech Emotion Recognition

ZHANG Yumeng, ZHANG Xin, GAO Mou, ZHAO Hulin   

  1. School of Foreign Studies, University of International Business and Economics, Beijing 100105, China
  2. Statistics and Epidemiology Teaching and Research Section, Graduate School of the Chinese PLA General Hospital, Beijing 100853, China
  3. Department of Neurosurgery, First Medical Center, Chinese PLA General Hospital, Beijing 100853, China

Abstract: Speech emotion recognition technology infers a speaker's emotional state by analyzing speech signals, making human-computer interaction more natural and intelligent. However, existing models often overlook time-frequency semantic information, which limits recognition accuracy. To address this problem, a multi-layer perceptron model integrating dynamic convolution and an attention mechanism is proposed, significantly improving both emotion recognition accuracy and information utilization efficiency. First, the input speech signal is transformed into a Mel-spectrogram, which captures fine-grained signal variations and more closely reflects human auditory perception, laying the foundation for subsequent feature extraction. The Mel-spectrogram is then tokenized to reduce data complexity. Next, dynamic convolution and a split attention mechanism are employed to efficiently extract key time-frequency features. On one hand, dynamic convolution adapts to scale changes across time and frequency, improving the efficiency of feature capture; on the other hand, the split attention mechanism strengthens the model's focus on crucial information, effectively enhancing its feature representation capability. By combining the advantages of dynamic convolution and split attention, the proposed model extracts key acoustic features more fully, achieving more efficient and accurate emotion recognition. Experiments on the RAVDESS, EmoDB, and CASIA speech emotion databases show that the recognition accuracy of the proposed model significantly surpasses that of existing methods, reaching 86.11%, 95.33%, and 82.92%, respectively. These results verify the efficiency and accuracy of the model on complex emotion recognition tasks, as well as the effectiveness of dynamic convolution and the attention mechanism.
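The abstract describes the front end only at a high level, so the following is a minimal sketch of the Mel-spectrogram and tokenization steps, assuming a librosa-based pipeline and non-overlapping 16x16 patch tokens. All parameter values (sample rate, FFT size, hop length, patch size) and the function name `wav_to_mel_tokens` are illustrative assumptions, not the paper's settings.

```python
# Hypothetical preprocessing sketch: waveform -> log-Mel spectrogram ->
# flattened patch tokens for an MLP-style backbone. Parameters are
# illustrative assumptions, not taken from the paper.
import numpy as np
import librosa

def wav_to_mel_tokens(path, sr=16000, n_mels=128, n_fft=1024,
                      hop_length=256, patch=16):
    y, _ = librosa.load(path, sr=sr)                       # waveform
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                     # (n_mels, frames)
    # Pad the time axis so it divides evenly into patch-sized tokens.
    pad = (-log_mel.shape[1]) % patch
    log_mel = np.pad(log_mel, ((0, 0), (0, pad)))
    # Cut into (patch x patch) tiles, then flatten each tile into a token.
    f_steps = n_mels // patch
    t_steps = log_mel.shape[1] // patch
    tokens = (log_mel.reshape(f_steps, patch, t_steps, patch)
                     .transpose(0, 2, 1, 3)
                     .reshape(f_steps * t_steps, patch * patch))
    return tokens                                          # (num_tokens, patch*patch)
```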
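The paper's dynamic convolution is not specified at code level; the sketch below follows the common "attention over convolution kernels" formulation, in which K candidate kernels are mixed with input-dependent softmax weights so that the effective filter adapts to each example's time-frequency content. The class name, kernel count, and temperature are assumptions.

```python
# Sketch of dynamic convolution: a lightweight router produces per-sample
# attention over K candidate kernels, which are mixed into one kernel and
# applied via a grouped convolution. Illustrative reconstruction only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, num_kernels=4,
                 temperature=30.0):
        super().__init__()
        self.temperature = temperature
        # K candidate kernels, aggregated per input sample.
        self.weight = nn.Parameter(
            torch.randn(num_kernels, out_ch, in_ch,
                        kernel_size, kernel_size) * 0.02)
        self.router = nn.Sequential(      # squeeze-and-route attention
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, num_kernels))

    def forward(self, x):
        b, c, h, w = x.shape
        # Softmax attention over the K kernels, one distribution per sample.
        attn = F.softmax(self.router(x) / self.temperature, dim=1)  # (B, K)
        # Mix candidate kernels into per-sample weights: (B, O, I, kh, kw).
        w_mix = torch.einsum('bk,koihw->boihw', attn, self.weight)
        # Grouped-conv trick: fold the batch into groups so each sample
        # is convolved with its own mixed kernel in a single conv2d call.
        out = F.conv2d(x.reshape(1, b * c, h, w),
                       w_mix.reshape(-1, c, *w_mix.shape[-2:]),
                       padding=self.weight.shape[-1] // 2, groups=b)
        return out.reshape(b, -1, h, w)
```

For example, `DynamicConv2d(64, 64)(torch.randn(2, 64, 32, 32))` returns a `(2, 64, 32, 32)` tensor, with each of the two samples filtered by its own kernel mixture.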
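Similarly, one plausible reading of the split attention mechanism is a ResNeSt-style block: the channel dimension is divided into several splits, and channel-wise softmax attention over the splits re-weights and fuses them. The sketch below encodes that interpretation, with illustrative layer sizes; the paper may differ in detail.

```python
# Sketch of a split-attention block: global pooling summarizes the fused
# splits, a small bottleneck predicts per-split channel attention, and a
# softmax across splits re-weights them before fusion. Assumed design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitAttention(nn.Module):
    def __init__(self, channels, radix=2, reduction=4):
        super().__init__()
        self.radix = radix
        inner = max(channels // reduction, 8)
        self.fc1 = nn.Conv2d(channels, inner, 1)           # bottleneck
        self.fc2 = nn.Conv2d(inner, channels * radix, 1)   # per-split logits

    def forward(self, x):                 # x: (B, radix*C, H, W)
        b, rc, h, w = x.shape
        c = rc // self.radix
        splits = x.view(b, self.radix, c, h, w)
        # Global descriptor of the summed splits: (B, C, 1, 1).
        gap = splits.sum(dim=1).mean(dim=(2, 3), keepdim=True)
        attn = self.fc2(F.relu(self.fc1(gap)))             # (B, radix*C, 1, 1)
        # Softmax across splits gives each channel a split-wise weighting.
        attn = F.softmax(attn.view(b, self.radix, c, 1, 1), dim=1)
        return (attn * splits).sum(dim=1)  # fused map: (B, C, H, W)
```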

Key words: speech emotion recognition, Mel-spectrogram, multi-layer perceptron, dynamic convolution, attention mechanism