Journal of Frontiers of Computer Science and Technology

• Academic Research •

Speech Emotion Recognition Using Two-stage Multiple Instance Learning Networks

ZHANG Shiqing, CHEN Chen, ZHAO Xiaoming   

  1. School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou 310000, China
    2. Institute of Intelligent Information Processing, Taizhou University, Taizhou, Zhejiang 318000, China

Abstract: In speech emotion recognition (SER), when processing speech signals of unequal lengths, each utterance is usually divided into several equal-length segments, and the final emotion classification is obtained by averaging the prediction results of all segments. However, such processing implicitly assumes that emotional expression is evenly distributed throughout the speech signal, which does not match reality. To address this issue, this paper proposes an SER method using two-stage multiple instance learning networks. In the first stage, each utterance is regarded as a "bag" and divided into several equal-length segments, each of which is treated as an "instance". Multiple acoustic features are extracted from each instance and fed into the corresponding local acoustic feature encoders to learn deep feature representations. A consistency-attention mechanism then performs feature interaction and enhancement across the different feature representations. In the second stage, a hybrid aggregator based on multiple instance learning is designed to fuse instance predictions and instance features at the global scale and compute bag-level prediction scores. First, an instance distillation module is proposed to filter out redundant instances with weak emotional information. The retained instances then form a pseudo bag, whose features are merged through an adaptive feature aggregation scheme and passed to a classifier to obtain the pseudo-bag prediction. Finally, the instance-level and pseudo-bag-level prediction results are combined through an adaptive decision aggregation scheme to obtain the final emotion classification. The proposed method achieves recognition accuracies of 73.02% and 44.92% on the public IEMOCAP and MELD datasets, respectively, and the experimental results demonstrate its effectiveness.
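To make the stage-one pipeline above concrete, the following is a minimal PyTorch-style sketch: an utterance ("bag") is split into equal-length segments ("instances"), each feature stream is encoded by its own local encoder, and a generic multi-head cross-attention stands in for the consistency-attention interaction. All names, dimensions, and the mean-pooling of frames within a segment are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


def split_into_instances(utterance: torch.Tensor, seg_len: int) -> torch.Tensor:
    """Split a (time, feat) utterance into equal-length segments ("instances").

    Frames beyond the last full segment are dropped; padding would be an
    equally valid choice. Each instance is summarized by mean pooling here.
    """
    t = (utterance.shape[0] // seg_len) * seg_len
    return utterance[:t].reshape(-1, seg_len, utterance.shape[1]).mean(dim=1)


class InstanceEncoder(nn.Module):
    """A stand-in local acoustic feature encoder for one feature stream."""

    def __init__(self, in_dim: int, hid_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, hid_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)  # (num_instances, in_dim) -> (num_instances, hid_dim)


class CrossStreamAttention(nn.Module):
    """Generic cross-attention between two feature streams, standing in for
    the consistency-attention interaction described in the abstract."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Stream a attends to stream b; a residual keeps the original features.
        out, _ = self.attn(a.unsqueeze(0), b.unsqueeze(0), b.unsqueeze(0))
        return a + out.squeeze(0)


if __name__ == "__main__":
    utt = torch.randn(400, 40)                    # 400 frames, 40-dim features
    mfcc = split_into_instances(utt, seg_len=50)  # (8, 40): 8 instances per bag
    spec = split_into_instances(utt, seg_len=50)  # a second feature stream
    fa = InstanceEncoder(40)(mfcc)                # (8, 128)
    fb = InstanceEncoder(40)(spec)                # (8, 128)
    fused = CrossStreamAttention()(fa, fb)        # (8, 128) enhanced instances
    print(fused.shape)
```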

Key words: speech emotion recognition, multiple instance learning, instance distillation, consistency-attention, aggregation
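The second-stage hybrid aggregator described in the abstract can be sketched in the same way. The minimal sketch below uses a top-k confidence rule for instance distillation, a small attention network for adaptive feature aggregation over the pseudo bag, and a learnable gate for adaptive decision aggregation; these concrete choices (the class name `HybridMILAggregator`, the top-k rule, the gating form) are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn


class HybridMILAggregator(nn.Module):
    """Illustrative hybrid aggregator: instance distillation, pseudo-bag
    feature aggregation, and gated decision aggregation."""

    def __init__(self, dim: int = 128, num_classes: int = 4, top_k: int = 4):
        super().__init__()
        self.top_k = top_k
        self.inst_clf = nn.Linear(dim, num_classes)   # instance-level classifier
        self.attn = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(),
                                  nn.Linear(64, 1))   # aggregation weights
        self.bag_clf = nn.Linear(dim, num_classes)    # pseudo-bag classifier
        self.gate = nn.Parameter(torch.tensor(0.5))   # decision-fusion weight

    def forward(self, inst_feats: torch.Tensor) -> torch.Tensor:
        # inst_feats: (num_instances, dim) embeddings of one utterance ("bag").
        inst_logits = self.inst_clf(inst_feats)                  # (n, C)
        # Instance distillation: score each instance by its most confident
        # emotion and keep the top-k, filtering weakly emotional instances.
        scores = inst_logits.softmax(dim=-1).max(dim=-1).values  # (n,)
        k = min(self.top_k, inst_feats.shape[0])
        keep = scores.topk(k).indices
        pseudo_bag, kept_logits = inst_feats[keep], inst_logits[keep]
        # Adaptive feature aggregation over the pseudo bag.
        w = self.attn(pseudo_bag).softmax(dim=0)                 # (k, 1)
        bag_logits = self.bag_clf((w * pseudo_bag).sum(dim=0))   # (C,)
        # Adaptive decision aggregation of the two prediction levels.
        g = torch.sigmoid(self.gate)
        return g * bag_logits + (1 - g) * kept_logits.mean(dim=0)


if __name__ == "__main__":
    feats = torch.randn(8, 128)             # 8 instance embeddings from stage one
    logits = HybridMILAggregator()(feats)   # (4,) utterance-level emotion scores
    print(logits.softmax(dim=-1))
```

The learnable gate lets the model decide how much to trust the pseudo-bag prediction versus the average of the retained instance predictions, which mirrors the adaptive decision aggregation idea at a high level.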