Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (12): 3300-3310. DOI: 10.3778/j.issn.1673-9418.2402013

• Artificial Intelligence · Pattern Recognition •

Speech Emotion Recognition Using Two-Stage Multiple Instance Learning Networks

ZHANG Shiqing, CHEN Chen, ZHAO Xiaoming   

1. School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou 310000, China
  2. Institute of Intelligent Information Processing, Taizhou University, Taizhou, Zhejiang 318000, China
• Online: 2024-12-01    Published: 2024-11-29


Abstract: In speech emotion recognition (SER), speech signals of unequal length are usually divided into several equal-length segments, and the final emotion classification is obtained by averaging the prediction results over all segments. However, this treatment assumes that emotional expression is evenly distributed throughout the speech signal, which is inconsistent with real speech. To address this issue, this paper proposes an SER method using two-stage multiple instance learning networks. In the first stage, each utterance is regarded as a "bag" and segmented into equal-length pieces; a variety of acoustic features extracted from these segments serve as "instances". The instances are fed into the corresponding local acoustic feature encoders to learn deep feature representations, and a consistency-attention mechanism performs feature interaction and enhancement across the different representations. In the second stage, a hybrid aggregator based on multiple instance learning fuses instance predictions and instance features at the global scale to compute "bag"-level prediction scores. First, an instance distillation module filters out redundant instances carrying weak emotional information; the distilled instances are then combined into a pseudo bag, whose features are merged by an adaptive feature aggregation scheme and passed to a classifier to obtain bag-level predictions. Finally, the instance-level and bag-level predictions are combined through an adaptive decision aggregation scheme to obtain the final emotion result. The proposed method achieves recognition accuracies of 73.02% and 44.92% on the IEMOCAP and MELD public datasets, respectively, and the experimental results demonstrate its effectiveness.

Key words: speech emotion recognition, multiple instance learning, instance distillation, consistency-attention, aggregation
