[1] JIANG H, HU B, LIU Z, et al. Investigation of different speech types and emotions for detecting depression using different classifiers[J]. Speech Communication, 2017, 90: 39-46.
[2] DESCHAMPS-BERGER T, LAMEL L, DEVILLERS L. Investigating transformer encoders and fusion strategies for speech emotion recognition in emergency call center conversations[C]//Proceedings of the 2022 International Conference on Multimodal Interaction, Bengaluru, Nov 7-11, 2022: 144-153.
[3] DISSANAYAKE V, ZHANG H, BILLINGHURST M, et al. Speech emotion recognition ‘in the wild’ using an auto-encoder[C]//Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, Oct 25-29, 2020: 526-530.
[4] HOSSAIN M S, MUHAMMAD G, SONG B, et al. Audio-visual emotion-aware cloud gaming framework[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2015, 25(12): 2105-2118.
[5] BANDELA S R, KUMAR T K. Stressed speech emotion recognition using feature fusion of Teager energy operator and MFCC[C]//Proceedings of the 2017 8th International Conference on Computing, Communication and Networking Technologies. Piscataway: IEEE, 2017: 1-5.
[6] EL AYADI M, KAMEL M S, KARRAY F. Survey on speech emotion recognition: features, classification schemes, and databases[J]. Pattern Recognition, 2011, 44(3): 572-587.
[7] ZHAO X M, YANG Y J, ZHANG S Q. Survey of deep learning based multimodal emotion recognition[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(7): 1479-1503.
[8] LIU Z T, XU J P, WU M, et al. Review of emotional feature extraction and dimension reduction method for speech emotion recognition[J]. Chinese Journal of Computers, 2018, 41(12): 2833-2851.
[9] HAN W J, LI H F, RUAN H B, et al. Review on speech emotion recognition[J]. Journal of Software, 2014, 25(1): 37-50.
[10] ZHENG C J, WANG C L, JIA N. Survey of acoustic feature extraction in speech tasks[J]. Computer Science, 2020, 47(5): 110-119.
[11] SCHMIDHUBER J. Deep learning in neural networks: an overview[J]. Neural Networks, 2015, 61: 85-117.
[12] GU J, WANG Z, KUEN J, et al. Recent advances in convolutional neural networks[J]. Pattern Recognition, 2018, 77: 354-377.
[13] GUIZZO E, WEYDE T, SCARDAPANE S, et al. Learning speech emotion representations in the quaternion domain[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31: 1200-1212.
[14] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Washington: IEEE Computer Society, 2016: 770-778.
[15] YU Y, SI X, HU C, et al. A review of recurrent neural networks: LSTM cells and network architectures[J]. Neural Computation, 2019, 31(7): 1235-1270.
[16] LIU Z, KANG X, REN F. Dual-TBNet: improving the robustness of speech features via dual-Transformer-BiLSTM for speech emotion recognition[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31: 2193-2203.
[17] HU J, LIU Y, ZHAO J, et al. MMGCN: multimodal fusion via deep graph convolution network for emotion recognition in conversation[EB/OL]. [2023-12-03]. https://arxiv.org/abs/2107.06779.
[18] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems 30, Long Beach, Dec 4-9, 2017: 5998-6008.
[19] LIANG J, LI R, JIN Q. Semi-supervised multi-modal emotion recognition with cross-modal distribution matching[C]//Proceedings of the 28th ACM International Conference on Multimedia. New York: ACM, 2020: 2852-2861.
[20] LIAN Z, LIU B, TAO J. CTNet: conversational transformer network for emotion recognition[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 985-1000.
[21] CHUDASAMA V, KAR P, GUDMALWAR A, et al. M2FNet: multi-modal fusion network for emotion recognition in conversation[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 4652-4661.
[22] ZHAO J, MAO X, CHEN L. Speech emotion recognition using deep 1D & 2D CNN LSTM networks[J]. Biomedical Signal Processing and Control, 2019, 47: 312-323.
[23] ZHANG S, ZHAO X, TIAN Q. Spontaneous speech emotion recognition using multiscale deep convolutional LSTM[J]. IEEE Transactions on Affective Computing, 2019, 13(2): 680-688.
[24] LI J, XIA H B, LIU Y. Dual features local-global attention model with BERT for aspect sentiment analysis[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(1): 205-216.
[25] AKÇAY M B, OĞUZ K. Speech emotion recognition: emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers[J]. Speech Communication, 2020, 116: 56-76.
[26] DE LOPE J, GRAÑA M. An ongoing review of speech emotion recognition[J]. Neurocomputing, 2023, 528: 1-11.
[27] HOU M, ZHANG Z, LU G. Multi-modal emotion recognition with self-guided modality calibration[C]//Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2022: 4688-4692.
[28] FAN W, XU X, CAI B, et al. ISNet: individual standardization network for speech emotion recognition[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, 30: 1803-1814.
[29] HU D, HOU X, WEI L, et al. MM-DFN: multimodal dynamic fusion network for emotion recognition in conversations[C]//Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2022: 7037-7041.
[30] ILSE M, TOMCZAK J, WELLING M. Attention-based deep multiple instance learning[C]//Proceedings of the 35th International Conference on Machine Learning, Stockholmsmässan, Jul 10-15, 2018: 2132-2141.
[31] MAO S, CHING P C, LEE T. Deep learning of segment-level feature representation with multiple instance learning for utterance-level speech emotion recognition[C]//Proceedings of the 20th Annual Conference of the International Speech Communication Association, Graz, Sep 15-19, 2019: 1686-1690.
[32] FU C, LIU C, ISHI C T, et al. MAEC: multi-instance learning with an adversarial auto-encoder-based classifier for speech emotion recognition[C]//Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2021: 6299-6303.
[33] ZOU H, SI Y, CHEN C, et al. Speech emotion recognition with co-attention based multi-level acoustic information[C]//Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2022: 7367-7371.
[34] BUSSO C, BULUT M, LEE C C, et al. IEMOCAP: interactive emotional dyadic motion capture database[J]. Language Resources and Evaluation, 2008, 42: 335-359.
[35] PORIA S, HAZARIKA D, MAJUMDER N, et al. MELD: a multimodal multi-party dataset for emotion recognition in conversations[EB/OL]. [2023-12-03]. https://arxiv.org/abs/1810.02508.
[36] LI B, LI Y, ELICEIRI K W. Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning[C]//Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 14318-14328.
[37] WANG X, YAN Y, TANG P, et al. Revisiting multiple instance neural networks[J]. Pattern Recognition, 2018, 74: 15-24.
[38] LIU Y, GADEPALLI K, NOROUZI M, et al. Detecting cancer metastases on gigapixel pathology images[EB/OL]. [2023-12-03]. https://arxiv.org/abs/1703.02442.
[39] BAEVSKI A, ZHOU Y, MOHAMED A, et al. wav2vec 2.0: a framework for self-supervised learning of speech representations[C]//Advances in Neural Information Processing Systems 33, Dec 6-12, 2020: 12449-12460.
[40] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[C]//Advances in Neural Information Processing Systems 25, Lake Tahoe, Dec 3-6, 2012: 1106-1114.
[41] ZHANG H, MENG Y, ZHAO Y, et al. DTFD-MIL: double-tier feature distillation multiple instance learning for histopathology whole slide image classification[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 18802-18812.
[42] CAO Q, HOU M, CHEN B, et al. Hierarchical network based on the fusion of static and dynamic features for speech emotion recognition[C]//Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2021: 6334-6338.
[43] LIN T Y, GOYAL P, GIRSHICK R, et al. Focal loss for dense object detection[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision. Washington: IEEE Computer Society, 2017: 2980-2988.