Journal of Frontiers of Computer Science and Technology ›› 2022, Vol. 16 ›› Issue (7): 1479-1503. DOI: 10.3778/j.issn.1673-9418.2112081
Survey of Deep Learning Based Multimodal Emotion Recognition

ZHAO Xiaoming1,2, YANG Yijiao1, ZHANG Shiqing2

Received: 2021-12-20
Revised: 2022-02-14
Online: 2022-07-01
Published: 2022-03-09

About the author: ZHAO Xiaoming (1964—), male, born in Linhai, Zhejiang, China, M.S., professor. His main research interests include audio and image processing, machine learning and pattern recognition.
Abstract: Multimodal emotion recognition refers to recognizing human emotional states from different modalities of information related to emotional expression, such as speech, vision and text. This research is of great significance to human-computer interaction, artificial intelligence, affective computing and related fields, and has attracted wide attention from researchers. Given the great success achieved by deep learning methods on various tasks in recent years, a variety of deep neural networks have been used to learn high-level emotional feature representations for multimodal emotion recognition. To systematically summarize the current state of deep learning research on multimodal emotion recognition, this paper analyzes and synthesizes the recent literature on deep learning based multimodal emotion recognition. First, the general framework of multimodal emotion recognition is presented, and commonly used multimodal emotion datasets are introduced. Then, the principles and progress of representative deep learning techniques are briefly reviewed. On this basis, the paper focuses on the research progress of two key steps in multimodal emotion recognition: emotional feature extraction methods for the different modalities, such as speech, vision and text, covering both hand-crafted and deep features; and multimodal information fusion strategies that integrate the information from different modalities. Finally, the challenges and opportunities in this field are analyzed, and future development directions are pointed out.
ZHAO Xiaoming, YANG Yijiao, ZHANG Shiqing. Survey of Deep Learning Based Multimodal Emotion Recognition[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(7): 1479-1503.
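To make the surveyed pipeline concrete before the dataset and fusion tables, below is a minimal PyTorch-style sketch of the general framework described in the abstract: one encoder per modality (speech, visual, text), a fusion step, and an emotion classifier. The module names, feature dimensions (e.g. eGeMAPS-like 88-dimensional audio features, BERT-like 768-dimensional text embeddings) and the simple concatenation fusion are illustrative assumptions, not the design of any particular surveyed method.

```python
# Minimal sketch of a multimodal emotion recognition pipeline:
# per-modality encoders -> fusion -> classifier. Dimensions are assumptions.
import torch
import torch.nn as nn

class MultimodalEmotionNet(nn.Module):
    def __init__(self, audio_dim=88, visual_dim=512, text_dim=768,
                 hidden_dim=128, num_classes=6):
        super().__init__()
        # Modality-specific encoders map heterogeneous inputs (acoustic
        # descriptors, CNN face features, text embeddings) into one space.
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        self.visual_enc = nn.Sequential(nn.Linear(visual_dim, hidden_dim), nn.ReLU())
        self.text_enc = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())
        # Feature-level fusion by concatenation, then an emotion classifier.
        self.classifier = nn.Sequential(
            nn.Linear(3 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_classes))

    def forward(self, audio, visual, text):
        fused = torch.cat([self.audio_enc(audio),
                           self.visual_enc(visual),
                           self.text_enc(text)], dim=-1)
        return self.classifier(fused)  # emotion logits

# Usage with random utterance-level features for a batch of 4 samples.
model = MultimodalEmotionNet()
logits = model(torch.randn(4, 88), torch.randn(4, 512), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 6])
```

In the surveyed methods, the linear encoders are replaced by the CNN/LSTM/Transformer feature extractors reviewed in the paper, and the concatenation by one of the fusion strategies summarized in Table 2.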
Table 1 Multimodal emotional datasets
| Dataset | Year | Modalities | Brief description | Emotion labels |
| --- | --- | --- | --- | --- |
| eNTERFACE’05[28] | 2006 | audio, visual | 1 277 audio-visual samples from 42 participants of 14 different nationalities | anger, disgust, fear, happiness, sadness, surprise |
| RML[29] | 2008 | audio, visual | 720 audio-visual emotional expression samples, 8 participants | anger, disgust, fear, happiness, sadness, surprise |
| IEMOCAP[30] | 2008 | audio, visual, gesture, text | 10 039 utterances: average duration 4.5 s, average word count 11.4; 10 actors | neutral, happiness, sadness, anger, surprise, fear, disgust, frustration, excitement, etc.; dimensional labels: valence, arousal, dominance |
| AFEW[31] | 2012 | audio, visual | 1 426 video clips | anger, disgust, fear, happiness, sadness, surprise, neutral |
| BAUM-1s[32] | 2016 | audio, visual | 1 222 video samples, 31 Turkish subjects | happiness, anger, sadness, disgust, fear, surprise |
| CHEAVD[17] | 2016 | audio, visual | 140 min of spontaneous emotional segments from films, TV series and TV shows, 238 speakers | 26 non-prototypical emotional states; top 8: anger, happiness, sadness, worry, anxiety, surprise, disgust, neutral |
| CMU-MOSI[33] | 2016 | audio, visual, text | 2 199 opinion utterances, 93 speaker videos | negative, positive |
| RAMAS[34] | 2018 | audio, visual, gesture, physiological signals | about 7 h of high-quality close-up video recordings, 10 actors | anger, disgust, happiness, sadness, fear, surprise |
| RAVDESS[35] | 2018 | audio, visual | 60 speech trials and 44 song trials per actor, 24 actors | neutral, calm, happiness, sadness, anger, fear, disgust, surprise |
| CMU-MOSEI[36] | 2018 | audio, visual, text | 3 837 videos from more than 1 000 online YouTube speakers | happiness, sadness, anger, fear, disgust, surprise |
| MELD[37] | 2019 | audio, visual, text | multi-party dialogue clips from the TV series Friends | anger, disgust, fear, joy, neutral, sadness, surprise; positive, negative, neutral |
| CH-SIMS[38] | 2020 | audio, visual, text | 2 281 in-the-wild video clips | negative, weakly negative, neutral, weakly positive, positive |
| HEU-part1[39] | 2021 | visual, gesture | 19 004 video clips in total, split into two parts by data source, with 9 951 subjects overall | anger, boredom, confusion, disappointment, disgust, fear, happiness, neutral, sadness, surprise |
| HEU-part2[39] | 2021 | audio, visual, gesture | | |
Table 2 Multimodal information fusion methods
| Year | Authors | Modalities | Feature extraction | Fusion strategy | Classification/regression | Dataset | Results |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2020 | Huang et al.[139] | audio, visual | audio: eGeMAPS; visual: geometric features | model-level fusion (Transformer+LSTM) | fully connected layer | AVEC 2017 | CCC (arousal): 0.654; CCC (valence): 0.708 |
| 2020 | LIU Jingjing et al.[141] | audio, visual | audio: MFCC, Fbank, etc.; visual: distances between facial landmarks | feature-level, decision-level and model-level fusion (two-layer LSTM) | Softmax | eNTERFACE’05 | Acc (6-class): 74.40% |
| 2021 | Liu et al.[142] | audio, visual | audio: spectrogram + 2D-CNN; visual: VGG16 | model-level fusion (GapsGCN) | fully connected layer | eNTERFACE’05 | Acc (6-class): 80.83%; F1-score: 80.23% |
| 2021 | WANG Chuanyu et al.[143] | audio, visual | audio: DBM+LSTM; visual: LBPH+SAE+CNN | decision-level fusion | Softmax | CHEAVD | Acc (6-class): 74.90% |
| 2018 | Hazarika et al.[144] | audio, text | audio: MFCC, etc.; text: FastText+CNN | feature-level fusion (self-attention) | Softmax | IEMOCAP | Acc (4-class): 71.40%; F1-score: 71.30% |
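As a concrete illustration of the fusion strategies listed in Table 2, the sketch below contrasts feature-level (early) fusion, which joins modality representations before a single classifier, with decision-level (late) fusion, which combines the outputs of per-modality classifiers; all tensors, dimensions and the weight w are illustrative assumptions rather than values from any surveyed method. Model-level fusion (e.g. Transformer+LSTM or GapsGCN above) instead lets the intermediate representations of the modalities interact inside the network.

```python
# Feature-level vs. decision-level fusion, assuming per-modality features
# and classifier logits are already computed; all shapes are assumptions.
import torch
import torch.nn.functional as F

audio_feat = torch.randn(4, 128)   # e.g. from an MFCC/spectrogram encoder
visual_feat = torch.randn(4, 128)  # e.g. from a VGG16-style face encoder

# Feature-level (early) fusion: concatenate representations, then feed
# the joint vector to one classifier.
fused = torch.cat([audio_feat, visual_feat], dim=-1)  # shape (4, 256)

# Decision-level (late) fusion: classify each modality separately, then
# combine posterior probabilities, here with a fixed modality weight.
audio_logits = torch.randn(4, 6)   # stand-in per-modality classifier outputs
visual_logits = torch.randn(4, 6)
w = 0.6                            # hypothetical weight, tuned on validation data
probs = (w * F.softmax(audio_logits, dim=-1)
         + (1 - w) * F.softmax(visual_logits, dim=-1))
pred = probs.argmax(dim=-1)        # final 6-class emotion decision
```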
[1] DINO H I, ABDULRAZZAQ M B. Facial expression classification based on SVM, KNN and MLP classifiers[C]// Proceedings of the 2019 International Conference on Advanced Science and Engineering, Duhok, Apr 2-4, 2019. Piscataway: IEEE, 2019: 70-75.
[2] PERVEEN N, ROY D, CHALAVADI K M. Facial expression recognition in videos using dynamic kernels[J]. IEEE Transactions on Image Processing, 2020, 29: 8316-8325.
[3] SHRIVASTAVA V, RICHHARIYA V, RICHHARIYA V. Puzzling out emotions: a deep-learning approach to multimodal sentiment analysis[C]// Proceedings of the 2018 International Conference on Advanced Computation and Telecommunication, Bhopal, Dec 28-29, 2018. Piscataway: IEEE, 2018: 1-6.
[4] SCHERER K R. Psychological models of emotion[J]. The Neuropsychology of Emotion, 2000, 137(3): 137-162.
[5] AMMEN S, ALFARRAS M, HADI W. OFDM system performance enhancement using discrete wavelet transform and DSSS system over mobile channel[R]. Advances in Computer Science and Engineering, 2010: 142-147.
[6] LIANG J J, CHEN S Z, JIN Q. Semi-supervised multimodal emotion recognition with improved Wasserstein GANs[C]// Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, Lanzhou, Nov 18-21, 2019. Piscataway: IEEE, 2019: 695-703.
[7] AL-SULTAN M R, AMEEN S Y, ABDUALLAH W M. Real time implementation of stegofirewall system[J]. International Journal of Computing and Digital Systems, 2019, 8(5): 498-504.
[8] ZHANG Y Y, WANG Z R, DU J. Deep fusion: an attention guided factorized bilinear pooling for audio-video emotion recognition[C]// Proceedings of the 2019 International Joint Conference on Neural Networks, Budapest, Jul 14-19, 2019. Piscataway: IEEE, 2019: 1-8.
[9] CHEN J, LV Y, XU R, et al. Automatic social signal analysis: facial expression recognition using difference convolution neural network[J]. Journal of Parallel and Distributed Computing, 2019, 131: 97-102.
[10] GHALEB E, POPA M, ASTERIADIS S. Multimodal and temporal perception of audio-visual cues for emotion recognition[C]// Proceedings of the 2019 8th International Conference on Affective Computing and Intelligent Interaction, Cambridge, Sep 3-6, 2019. Piscataway: IEEE, 2019: 552-558.
[11] ABDULRAZZAQ M B, KHALAF K I. Handwritten numerals’ recognition in Kurdish language using double feature selection[C]// Proceedings of the 2019 2nd International Conference on Engineering Technology and Its Applications, Al-Najef, Aug 27-28, 2019. Piscataway: IEEE, 2019: 167-172.
[12] CAIHUA C. Research on multi-modal Mandarin speech emotion recognition based on SVM[C]// Proceedings of the 2019 IEEE International Conference on Power, Intelligent Computing and Systems, Shenyang, Jul 12-14, 2019. Piscataway: IEEE, 2019: 173-176.
[13] SCHULLER B W, VALSTAR M F, EYBEN F, et al. AVEC 2012: the continuous audio/visual emotion challenge[C]// Proceedings of the 2012 International Conference on Multimodal Interaction, Santa Monica, Oct 22-26, 2012. New York: ACM, 2012: 449-456.
[14] DHALL A, GOECKE R, JOSHI J, et al. Emotion recognition in the wild challenge 2013[C]// Proceedings of the 2013 International Conference on Multimodal Interaction, Sydney, Dec 9-13, 2013. New York: ACM, 2013: 509-516.
[15] STAPPEN L, BAIRD A, RIZOS G, et al. MuSe 2020 Challenge and Workshop: multimodal sentiment analysis, emotion target engagement and trustworthiness detection in real-life media: emotional car reviews in-the-wild[C]// Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-Life Media Challenge and Workshop, Seattle, Oct 16, 2020. New York: ACM, 2020: 35-44.
[16] STAPPEN L, MEßNER E M, CAMBRIA E, et al. MuSe 2021 challenge: multimodal emotion, sentiment, physiological-emotion, and stress detection[C]// Proceedings of the 2021 ACM International Conference on Multimedia, Oct 20-24, 2021. New York: ACM, 2021: 5706-5707.
[17] LI Y, TAO J H, SCHULLER B W, et al. MEC 2016: the multimodal emotion recognition challenge of CCPR 2016[C]// Proceedings of the 7th Chinese Conference on Pattern Recognition, Chengdu, Nov 5-7, 2016. Cham: Springer, 2016: 667-678.
[18] OBAID K B, ZEEBAREE S, AHMED O M. Deep learning models based on image classification: a review[J]. International Journal of Science and Business, 2020, 4(11): 75-81.
[19] ZHAO X, SHI X, ZHANG S. Facial expression recognition via deep learning[J]. IETE Technical Review, 2015, 32(5): 347-355.
[20] SCHMIDHUBER J. Deep learning in neural networks: an overview[J]. Neural Networks, 2015, 61: 85-117.
[21] HINTON G E, OSINDERO S, TEH Y W. A fast learning algorithm for deep belief nets[J]. Neural Computation, 2006, 18(7): 1527-1554.
[22] LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
[23] ELMAN J L. Finding structure in time[J]. Cognitive Science, 1990, 14(2): 179-211.
[24] D’MELLO S K, KORY J. A review and meta-analysis of multimodal affect detection systems[J]. ACM Computing Surveys, 2015, 47(3): 1-36.
[25] RISH I. An empirical study of the naive Bayes classifier[C]// Proceedings of the 2001 Workshop on Empirical Methods in Artificial Intelligence, Seattle, 2001: 41-46.
[26] KEERTHI S S, SHEVADE S K, BHATTACHARYYA C, et al. Improvements to Platt’s SMO algorithm for SVM classifier design[J]. Neural Computation, 2001, 13(3): 637-649.
[27] WINDEATT T. Accuracy/diversity and ensemble MLP classifier design[J]. IEEE Transactions on Neural Networks, 2006, 17(5): 1194-1211.
[28] MARTIN O, KOTSIA I, MACQ B, et al. The eNTERFACE’05 audio-visual emotion database[C]// Proceedings of the 22nd International Conference on Data Engineering Workshops, Atlanta, Apr 3-7, 2006. Washington: IEEE Computer Society, 2006: 8.
[29] WANG Y, GUAN L. Recognizing human emotional state from audiovisual signals[J]. IEEE Transactions on Multimedia, 2008, 10(5): 936-946.
[30] BUSSO C, BULUT M, LEE C C, et al. IEMOCAP: interactive emotional dyadic motion capture database[J]. Language Resources and Evaluation, 2008, 42(4): 335-359.
[31] DHALL A, GOECKE R, LUCEY S, et al. Collecting large, richly annotated facial-expression databases from movies[J]. IEEE Multimedia, 2012, 19(3): 34-41.
[32] ZHALEHPOUR S, ONDER O, AKHTAR Z, et al. BAUM-1: a spontaneous audio-visual face database of affective and mental states[J]. IEEE Transactions on Affective Computing, 2016, 8(3): 300-313.
[33] ZADEH A, ZELLERS R, PINCUS E, et al. Multimodal sentiment intensity analysis in videos: facial gestures and verbal messages[J]. IEEE Intelligent Systems, 2016, 31(6): 82-88.
[34] PEREPELKINA O, KAZIMIROVA E, KONSTANTINOVA M. RAMAS: Russian multimodal corpus of dyadic interaction for affective computing[C]// LNCS 11096: Proceedings of the 20th International Conference on Speech and Computer, Leipzig, Sep 18-22, 2018. Cham: Springer, 2018: 501-510.
[35] LIVINGSTONE S R, RUSSO F A. The Ryerson audio-visual database of emotional speech and song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English[J]. PLoS One, 2018, 13(5): e0196391.
[36] ZADEH A B, LIANG P P, PORIA S, et al. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph[C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Jul 15-20, 2018. Stroudsburg: ACL, 2018: 2236-2246.
[37] PORIA S, HAZARIKA D, MAJUMDER N, et al. MELD: a multimodal multi-party dataset for emotion recognition in conversations[C]// Proceedings of the 57th Conference of the Association for Computational Linguistics, Florence, Jul 28-Aug 2, 2019. Stroudsburg: ACL, 2019: 527-536.
[38] YU W, XU H, MENG F, et al. CH-SIMS: a Chinese multimodal sentiment analysis dataset with fine-grained annotation of modality[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Jul 5-10, 2020. Stroudsburg: ACL, 2020: 3718-3727.
[39] CHEN J, WANG C H, WANG K J, et al. HEU emotion: a large-scale database for multimodal emotion recognition in the wild[J]. Neural Computing and Applications, 2021, 33(14): 8669-8685.
[40] DENG L, YU D. Deep learning: methods and applications[J]. Foundations and Trends in Signal Processing, 2014, 7(3/4): 197-387.
[41] FREUND Y, HAUSSLER D. Unsupervised learning of distributions of binary vectors using 2-layer networks[C]// Advances in Neural Information Processing Systems 4, Denver, Dec 2-5, 1991. San Mateo: Morgan Kaufmann, 1991: 912-919.
[42] BENGIO Y, LAMBLIN P, POPOVICI D, et al. Greedy layer-wise training of deep networks[C]// Proceedings of the 20th Annual Conference on Neural Information Processing Systems, Vancouver, Dec 4-7, 2006. Cambridge: MIT Press, 2007: 153-160.
[43] HINTON G E. Training products of experts by minimizing contrastive divergence[J]. Neural Computation, 2002, 14(8): 1771-1800.
[44] LEE H, GROSSE R B, RANGANATH R, et al. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations[C]// Proceedings of the 26th International Conference on Machine Learning, Montreal, Jun 14-18, 2009. New York: ACM, 2009: 609-616.
[45] WANG G M, QIAO J F, BI J, et al. TL-GDBN: growing deep belief network with transfer learning[J]. IEEE Transactions on Automation Science and Engineering, 2018, 16(2): 874-885.
[46] DENG W, LIU H L, XU J J, et al. An improved quantum-inspired differential evolution algorithm for deep belief network[J]. IEEE Transactions on Instrumentation and Measurement, 2020, 69(10): 7319-7327.
[47] LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
[48] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[C]// Proceedings of the 26th Annual Conference on Neural Information Processing Systems 2012, Lake Tahoe, Dec 3-6, 2012. Red Hook: Curran Associates, 2012: 1106-1114.
[49] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[J]. arXiv:1409.1556, 2014.
[50] SZEGEDY C, LIU W, JIA Y Q, et al. Going deeper with convolutions[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, Jun 7-12, 2015. Washington: IEEE Computer Society, 2015: 1-9.
[51] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun 27-30, 2016. Washington: IEEE Computer Society, 2016: 770-778.
[52] HUANG G, LIU Z, VAN DER MAATEN L, et al. Densely connected convolutional networks[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Washington: IEEE Computer Society, 2017: 4700-4708.
[53] TRAN D, BOURDEV L D, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]// Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Dec 7-13, 2015. Washington: IEEE Computer Society, 2015: 4489-4497.
[54] YANG H, YUAN C F, LI B, et al. Asymmetric 3D convolutional neural networks for action recognition[J]. Pattern Recognition, 2019, 85: 1-12.
[55] KUMAWAT S, RAMAN S. LP-3DCNN: unveiling local phase in 3D convolutional neural networks[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, Jun 16-20, 2019. Piscataway: IEEE, 2019: 4903-4912.
[56] CHEN H, WANG Y, SHU H, et al. Frequency domain compact 3D convolutional neural networks[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, Jun 13-19, 2020. Piscataway: IEEE, 2020: 1638-1647.
[57] WERBOS P J. Backpropagation through time: what it does and how to do it[J]. Proceedings of the IEEE, 1990, 78(10): 1550-1560.
[58] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[59] CHUNG J, GÜLÇEHRE Ç, CHO K, et al. Empirical evaluation of gated recurrent neural networks on sequence modeling[J]. arXiv:1412.3555, 2014.
[60] ZHAO R, WANG K, SU H, et al. Bayesian graph convolution LSTM for skeleton based action recognition[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway: IEEE, 2019: 6881-6891.
[61] ZHANG S, ZHAO X, TIAN Q. Spontaneous speech emotion recognition using multiscale deep convolutional LSTM[J]. IEEE Transactions on Affective Computing, 2019. DOI: 10.1109/TAFFC.2019.2947464.
[62] XING Y, DI CATERINA G, SORAGHAN J. A new spiking convolutional recurrent neural network (SCRNN) with applications to event-based hand gesture recognition[J]. Frontiers in Neuroscience, 2020, 14: 1143.
[63] GAO Q J, ZHAO Z H, XU D, et al. Review on speech emotion recognition research[J]. CAAI Transactions on Intelligent Systems, 2020, 15(1): 1-13.
[64] LIU Z T, XU J P, WU M, et al. Review of emotional feature extraction and dimension reduction method for speech emotion recognition[J]. Chinese Journal of Computers, 2018, 41(12): 2833-2851.
[65] HAN W J, LI H F, RUAN H B, et al. Review on speech emotion recognition[J]. Journal of Software, 2014, 25(1): 37-50.
[66] ZHENG C J, WANG C L, JIA N. Survey of acoustic feature extraction in speech tasks[J]. Computer Science, 2020, 47(5): 110-119.
[67] LISCOMBE J, VENDITTI J, HIRSCHBERG J B. Classifying subject ratings of emotional speech using acoustic features[C]// Proceedings of the 8th European Conference on Speech Communication and Technology, Geneva, Sep 1-4, 2003.
[68] YACOUB S M, SIMSKE S J, LIN X F, et al. Recognition of emotions in interactive voice response systems[C]// Proceedings of the 8th European Conference on Speech Communication and Technology, Geneva, Sep 1-4, 2003.
[69] SCHMITT M, RINGEVAL F, SCHULLER B W. At the border of acoustics and linguistics: bag-of-audio-words for the recognition of emotions in speech[C]// Proceedings of the 17th Annual Conference of the International Speech Communication Association, San Francisco, Sep 8-12, 2016: 495-499.
[70] SUN H Y, HUANG L X, ZHANG X Y, et al. Speech emotion recognition based on dual-channel convolutional gated recurrent network[J/OL]. Computer Engineering and Applications (2021-10-18) [2022-02-28]. https://kns.cnki.net/kcms/detail/11.2127.TP.20211015.2021.002.html.
[71] LUENGO I, NAVAS E, HERNÁEZ I. Feature analysis and evaluation for automatic emotion identification in speech[J]. IEEE Transactions on Multimedia, 2010, 12(6): 490-501.
[72] DUTTA K, SARMA K K. Multiple feature extraction for RNN-based Assamese speech recognition for speech to text conversion application[C]// Proceedings of the 2012 International Conference on Communications, Devices and Intelligent Systems, Kolkata, Dec 28-29, 2012. Piscataway: IEEE, 2013: 1-6.
[73] MAO Q, DONG M, HUANG Z, et al. Learning salient features for speech emotion recognition using convolutional neural networks[J]. IEEE Transactions on Multimedia, 2014, 16(8): 2203-2213.
[74] CHEN J, LI H F, MA L, et al. Multi-granularity feature fusion for dimensional speech emotion recognition[J]. Journal of Signal Processing, 2017, 33(3): 374-382.
[75] YU J J, JIN Y, MA Y, et al. Emotion recognition from raw speech based on Sinc-Transformer model[J]. Journal of Signal Processing, 2021, 37(10): 1880-1888.
[76] ZHANG S Q, ZHAO X M, CHUANG Y L, et al. Feature learning via deep belief network for Chinese speech emotion recognition[C]// Proceedings of the 7th Chinese Conference on Pattern Recognition, Chengdu, Nov 5-7, 2016. Cham: Springer, 2016: 645-651.
[77] OTTL S, AMIRIPARIAN S, GERCZUK M, et al. Group-level speech emotion recognition utilising deep spectrum features[C]// Proceedings of the 2020 International Conference on Multimodal Interaction, Oct 25-29, 2020. New York: ACM, 2020: 821-826.
[78] EYBEN F, WÖLLMER M, SCHULLER B W. OpenSMILE: the Munich versatile and fast open-source audio feature extractor[C]// Proceedings of the 18th International Conference on Multimedia 2010, Firenze, Oct 25-29, 2010. New York: ACM, 2010: 1459-1462.
[79] JIANG B, ZHONG R, ZHANG Q W, et al. Survey of non-frontal facial expression recognition by using deep learning methods[J]. Computer Engineering and Applications, 2021, 57(8): 48-61.
[80] LI S, DENG W H. Deep facial expression recognition: a survey[J]. Journal of Image and Graphics, 2020, 25(11): 2306-2320.
[81] MELLOUK W, HANDOUZI W. Facial emotion recognition using deep learning: review and insights[J]. Procedia Computer Science, 2020, 175: 689-694.
[82] ZHAO X, ZHANG S. A review on facial expression recognition: feature extraction and classification[J]. IETE Technical Review, 2016, 33(5): 505-517.
[83] CHEN J, LIU X, TU P, et al. Learning person-specific models for facial expression and action unit recognition[J]. Pattern Recognition Letters, 2013, 34(15): 1964-1970.
[84] ZHANG S, ZHAO X, LEI B. Facial expression recognition based on local binary patterns and local Fisher discriminant analysis[J]. WSEAS Transactions on Signal Processing, 2012, 8(1): 21-31.
[85] CHU W S, DE LA TORRE F, COHN J F. Selective transfer machine for personalized facial expression analysis[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 39(3): 529-545.
[86] BALTRUŠAITIS T, MAHMOUD M, ROBINSON P. Cross-dataset learning and person-specific normalisation for automatic action unit detection[C]// Proceedings of the 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, Ljubljana, May 4-8, 2015. Washington: IEEE Computer Society, 2015: 1-6.
[87] AHSAN T, JABID T, CHONG U P. Facial expression recognition using local transitional pattern on Gabor filtered facial images[J]. IETE Technical Review, 2013, 30(1): 47-52.
[88] LIU J, JING X J, SUN S L, et al. Feature of local Gabor spatial histogram based on dominant neighboring pixel for face recognition[J]. Journal of Beijing University of Posts and Telecommunications, 2015, 38(1): 51-54.
[89] BAH S M, MING F. An improved face recognition algorithm and its application in attendance management system[J]. Array, 2020, 5: 100014.
[90] DEEBA F, AHMED A, MEMON H, et al. LBPH-based enhanced real-time face recognition[J]. International Journal of Advanced Computer Science and Applications, 2019, 10(5): 274-280.
[91] ZHANG T, ZHENG W, CUI Z, et al. A deep neural network-driven feature learning method for multiview facial expression recognition[J]. IEEE Transactions on Multimedia, 2016, 18(12): 2528-2536.
[92] YEASIN M, BULLOT B, SHARMA R. From facial expression to level of interest: a spatio-temporal approach[C]// Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, Jun 27-Jul 2, 2004. Washington: IEEE Computer Society, 2004: 922-927.
[93] FAN X, TJAHJADI T. A spatial-temporal framework based on histogram of gradients and optical flow for facial expression recognition in video sequences[J]. Pattern Recognition, 2015, 48(11): 3407-3416.
[94] BOSCH A, ZISSERMAN A, MUÑOZ X. Representing shape with a spatial pyramid kernel[C]// Proceedings of the 6th ACM International Conference on Image and Video Retrieval, Amsterdam, Jul 9-11, 2007. New York: ACM, 2007: 401-408.
[95] LIU T, ZHOU X C, YAN X J. Facial expression recognition algorithm combining optical flow characteristics with Gaussian LDA[J]. Computer Science, 2018, 45(10): 286-290.
[96] HAPPY S, ROUTRAY A. Fuzzy histogram of optical flow orientations for micro-expression recognition[J]. IEEE Transactions on Affective Computing, 2017, 10(3): 394-406.
[97] SHAO J, DONG N. Spontaneous facial expression recognition based on RGB-D dynamic sequences[J]. Journal of Computer-Aided Design & Computer Graphics, 2015, 27(5): 847-854.
[98] YI J, CHEN A, CAI Z, et al. Facial expression recognition of intercepted video sequences based on feature point movement trend and feature block texture variation[J]. Applied Soft Computing, 2019, 82: 105540.
[99] YOLCU G, OZTEL I, KAZAN S, et al. Facial expression recognition for monitoring neurological disorders based on convolutional neural network[J]. Multimedia Tools and Applications, 2019, 78(22): 31581-31603.
[100] SUN N, LI Q, HUAN R, et al. Deep spatial-temporal feature fusion for facial expression recognition in static images[J]. Pattern Recognition Letters, 2019, 119: 49-61.
[101] ZHANG P, KONG W W, TENG J B. Facial expression recognition based on multi-scale feature attention mechanism[J]. Computer Engineering and Applications, 2022, 58(1): 182-189.
[102] SEPAS-MOGHADDAM A, ETEMAD S A, PEREIRA F, et al. Facial emotion recognition using light field images with deep attention-based bidirectional LSTM[C]// Proceedings of the IEEE 2020 International Conference on Acoustics, Speech and Signal Processing, Barcelona, May 4-8, 2020. Piscataway: IEEE, 2020: 3367-3371.
[103] CUI Z Y, PI J T, CHEN Y, et al. Facial expression recognition combined with improved VGGNet and Focal Loss[J]. Computer Engineering and Applications, 2021, 57(19): 171-178.
[104] ZHENG J, ZHENG C, LIU H, et al. Deep convolutional neural network fusing local feature and two-stage attention weight learning for facial expression recognition[J]. Application Research of Computers, 2022, 39(3): 889-894.
[105] JUNG H, LEE S, YIM J, et al. Joint fine-tuning in deep neural networks for facial expression recognition[C]// Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Dec 7-13, 2015. Washington: IEEE Computer Society, 2015: 2983-2991.
[106] JAISWAL S, VALSTAR M F. Deep learning the dynamic appearance and shape of facial action units[C]// Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision, Lake Placid, Mar 7-10, 2016. Washington: IEEE Computer Society, 2016: 1-8.
[107] FAN Y, LU X J, LI D, et al. Video-based emotion recognition using CNN-RNN and C3D hybrid networks[C]// Proceedings of the 18th ACM International Conference on Multimodal Interaction, Tokyo, Nov 12-16, 2016. New York: ACM, 2016: 445-450.
[108] KIM D H, BADDAR W J, JANG J, et al. Multi-objective based spatio-temporal feature representation learning robust to expression intensity variations for facial expression recognition[J]. IEEE Transactions on Affective Computing, 2017, 10(2): 223-236.
[109] YU Z, LIU G, LIU Q, et al. Spatio-temporal convolutional features with nested LSTM for facial expression recognition[J]. Neurocomputing, 2018, 317: 50-57.
[110] LIANG D, LIANG H, YU Z, et al. Deep convolutional BiLSTM fusion network for facial expression recognition[J]. The Visual Computer, 2020, 36(3): 499-508.
[111] SIMA Y, YI J Z, CHEN A B, et al. Fully expression frame localization and recognition based on dynamic face image sequences[J]. Journal of Applied Sciences, 2021, 39(3): 357-366.
[112] MENG D B, PENG X J, WANG K, et al. Frame attention networks for facial expression recognition in videos[C]// Proceedings of the 2019 IEEE International Conference on Image Processing, Taipei, China, Sep 22-25, 2019. Piscataway: IEEE, 2019: 3866-3870.
[113] PAN X, ZHANG S, GUO W, et al. Video-based facial expression recognition using deep temporal-spatial networks[J]. IETE Technical Review, 2020, 37(4): 402-409.
[114] SOUMYA G K, JOSEPH S. Text classification by augmenting bag of words (BOW) representation with co-occurrence feature[J]. IOSR Journal of Computer Engineering, 2014, 16(1): 34-38.
[115] ZHAO R, MAO K. Fuzzy bag-of-words model for document representation[J]. IEEE Transactions on Fuzzy Systems, 2017, 26(2): 794-804.
[116] TRSTENJAK B, MIKAC S, DONKO D. KNN with TF-IDF based framework for text categorization[J]. Procedia Engineering, 2014, 69: 1356-1364.
[117] KIM D, SEO D, CHO S, et al. Multi-co-training for document classification using various document representations: TFIDF, LDA, and Doc2Vec[J]. Information Sciences, 2019, 477: 15-29.
[118] DEERWESTER S C, DUMAIS S T, LANDAUER T K, et al. Indexing by latent semantic analysis[J]. Journal of the Association for Information Science & Technology, 1990, 41(6): 391-407.
[119] HOFMANN T. Probabilistic latent semantic analysis[C]// Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence, Stockholm, Jul 30-Aug 1, 1999. San Mateo: Morgan Kaufmann, 1999: 289-296.
[120] BLEI D M, NG A Y, JORDAN M I. Latent Dirichlet allocation[J]. Journal of Machine Learning Research, 2003, 3: 993-1022.
[121] DENG J W, REN F J. A survey of textual emotion recognition and its challenges[J]. IEEE Transactions on Affective Computing, 2021: 1.
[122] MIKOLOV T, SUTSKEVER I, CHEN K, et al. Distributed representations of words and phrases and their compositionality[C]// Advances in Neural Information Processing Systems 26, Lake Tahoe, Dec 5-8, 2013. Red Hook: Curran Associates, 2013: 3111-3119.
[123] PENNINGTON J, SOCHER R, MANNING C D. GloVe: global vectors for word representation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Oct 25-29, 2014. Stroudsburg: ACL, 2014: 1532-1543.
[124] PETERS M E, NEUMANN M, IYYER M, et al. Deep contextualized word representations[C]// Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, Jun 1-6, 2018. Stroudsburg: ACL, 2018: 2227-2237.
[125] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Advances in Neural Information Processing Systems 30, Dec 4-9, 2017. Red Hook: Curran Associates, 2017: 5998-6008.
[126] CHUNG Y A, GLASS J R. Generative pre-training for speech with autoregressive predictive coding[C]// Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, May 4-8, 2020. Piscataway: IEEE, 2020: 3497-3501.
[127] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Jun 2-7, 2019. Stroudsburg: ACL, 2019: 4171-4186.
[128] RADFORD A, WU J, CHILD R, et al. Language models are unsupervised multitask learners[J]. OpenAI Blog, 2019, 1(8): 9.
[129] BROWN T, MANN B, RYDER N, et al. Language models are few-shot learners[C]// Advances in Neural Information Processing Systems 33, Dec 6-12, 2020: 1877-1901.
[130] DAI Z H, YANG Z L, YANG Y M, et al. Transformer-XL: attentive language models beyond a fixed-length context[C]// Proceedings of the 57th Conference of the Association for Computational Linguistics, Florence, Jul 28-Aug 2, 2019. Stroudsburg: ACL, 2019: 2978-2988.
[131] YANG Z L, DAI Z H, YANG Y M, et al. XLNet: generalized autoregressive pretraining for language understanding[C]// Advances in Neural Information Processing Systems 32, Vancouver, Dec 8-14, 2019: 5754-5764.
[132] TANG D, WEI F, YANG N, et al. Learning sentiment-specific word embedding for twitter sentiment classification[C]// Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2014: 1555-1565.
[133] FELBO B, MISLOVE A, SØGAARD A, et al. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm[C]// Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Sep 9-11, 2017. Stroudsburg: ACL, 2017: 1615-1625.
[134] XU P, MADOTTO A, WU C S, et al. Emo2Vec: learning generalized emotion representation by multi-task training[C]// Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, Brussels, Oct 31, 2018. Stroudsburg: ACL, 2018: 292-298.
[135] SHI B, FU Z, BING L, et al. Learning domain-sensitive and sentiment-aware word embeddings[C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Jul 15-20, 2018. Stroudsburg: ACL, 2018: 2494-2504.
[136] ABDULLAH S M S A, AMEEN S Y A, SADEEQ M A, et al. Multimodal emotion recognition using deep learning[J]. Journal of Applied Science and Technology Trends, 2021, 2(2): 52-58.
[137] SHOUMY N J, ANG L M, SENG K P, et al. Multimodal big data affective analytics: a comprehensive survey using text, audio, visual and physiological signals[J]. Journal of Network and Computer Applications, 2020, 149: 102447.
[138] SUN Z, SONG Q, ZHU X, et al. A novel ensemble method for classifying imbalanced data[J]. Pattern Recognition, 2015, 48(5): 1623-1637.
[139] HUANG J, TAO J H, LIU B, et al. Multimodal transformer fusion for continuous emotion recognition[C]// Proceedings of the IEEE 2020 International Conference on Acoustics, Speech and Signal Processing, Barcelona, May 4-8, 2020. Piscataway: IEEE, 2020: 3507-3511.
[140] RINGEVAL F, SCHULLER B W, VALSTAR M F, et al. AVEC 2017: real-life depression, and affect recognition workshop and challenge[C]// Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, Mountain View, Oct 23-27, 2017. New York: ACM, 2017: 3-9.
[141] LIU J J, WU X F. Real-time multimodal emotion recognition and emotion space labeling using LSTM networks[J]. Journal of Fudan University (Natural Science), 2020, 59(5): 565-574.
[142] LIU J X, CHEN S, WANG L B, et al. Multimodal emotion recognition with capsule graph convolutional based representation fusion[C]// Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, Jun 6-11, 2021. Piscataway: IEEE, 2021: 6339-6343.
[143] WANG C Y, LI W X, CHEN Z H. Research of multi-modal emotion recognition based on voice and video images[J]. Computer Engineering and Applications, 2021, 57(23): 163-170.
[144] HAZARIKA D, GORANTLA S, PORIA S, et al. Self-attentive feature-level fusion for multimodal emotion detection[C]// Proceedings of the IEEE 1st Conference on Multimedia Information Processing and Retrieval, Miami, Apr 10-12, 2018. Piscataway: IEEE, 2018: 196-201.
[145] BOJANOWSKI P, GRAVE E, JOULIN A, et al. Enriching word vectors with subword information[J]. Transactions of the Association for Computational Linguistics, 2017, 5: 135-146.
[146] PRIYASAD D, FERNANDO T, DENMAN S, et al. Attention driven fusion for multi-modal emotion recognition[C]// Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, May 4-8, 2020. Piscataway: IEEE, 2020: 3227-3231.
[147] KRISHNA D N, PATIL A. Multimodal emotion recognition using cross-modal attention and 1D convolutional neural networks[C]// Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, Oct 25-29, 2020: 4243-4247.
[148] LIAN Z, LIU B, TAO J H. CTNet: conversational transformer network for emotion recognition[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 985-1000.
[149] WANG L X, WANG W Y, CHENG X. Bimodal emotion recognition model for speech-text based on Bi-LSTM-CNN[J]. Computer Engineering and Applications, 2022, 58(4): 192-197.
[150] MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space[J]. arXiv:1301.3781, 2013.
[151] PORIA S, CAMBRIA E, HAZARIKA D, et al. Multi-level multiple attentions for contextual multimodal sentiment analysis[C]// Proceedings of the 2017 IEEE International Conference on Data Mining, New Orleans, Nov 18-21, 2017. Washington: IEEE Computer Society, 2017: 1033-1038.
[152] PAN Z X, LUO Z J, YANG J C, et al. Multi-modal attention for speech emotion recognition[C]// Proceedings of the 21st Annual Conference of the International Speech Communication Association, Oct 25-29, 2020: 364-368.
[153] MITTAL T, BHATTACHARYA U, CHANDRA R, et al. M3ER: multiplicative multimodal emotion recognition using facial, textual, and speech cues[C]// Proceedings of the 34th AAAI Conference on Artificial Intelligence, the 32nd Innovative Applications of Artificial Intelligence Conference, the 10th AAAI Symposium on Educational Advances in Artificial Intelligence, New York, Feb 7-12, 2020. Menlo Park: AAAI, 2020: 1359-1367.
[154] SIRIWARDHANA S, KALUARACHCHI T, BILLINGHURST M, et al. Multimodal emotion recognition with transformer-based self supervised feature fusion[J]. IEEE Access, 2020, 8: 176274-176285.
[155] LIU Y, OTT M, GOYAL N, et al. RoBERTa: a robustly optimized BERT pretraining approach[J]. arXiv:1907.11692, 2019.
[156] ZADEH A, LIANG P P, PORIA S, et al. Multi-attention recurrent network for human communication comprehension[C]// Proceedings of the 32nd AAAI Conference on Artificial Intelligence, the 30th Innovative Applications of Artificial Intelligence, and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence, New Orleans, Feb 2-7, 2018. Menlo Park: AAAI, 2018: 5642-5649.
[157] MAI S J, HU H F, XU J, et al. Multi-fusion residual memory network for multimodal human sentiment comprehension[J]. IEEE Transactions on Affective Computing, 2022, 13(1): 320-334.
[158] MAAS A, DALY R E, PHAM P T, et al. Learning word vectors for sentiment analysis[C]// Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Jun 19-24, 2011. Stroudsburg: ACL, 2011: 142-150.
[159] WANG Z L, WAN Z H, WAN X J. TransModality: an End2End fusion method with transformer for multimodal sentiment analysis[C]// Proceedings of the Web Conference 2020, Taipei, China, Apr 20-24, 2020. New York: ACM, 2020: 2514-2520.
[160] DAI W L, CAHYAWIJAYA S, LIU Z H, et al. Multimodal end-to-end sparse model for emotion recognition[C]// Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Jun 6-11, 2021. Stroudsburg: ACL, 2021: 5305-5316.
[161] REN M, HUANG X, SHI X, et al. Interactive multimodal attention network for emotion recognition in conversation[J]. IEEE Signal Processing Letters, 2021, 28: 1046-1050.
[162] KHARE A, PARTHASARATHY S, SUNDARAM S. Self-supervised learning with cross-modal transformers for emotion recognition[C]// Proceedings of the 2021 IEEE Spoken Language Technology Workshop, Shenzhen, Jan 19-22, 2021. Piscataway: IEEE, 2021: 381-388.
[163] HE Y H, ZHANG X Y, SUN J. Channel pruning for accelerating very deep neural networks[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Oct 22-29, 2017. Washington: IEEE Computer Society, 2017: 1389-1397.
[164] LI H, KADAV A, DURDANOVIC I, et al. Pruning filters for efficient ConvNets[J]. arXiv:1608.08710, 2016.
[165] ESCALANTE H J, KAYA H, SALAH A A, et al. Modeling, recognizing, and explaining apparent personality from videos[J]. IEEE Transactions on Affective Computing, 2020: 1.
[166] ANGELOV P, SOARES E. Towards explainable deep neural networks (xDNN)[J]. Neural Networks, 2020, 130: 185-194.
[167] GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets[C]// Advances in Neural Information Processing Systems 27, Montreal, Dec 8-13, 2014. Red Hook: Curran Associates, 2014: 2672-2680.
[168] MAKHZANI A, SHLENS J, JAITLY N, et al. Adversarial autoencoders[J]. arXiv:1511.05644, 2015.
[169] MAI S J, HU H F, XING S L. Modality to modality translation: an adversarial representation learning and graph fusion network for multimodal fusion[C]// Proceedings of the 34th AAAI Conference on Artificial Intelligence, the 32nd Innovative Applications of Artificial Intelligence Conference, the 10th AAAI Symposium on Educational Advances in Artificial Intelligence, New York, Feb 7-12, 2020. Menlo Park: AAAI, 2020: 164-172.
[170] WANG Z M, ZHAO Y P, ZHENG R L, et al. A survey of research on EEG signal emotion recognition[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(4): 760-774.
[171] YANG C J, FAHIER N, LI W C, et al. A convolution neural network based emotion recognition system using multimodal physiological signals[C]// Proceedings of the 2020 IEEE International Conference on Consumer Electronics, Taoyuan, China, Sep 28-30, 2020. Piscataway: IEEE, 2020: 1-2.
[172] WU J, ZHANG Y, ZHAO X, et al. A generalized zero-shot framework for emotion recognition from body gestures[J]. arXiv:2010.06362, 2020.
[173] GAO J, LI P, CHEN Z K, et al. A survey on deep learning for multimodal data fusion[J]. Neural Computation, 2020, 32(5): 829-864.
|||||