Journal of Frontiers of Computer Science and Technology ›› 2022, Vol. 16 ›› Issue (7): 1479-1503. DOI: 10.3778/j.issn.1673-9418.2112081

• Survey · Exploration •

  • About the authors: ZHAO Xiaoming, born in 1964 in Linhai, Zhejiang, M.S., professor. His research interests include audio and image processing, machine learning, pattern recognition, etc.
    YANG Yijiao, born in 1997 in Nantong, Jiangsu, M.S. candidate. Her research interests include affective computing, pattern recognition, etc.
    ZHANG Shiqing, born in 1980 in Hengyang, Hunan, Ph.D., professor. His research interests include affective computing, pattern recognition, etc.

Survey of Deep Learning Based Multimodal Emotion Recognition

ZHAO Xiaoming 1,2,+, YANG Yijiao 1, ZHANG Shiqing 2

  1. School of Science, Zhejiang University of Science and Technology, Hangzhou 310000, China
    2. Institute of Intelligent Information Processing, Taizhou University, Taizhou, Zhejiang 318000, China
  • Received: 2021-12-20 Revised: 2022-02-14 Online: 2022-07-01 Published: 2022-03-09
  • Supported by:
    the Key Program of the Natural Science Foundation of Zhejiang Province (LZ20F020002); the General Program of the National Natural Science Foundation of China (61976149)


Abstract:

Multimodal emotion recognition aims to recognize human emotional states through different modalities related to human emotion expression, such as audio, vision, and text. This topic is of great importance in the fields of human-computer interaction, artificial intelligence, and affective computing, and has attracted much attention from researchers. In view of the great success of deep learning methods developed in recent years on various tasks, a variety of deep neural networks have been used to learn high-level emotional feature representations for multimodal emotion recognition. In order to systematically summarize the research progress of deep learning methods in the field of multimodal emotion recognition, this paper presents a comprehensive analysis and summary of the recent literature on deep learning based multimodal emotion recognition. First, the general framework of multimodal emotion recognition is given, and the commonly used multimodal emotion datasets are introduced. Then, the principles of representative deep learning techniques and their recent advances are briefly reviewed. Subsequently, this paper focuses on the progress of two key steps in multimodal emotion recognition: emotional feature extraction methods related to audio, vision, text, etc., including hand-crafted feature extraction and deep feature extraction; and multimodal information fusion strategies for integrating different modalities. Finally, the challenges and opportunities in this field are analyzed, and future development directions are pointed out.

Key words: emotion recognition, multimodal, deep learning, hand-crafted feature, deep feature, fusion
