Journal of Frontiers of Computer Science and Technology, 2024, Vol. 18, Issue (10): 2501-2520. DOI: 10.3778/j.issn.1673-9418.2403083

• Frontiers·Surveys •

Survey of Multimodal Data Fusion Research

ZHANG Hucheng, LI Leixiao, LIU Dongjiang   

  1. College of Data Science and Application, Inner Mongolia University of Technology, Hohhot 010080, China
  2. Inner Mongolia Autonomous Region Software Service Engineering Technology Research Center Based on Big Data, Hohhot 010080, China
  • Online: 2024-10-01 Published: 2024-09-29

多模态数据融合研究综述

张虎成,李雷孝,刘东江   

  1. 内蒙古工业大学 数据科学与应用学院,呼和浩特 010080
  2. 内蒙古自治区基于大数据的软件服务工程技术研究中心,呼和浩特 010080

Abstract: Deep learning's powerful learning ability has produced excellent results in single-modality applications, but research has found that the feature representation of a single modality rarely captures the complete information about a phenomenon. To overcome this limitation and make fuller use of the value contained in multiple modalities, researchers have proposed multimodal fusion to improve model learning performance. Multimodal fusion enables a machine to exploit the correlation and complementarity among modalities such as text, speech, image, and video, fusing them into a better feature representation that provides a basis for model training. Research on multimodal fusion is still at an early stage of development. Starting from the active research areas of recent years, this paper describes multimodal fusion methods and the multimodal alignment techniques used during fusion. First, it analyzes the applications, advantages, and disadvantages of the joint fusion, cooperative fusion, encoder fusion, and split fusion methods. It then discusses the multimodal alignment problem that arises during fusion, covering explicit and implicit alignment along with their applications, advantages, and disadvantages. Second, it reviews how popular multimodal fusion datasets of recent years have been applied in different fields. Finally, it outlines the challenges facing multimodal fusion and the prospects for future research, with the aim of further promoting its development and application.

Key words: deep learning, multimodal fusion, modal alignment, multimodal applications

摘要: 尽管深度学习强大的学习能力已经在单一模态应用领域取得了优异成果,但研究发现单一模态的特征表示很难完整包含某个现象的完整信息。为了突破在单一模态上特征表示的阻碍,更大化利用多种模态所蕴含的价值,学者们开始提出利用多模态融合的方式去提高模型学习性能。多模态融合技术是让机器在文本、语音、图像和视频中利用模态之间的相关性和互补性融合成更好的特征表示,为模型训练提供基础。目前多模态融合的研究仍处在发展初期阶段,从近几年多模态融合的热门研究领域为出发点,阐述多模态融合方法和融合过程中的多模态对齐技术。重点分析多模态融合方法中的联合融合方法、协同融合方法、编码器融合方法和分裂融合方法在多模态融合中的应用情况与优缺点,阐述在融合过程中的多模态对齐的问题,包括显式对齐和隐式对齐以及应用情况与优缺点。阐述近几年多模态融合领域中热门数据集在不同领域的应用。阐述多模态融合所面临的挑战以及研究展望,以进一步推动多模态融合的发展与应用。

关键词: 深度学习, 多模态融合, 模态对齐, 多模态应用