Survey of Multimodal Data Fusion Research

doi:10.3778/j.issn.1673-9418.2403083

Abstract

Deep learning has achieved excellent results in single-modal applications, but the feature representation of a single modality rarely captures the complete information about a phenomenon. To overcome this limitation and make fuller use of the value contained in multiple modalities, researchers have proposed multimodal fusion as a way to improve model learning performance. Multimodal fusion enables a machine to exploit the correlation and complementarity between modalities such as text, speech, image, and video, fusing them into a better feature representation that provides a basis for model training. Research on multimodal fusion is still at an early stage of development. Starting from the research areas in which multimodal fusion has been most active in recent years, this paper reviews multimodal fusion methods and the multimodal alignment techniques used during fusion. First, it analyzes the applications, advantages, and disadvantages of the joint fusion, cooperative fusion, encoder fusion, and split fusion methods. It also discusses the multimodal alignment problem that arises during fusion, covering explicit and implicit alignment together with their applications, advantages, and disadvantages. Second, it describes how popular multimodal fusion datasets have been applied in different fields in recent years. Finally, it outlines the challenges and research prospects of multimodal fusion, with the aim of further promoting its development and application.

Key words: deep learning, multimodal fusion, modal alignment, multimodal applications

ZHANG Hucheng, LI Leixiao, LIU Dongjiang. Survey of Multimodal Data Fusion Research[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(10): 2501-2520.
