Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (10): 2501-2520. DOI: 10.3778/j.issn.1673-9418.2403083
• Frontiers·Surveys •
Survey of Multimodal Data Fusion Research
ZHANG Hucheng, LI Leixiao, LIU Dongjiang
张虎成, 李雷孝, 刘东江
Online: 2024-10-01
Published: 2024-09-29
ZHANG Hucheng, LI Leixiao, LIU Dongjiang. Survey of Multimodal Data Fusion Research[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(10): 2501-2520.
张虎成, 李雷孝, 刘东江. 多模态数据融合研究综述[J]. 计算机科学与探索, 2024, 18(10): 2501-2520.
URL: http://fcst.ceaj.org/EN/10.3778/j.issn.1673-9418.2403083
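For reference managers, the citation above can be written as a BibTeX entry. This is a minimal sketch assembled from the metadata on this page; the entry key is arbitrary, and the journal's official BibTeX export may format fields differently:

@article{zhang2024multimodalfusion,
  author  = {Zhang, Hucheng and Li, Leixiao and Liu, Dongjiang},
  title   = {Survey of Multimodal Data Fusion Research},
  journal = {Journal of Frontiers of Computer Science and Technology},
  year    = {2024},
  volume  = {18},
  number  = {10},
  pages   = {2501--2520},
  doi     = {10.3778/j.issn.1673-9418.2403083}
}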