
Journal of Frontiers of Computer Science and Technology ›› 2025, Vol. 19 ›› Issue (7): 1729-1746. DOI: 10.3778/j.issn.1673-9418.2411008
• Frontiers·Surveys •
SHI Zhenpu, LYU Xiao, DONG Yanru, LIU Jing, WANG Xiaoyan
Online: 2025-07-01
Published: 2025-06-30
时振普,吕潇,董彦如,刘静,王晓燕
SHI Zhenpu, LYU Xiao, DONG Yanru, LIU Jing, WANG Xiaoyan. Research on Development Status of Multimodal Knowledge Graph Fusion Technology in Medical Field[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(7): 1729-1746.
时振普, 吕潇, 董彦如, 刘静, 王晓燕. 医学领域多模态知识图谱融合技术发展现状研究[J]. 计算机科学与探索, 2025, 19(7): 1729-1746.
URL: http://fcst.ceaj.org/EN/10.3778/j.issn.1673-9418.2411008