Journal of Frontiers of Computer Science and Technology

Dual-layer Fusion Knowledge Reasoning with Enhanced Multi-modal Features

JING Boxiang, WANG Hairong, WANG Tong, YANG Zhenye   

  1. School of Computer Science and Engineering, North Minzu University, Yinchuan 750021, China
    2. The Key Laboratory of Images & Graphics Intelligent Processing of State Ethnic Affairs Commission, North Minzu University, Yinchuan 750021, China

Abstract: Most existing multi-modal knowledge reasoning methods fuse the multi-modal features extracted by pre-trained models directly, through concatenation or attention, and thus often ignore the heterogeneity among modalities and the complexity of their interactions. To address this, a dual-layer fusion knowledge reasoning method with enhanced multi-modal features is proposed. The structural information embedding module uses an adaptive graph attention mechanism to filter and aggregate key neighbor information, enhancing the semantic representation of entity and relation embeddings. The multi-modal information embedding module applies different attention mechanisms to capture both the features unique to each modality and the features common across modalities, and uses the complementary information in the common features for cross-modal interaction, reducing the heterogeneity gap between modalities. The multi-modal feature fusion module adopts a dual-layer strategy that combines low-rank multi-modal feature fusion with decision fusion, realizing dynamic, complex interactions both within and across modalities while weighing each modality's contribution to reasoning, which yields more comprehensive predictions. To verify the effectiveness of the proposed method, experiments were carried out on the FB15K-237, DB15K, and YAGO15K datasets. The results show that, compared with multi-modal reasoning methods, MRR and Hits@1 on FB15K-237 improve by an average of 3.6% and 2.2%, respectively; compared with single-modal reasoning methods, MRR and Hits@1 improve by an average of 13.7% and 14.6%, respectively.

Key words: multi-modal knowledge graph, link prediction, knowledge reasoning, multi-modal feature fusion
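The dual-layer fusion strategy described in the abstract combines low-rank multi-modal feature fusion with decision-level fusion. The PyTorch sketch below illustrates the general shape of such a pipeline; it is a minimal sketch under illustrative assumptions, not the authors' implementation: the three modalities (structural, visual, textual), the class names, dimensions, rank, and the learned softmax weighting over modality scores are all ours.

```python
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    """Low-rank multi-modal fusion: approximates the full tensor product
    of modality embeddings with rank-R modality-specific factor matrices."""
    def __init__(self, dims, out_dim, rank=4):
        super().__init__()
        # One factor per modality; d + 1 accommodates the constant 1
        # appended to each embedding, as in standard low-rank fusion.
        self.factors = nn.ParameterList(
            [nn.Parameter(torch.randn(rank, d + 1, out_dim) * 0.1) for d in dims]
        )

    def forward(self, feats):
        # feats: list of (batch, d_m) tensors, one per modality.
        fused = None
        for f, w in zip(feats, self.factors):
            ones = torch.ones(f.size(0), 1, device=f.device)
            fm = torch.cat([f, ones], dim=1)           # (batch, d_m + 1)
            proj = torch.einsum('bd,rdo->bro', fm, w)  # (batch, rank, out)
            # Element-wise product across modalities per rank slice.
            fused = proj if fused is None else fused * proj
        return fused.sum(dim=1)                        # sum over rank slices

class DecisionFusion(nn.Module):
    """Decision-level fusion: learn a contribution weight for each
    modality's own score plus the jointly fused score."""
    def __init__(self, n_modalities):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_modalities + 1))

    def forward(self, modality_scores, fused_score):
        # modality_scores: list of (batch,) plausibility scores.
        scores = torch.stack(modality_scores + [fused_score], dim=-1)
        weights = torch.softmax(self.logits, dim=-1)
        return (scores * weights).sum(dim=-1)

# Toy usage: structural, visual, and textual embeddings for 8 triples.
s, v, t = torch.randn(8, 200), torch.randn(8, 128), torch.randn(8, 300)
fusion = LowRankFusion(dims=[200, 128, 300], out_dim=1, rank=4)
fused_score = fusion([s, v, t]).squeeze(-1)      # (8,) joint score
per_mod = [s.sum(-1), v.sum(-1), t.sum(-1)]      # stand-in per-modality scores
final = DecisionFusion(n_modalities=3)(per_mod, fused_score)
```

In this reading, the low-rank layer handles the dynamic inter- and intra-modal interactions at feature level, while the decision layer accounts for how much each modality should contribute to the final prediction.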
