Construction Method of Textbook Knowledge Graph Based on Multimodal and Knowledge Distillation

doi:10.3778/j.issn.1673-9418.2406054

Abstract

Abstract: To efficiently construct a multimodal subject knowledge graph for the education domain, this paper proposes an entity relationship extraction algorithm for textbook text based on large-model knowledge distillation and multi-model collaborative reasoning. In the training phase, a closed-source model at the hundred-billion-parameter scale annotates the text data, which realizes implicit knowledge distillation. An open-source model at the billion-parameter scale is then instruction-tuned on the domain data to improve its instruction-following ability on the entity relationship extraction task. In the inference phase, the closed-source model serves as the guiding model and the open-source billion-parameter model serves as the execution model. Experimental results show that knowledge distillation, multi-model collaboration, and domain-data instruction fine-tuning are effective and significantly improve instruction-prompted entity relationship extraction from textbook text. The paper also proposes a multimodal named entity recognition algorithm for textbook diagrams that is enhanced with explicit and implicit knowledge. First, image OCR (optical character recognition) and vision-language models are used to extract the text in each diagram and a global description of its content. Then, explicit knowledge-base retrieval and implicit LLM prompting are used to obtain auxiliary knowledge that may be associated with each image-title pair, and the knowledge from the two sources is fused into the final auxiliary knowledge. Finally, the auxiliary knowledge is concatenated with the diagram title to perform multimodal named entity recognition on the title. Experimental results show that the algorithm achieves advanced performance and that its interpretability is enhanced.
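
The diagram-side pipeline summarized in the abstract (extract OCR text and a global description, gather explicit and implicit knowledge, fuse it, then concatenate it with the title) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the paper: `DiagramSample`, `fuse_knowledge`, `build_ner_input`, the `[SEP]` separator, the title-first ordering, and the use of simple de-duplication as the fusion step are all assumptions. The knowledge-base retriever and the LLM are passed in as plain callables so that the sketch does not presume any particular library.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DiagramSample:
    title: str        # diagram title on which entities are recognized
    ocr_text: str     # text read from the diagram by an OCR tool
    description: str  # global content description from a vision-language model


def fuse_knowledge(explicit: List[str], implicit: List[str]) -> List[str]:
    """Merge both knowledge sources, dropping blanks and exact duplicates.

    Assumption: the abstract does not say how fusion is done, so plain
    order-preserving de-duplication stands in for it here.
    """
    fused: List[str] = []
    seen = set()
    for snippet in explicit + implicit:
        key = snippet.strip()
        if key and key not in seen:
            seen.add(key)
            fused.append(key)
    return fused


def build_ner_input(
    sample: DiagramSample,
    retrieve: Callable[[str], List[str]],    # hypothetical knowledge-base lookup
    prompt_llm: Callable[[str], List[str]],  # hypothetical LLM prompting call
    sep: str = " [SEP] ",                    # assumed separator token
) -> str:
    """Return the title concatenated with the fused auxiliary knowledge."""
    # Both sources are queried with the title plus the extracted image text.
    query = " ".join([sample.title, sample.ocr_text, sample.description])
    knowledge = fuse_knowledge(retrieve(query), prompt_llm(query))
    if not knowledge:
        return sample.title
    return sample.title + sep + " ".join(knowledge)


if __name__ == "__main__":
    sample = DiagramSample(
        title="Structure of a plant cell",
        ocr_text="cell wall nucleus chloroplast",
        description="A labeled diagram of a plant cell",
    )
    # Stub knowledge sources used only for demonstration.
    kb = lambda q: ["A plant cell has a cell wall.", "The nucleus stores DNA."]
    llm = lambda q: ["The nucleus stores DNA.", "Chloroplasts perform photosynthesis."]
    print(build_ner_input(sample, kb, llm))
```

The string returned by `build_ner_input` would then be fed to a text NER model that tags only the title span. Keeping the retrieved and generated snippets as readable text is one plausible reason such a design is easier to inspect than fused image features.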

Key words: large language model, disciplinary knowledge graph, entity relationship extraction, multimodal named entity recognition, knowledge distillation

LIU Jun, LENG Fangling, WU Wangwang, BAO Yubin. Construction Method of Textbook Knowledge Graph Based on Multimodal and Knowledge Distillation[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(11): 2901-2911.
