Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (7): 1806-1813.DOI: 10.3778/j.issn.1673-9418.2305086

• Theory·Algorithm •

Submodular Optimization Approach for Entity Summarization in Knowledge Graph Driven by Large Language Models

ZHANG Qi, ZHONG Hao   

  1. School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou 511363, China
    2. School of Computer Science, South China Normal University, Guangzhou 510631, China
  • Online: 2024-07-01 Published: 2024-06-28


Abstract: The continuous expansion of knowledge graphs has made entity summarization a research hotspot. The goal of entity summarization is to obtain a concise description of an entity from the large-scale triple-structured facts that describe it. This research proposes a submodular optimization method for entity summarization driven by a large language model. Firstly, based on the descriptive information of the entities, relations, and properties in the triples, a large language model embeds them into vectors, effectively capturing the semantics of the triples and producing embedding vectors rich in semantic information. Secondly, based on the embedding vectors generated by the large language model, a measure is defined to characterize the relevance between any two triples that describe the same entity: the higher the relevance between two triples, the more similar the information they contain. Finally, based on this relevance measure, a normalized and monotonically non-decreasing submodular function is defined, and entity summarization is modeled as a submodular function maximization problem, so that greedy algorithms with performance guarantees can be applied directly to extract entity summaries. Experiments are conducted on three public benchmark datasets, and the quality of the extracted summaries is evaluated with two metrics, F1 score and NDCG (normalized discounted cumulative gain). Experimental results show that the proposed approach significantly outperforms state-of-the-art methods.

Key words: entity summarization, large language model, submodular function, greedy algorithm
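The pipeline sketched in the abstract — embed triples, measure pairwise relevance, then greedily maximize a normalized monotone submodular function — can be illustrated with a minimal Python sketch. The paper's exact submodular function is not given on this page, so the facility-location-style coverage function below (with cosine relevance clipped to be non-negative, which keeps the function normalized and monotone) is an assumption chosen to match the description; the `(1 - 1/e)` guarantee of the greedy algorithm is the standard result for this class of functions.

```python
import numpy as np

def cosine_relevance(E):
    """Pairwise cosine relevance between triple embeddings (rows of E).

    Values are clipped at 0 so the coverage function below is
    normalized (f(empty) = 0) and monotonically non-decreasing.
    """
    X = E / np.linalg.norm(E, axis=1, keepdims=True)
    return np.clip(X @ X.T, 0.0, None)

def coverage(S, R):
    """Facility-location value: how well selected triples S cover all triples."""
    if not S:
        return 0.0
    return float(R[list(S)].max(axis=0).sum())

def greedy_summary(E, k):
    """Greedy selection of k triples; (1 - 1/e)-approximate for this objective."""
    R = cosine_relevance(E)
    n = len(E)
    selected = []
    for _ in range(min(k, n)):
        best, best_gain = None, -1.0
        for i in range(n):
            if i in selected:
                continue  # each triple enters the summary at most once
            gain = coverage(selected + [i], R) - coverage(selected, R)
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
    return selected
```

In practice the rows of `E` would come from a large language model's embeddings of the triples' textual descriptions; here any real-valued matrix with one row per triple works.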
