计算机科学与探索 ›› 2022, Vol. 16 ›› Issue (3): 582-590.DOI: 10.3778/j.issn.1673-9418.2009088

• 人工智能 • 上一篇    下一篇

融合启发式和Boosting的子图匹配基数估计方法

侯文哲, 赵翔+()   

  1. 国防科技大学 信息系统工程重点实验室,长沙 410000
  • 收稿日期:2020-09-29 修回日期:2021-04-28 出版日期:2022-03-01 发布日期:2021-05-07
  • 通讯作者: + E-mail: xiangzhao@nudt.edu.cn
  • 作者简介:侯文哲(1997—),男,山东泰安人,硕士研究生,主要研究方向为图数据分析、自然语言处理。
    赵翔(1986—),男,浙江金华人,博士,副教授,硕士生导师,主要研究方向为知识图谱技术、大数据分析。
  • 基金资助:
    国家自然科学基金(61872446);国家自然科学基金(61902417);国家自然科学基金(71971212);湖南省自然科学基金(2019JJ20024)

Subgraph Matching Cardinality Estimation Combining Heuristic and Boosting Method

HOU Wenzhe, ZHAO Xiang+()   

  1. Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology, Changsha 410000, China
  • Received:2020-09-29 Revised:2021-04-28 Online:2022-03-01 Published:2021-05-07
  • About author:HOU Wenzhe, born in 1997, M.S. candidate. His research intersts include graph data analysis and natural language processing.
    ZHAO Xiang, born in 1986, Ph.D., associate professor, M.S. supervisor. His research interests include data graph and big data analysis.
  • Supported by:
    National Natural Science Foundation of China(61872446);National Natural Science Foundation of China(61902417);National Natural Science Foundation of China(71971212);Natural Science Foundation of Hunan Province(2019JJ20024)

摘要:

由于在建模关联信息方面具备天然优势,图数据已在社交网络、知识表示等方面被广泛运用。但是相较于传统的关系型数据库系统,图数据管理中的以子图匹配为代表的一系列基础操作仍有进一步优化的空间。在一个完善的图数据库系统中,为实现多个子图匹配任务的优化调度,往往需要对每个任务的代价,尤其是匹配结果的基数进行准确预估。然而,现有的子图匹配基数预估方法缺乏对图结构信息的充分考量,且在多结点匹配中存在严重的潜在累计误差。BoostCard方法通过对各结点的邻域信息进行表示,来聚合结点的局部结构特征,同时运用统计方法估计不同结点之间连接成边的概率从而实现匹配基数的初步预测。而后在初期获取的结点结构特征的基础上,采用提升学习的思想对预测结果进行全局补偿,可实现智能化的子图匹配基数估计,是一种具有广泛适用性的子图匹配预测框架。通过实验可知,相比于传统的统计方法,BoostCard在真实数据集的子图匹配基数估计,尤其是多结点子图匹配问题上有明显的性能提升。

关键词: 图数据, 子图匹配, 基数估计, 提升学习

Abstract:

Attributed to its innate advantage in modeling relational information, graph data have been widely leveraged in various applications including social network, knowledge representation, etc. Compared with traditional relational database systems, primitive operators in graph data management, represented by subgraph matching, still observe space for further optimization. In a fully-fledged graph database system, in order to optimize the schedule of multiple subgraph matching tasks, it usually necessitates accurate estimation of the cost of every task, especially of the cardinality of matching results. However, current methods for cardinality estimation of subgraph matching fail to fully exploit the structural information in the graph, and moreover, it may potentially result in serious error accumulation. In this light, BoostCard is proposed to aggregate the local structural features via representing neighborhood information of nodes. Meanwhile, it utilizes statistical method to predict whether there is an edge between two nodes. And it adopts the Boosting strategy to compensate globally based on the acquired features of nodes. Hence, it can achieve intelligent cardinality estimation of subgraph matching, making itself a widely applicable framework for the task. In comparison with conventional statistic-based methods, BoostCard offers significant performance gain in the cardinality estimation of multi-node subgraph matching over real-life data sets.

Key words: graph data, subgraph matching, cardinality estimation, Boosting learning

中图分类号: