Subgraph Matching Cardinality Estimation Combining Heuristic and Boosting Method

Journal of Frontiers of Computer Science and Technology, 2022, Vol. 16, Issue (3): 582-590. DOI: 10.3778/j.issn.1673-9418.2009088

HOU Wenzhe, ZHAO Xiang⁺

Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology, Changsha 410000, China

Received: 2020-09-29 Revised: 2021-04-28 Publication date: 2022-03-01 Release date: 2021-05-07
Corresponding author: ⁺ E-mail: xiangzhao@nudt.edu.cn
About the authors: HOU Wenzhe, born in 1997 in Tai'an, Shandong, M.S. candidate. His research interests include graph data analysis and natural language processing.
ZHAO Xiang, born in 1986 in Jinhua, Zhejiang, Ph.D., associate professor, M.S. supervisor. His research interests include knowledge graph technology and big data analysis.
Supported by:
National Natural Science Foundation of China (61872446); National Natural Science Foundation of China (61902417); National Natural Science Foundation of China (71971212); Natural Science Foundation of Hunan Province (2019JJ20024)

Abstract:

Graph data have a natural advantage in modeling relational information and are widely used in applications such as social networks and knowledge representation. Compared with traditional relational database systems, however, the basic operations of graph data management, typified by subgraph matching, still leave room for optimization. In a full-fledged graph database system, scheduling multiple subgraph matching tasks well usually requires an accurate estimate of the cost of each task, in particular the cardinality of its matching results. Existing cardinality estimation methods for subgraph matching do not fully exploit the structural information of the graph, and they can accumulate serious errors in multi-node matching. BoostCard is proposed to address both problems. It represents the neighborhood information of each node to aggregate that node's local structural features, and it uses statistical methods to estimate the probability that two nodes are connected by an edge, which yields an initial prediction of the matching cardinality. Building on the node structural features obtained in this first stage, it then applies the idea of Boosting to compensate the prediction globally. The result is an intelligent cardinality estimator for subgraph matching and a widely applicable framework for the task. Experiments show that, compared with conventional statistics-based methods, BoostCard delivers a clear performance gain in cardinality estimation for subgraph matching on real-life data sets, especially for multi-node subgraph matching.

Key words: graph data, subgraph matching, cardinality estimation, Boosting learning
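The statistics-based initial estimate mentioned in the abstract can be illustrated with a small sketch. This is not BoostCard itself: it is a generic independence-based estimator, written under the assumption that the data graph is undirected and node-labeled, and it omits the neighborhood features and the Boosting compensation stage entirely. All function names and the toy graph are hypothetical.

```python
from collections import Counter
from itertools import permutations


def edge_probabilities(labels, edges):
    """Estimate P(edge | label pair) from an undirected, node-labeled graph."""
    counts = Counter(labels.values())
    linked = Counter()
    for x, y in edges:
        # Count each undirected edge once per orientation.
        linked[(labels[x], labels[y])] += 1
        linked[(labels[y], labels[x])] += 1
    probs = {}
    for (a, b), k in linked.items():
        # Ordered pairs of distinct nodes that carry labels (a, b).
        pairs = counts[a] * (counts[b] - 1 if a == b else counts[b])
        probs[(a, b)] = k / pairs
    return counts, probs


def estimate_cardinality(q_labels, q_edges, counts, probs):
    """Candidate combinations times edge probabilities, assuming independence."""
    est = 1.0
    for lab in q_labels.values():
        est *= counts.get(lab, 0)
    for u, v in q_edges:
        est *= probs.get((q_labels[u], q_labels[v]), 0.0)
    return est


def exact_cardinality(q_labels, q_edges, labels, edges):
    """Brute-force count of injective mappings that preserve labels and edges."""
    adj = {frozenset(e) for e in edges}
    q_nodes = list(q_labels)
    total = 0
    for image in permutations(labels, len(q_nodes)):
        m = dict(zip(q_nodes, image))
        labels_ok = all(labels[m[u]] == q_labels[u] for u in q_nodes)
        edges_ok = all(frozenset((m[u], m[v])) in adj for u, v in q_edges)
        if labels_ok and edges_ok:
            total += 1
    return total


# Toy data graph (hypothetical): five nodes, three labels, four edges.
labels = {0: "A", 1: "A", 2: "B", 3: "B", 4: "C"}
edges = [(0, 2), (1, 2), (1, 3), (3, 4)]
counts, probs = edge_probabilities(labels, edges)

# Two-node query A-B and three-node path query A-B-C.
q2_labels, q2_edges = {"u": "A", "v": "B"}, [("u", "v")]
q3_labels, q3_edges = {"u": "A", "v": "B", "w": "C"}, [("u", "v"), ("v", "w")]

est2 = estimate_cardinality(q2_labels, q2_edges, counts, probs)
est3 = estimate_cardinality(q3_labels, q3_edges, counts, probs)
true2 = exact_cardinality(q2_labels, q2_edges, labels, edges)
true3 = exact_cardinality(q3_labels, q3_edges, labels, edges)
```

On this toy graph the two-node estimate matches the exact count (3), while the three-node path estimate is 1.5 against an exact count of 1. The gap comes from treating the two edges as independent, which is the kind of error that can compound as the query grows and that a learned compensation stage would aim to reduce.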

CLC number: TP391

HOU Wenzhe, ZHAO Xiang. Subgraph Matching Cardinality Estimation Combining Heuristic and Boosting Method[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(3): 582-590.

Figures/Tables 7

Fig.1 Sample of exact matching (subgraph isomorphism)

Fig.2 Structural information acquisition

Fig.3 Loss function

Table 1 Estimated $r^2$ values

Method	Yeast $n = 2$	Yeast $n = 3$	Yeast $n = 4$	Yeast $n = 5$	Human $n = 2$	Human $n = 3$	Human $n = 4$	Human $n = 5$	Random $n = 2$	Random $n = 3$	Random $n = 4$	Random $n = 5$
STAT	0.922	0.392	-	-	0.681	-	-	-	0.924	-	-	-
BoostCard	0.900	0.800	0.663	0.422	0.205	0.175	0.135	0.110	0.895	0.737	0.448	-
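For readers unfamiliar with the metric in Table 1, the sketch below shows one common way to compute an r² score, together with the mean squared error. It assumes r² means the coefficient of determination, 1 - SS_res / SS_tot; the page does not say which definition the paper uses, and the sample numbers are invented for illustration, not taken from the paper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and estimated values."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


# Invented example: true cardinalities and estimates for four queries.
true_cards = [3, 1, 4, 8]
estimates = [3.0, 1.5, 4.0, 7.0]

r2 = r_squared(true_cards, estimates)
mse = mean_squared_error(true_cards, estimates)
```

Under this definition a perfect estimator scores 1.0, and an estimator no better than always predicting the mean cardinality scores 0.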

Fig.4 Mean squared error of the estimates

Table 2 Accuracy of the estimation compensator

$\delta$	Human $n = 2$	Human $n = 5$	Facebook $n = 2$	Facebook $n = 5$
0.25	0.851	0.818	0.799	0.762
0.50	0.878	0.856	0.827	0.787
0.75	0.912	0.904	0.856	0.794

Fig.5 Accuracy and recall (number of correct executions) of estimation compensation
