人工智能赋能的查询处理与优化新技术研究综述

doi:10.3778/j.issn.1673-9418.1911063

摘要/Abstract

摘要：

数据查询处理与优化作为数据管理中最具挑战性的问题之一，一直受到广泛关注。传统的查询处理与优化技术在实际使用中需要针对特定的工作负载和数据集进行大量的手动调优，因而已经无法满足现代数据库系统的发展需求。受人工智能（AI）成功应用于多领域研究的启发，近期人工智能赋能的查询处理与优化新技术相继被提出并取得了一定的研究成果。针对这些研究工作，首先给出了人工智能赋能的查询处理与优化技术的主要任务，分析了与传统人工智能任务的区别。其次梳理了该领域的主要研究进展，并总结了主要优势与应用瓶颈。接着讨论了当前所面临的主要技术挑战。最后对该领域的未来发展进行了展望。

关键词: 查询优化, 人工智能, 机器学习, 深度学习, 数据库系统

Abstract:

As one of the most challenging problems in data management, query processing and optimization are always widely concerned by researchers. However, it is very difficult for traditional techniques to meet the diverse requirements of modern database system due to the needs of hand-tuning for specific workloads and datasets. Inspired by advances in applying artificial intelligence (AI) to multi-field researches, recently, the AI powered new techniques for query processing and optimization have been proposed and made significant success. In view of these researches, this paper first presents the main tasks of the AI powered new techniques of query processing and optimization, and analyzes the differences between the new tasks and traditional AI tasks. Second, the recent research progress is reviewed, and the main advantages and application bottlenecks are summarized. Third, this paper discusses the main challenges of the AI powered new techniques of query processing and optimization. Finally, the future research directions are prospected.

Key words: query optimization, artificial intelligence, machine learning, deep learning, database systems

宋雨萌，谷峪，李芳芳，于戈. 人工智能赋能的查询处理与优化新技术研究综述[J]. 计算机科学与探索, 2020, 14(7): 1081-1103.

SONG Yumeng, GU Yu, LI Fangfang, YU Ge. Survey on AI Powered New Techniques for Query Processing and Optimization[J]. Journal of Frontiers of Computer Science and Technology, 2020, 14(7): 1081-1103.

参考文献

[1] Zhou A Y, Jin C Q, Wang G R, et al. A survey on the management of uncertain data[J]. Chinese Journal of Computers, 2009, 32(1): 1-16. 周傲英, 金澈清, 王国仁, 等. 不确定性数据管理技术研究综述[J]. 计算机学报, 2009, 32(1): 1-16.
[2] Kraska T, Alizadeh M, Beutel A, et al. Sagedb: a learned data- base system[C]//Proceedings of the 9th Biennial Conference on Innovative Data Systems Research, Asilomar, Jan 13-16, 2019: 13.
[3] Wang W, Zhang M, Chen G, et al. Database meets deep lear-ning: challenges and opportunities[J]. ACM SIGMOD Record, 2016, 45(2): 17-22.
[4] Sun L M, Zhang S M, Ji T, et al. Survey of data management techniques powered by artificial intelligence[J]. Journal of Software, 2020, 31(3): 600-619. 孙路明, 张少敏, 姬涛，等. 人工智能赋能的数据管理技术研究[J]. 软件学报, 2020, 31(3): 600-619.
[5] Li G L, Zhou X H, Sun J, et al. A survey of machine-learning-based database techniques[J/OL]. Chinese Journal of Computers (2019-11-04) [2020-02-20]. http://kns.cnki.net/kcms/detail/11.1826.TP.20191104.1009.002.html. 李国良, 周煊赫, 孙佶, 等. 基于机器学习的数据库技术综述[J/OL]. 计算机学报(2019-11-04) [2020-02-20]. http://kns.cnki.net/kcms/detail/11.1826.TP.20191104.1009.002.html.
[6] Meng X F, Ma C H, Yang C. Survey on machine learning for database systems[J]. Journal of Computer Research and Development, 2019, 56(9): 1803-1820. 孟小峰, 马超红, 杨晨. 机器学习化数据库系统研究综述[J]. 计算机研究与发展, 2019, 56(9): 1803-1820.
[7] LeCun Y, Bengio Y, Hinton G. Deep learning[J]. Nature, 2015, 521(7553): 436-444.
[8] Kohonen T. An introduction to neural computing[J]. Neural Networks, 1988, 1(1): 3-16.
[9] Sutton R S, Barto A G. Reinforcement learning: an introduction[M]. Cambridge: MIT Press, 2018.
[10] Hu B, Lu Z, Li H, et al. Convolutional neural network architectures for matching natural language sentences[C]//Proceedings of the 2014 Annual Conference on Neural Information Processing Systems, Montreal, Dec 8-13, 2014. Cambridge: MIT Press, 2014: 2042-2050.
[11] Lipton Z C, Berkowitz J, Elkan C. A critical review of recurrent neural networks for sequence learning[J]. arXiv:1506. 00019, 2015.
[12] Graefe G, Larson P A. B-tree indexes and CPU caches[C]//Proceedings of the 17th International Conference on Data Engineering, Heidelberg, Apr 2-6, 2001. Piscataway: IEEE, 2001: 349-358.
[13] Bayer R, McCreight E. Organization and maintenance of large ordered indexes[M]//Software Pioneers. Berlin, Heidelberg: Springer, 2002: 245-262.
[14] Lehman T J, Carey M J. A study of index structures for main memory database management systems[R]. University of Wisconsin-Madison. Department of Computer Sciences, 1985.
[15] Bayer R. Symmetric binary B-trees: data structure and maintenance algorithms[J]. Acta Informatica, 1972, 1(4): 290-306.
[16] Galakatos A, Markovitch M, Binnig C, et al. A-tree: a bounded approximate index structure[J]. arXiv:1801.10207, 2018.
[17] Hadian A, Heinis T. Interpolation-friendly B-trees: bridging the gap between algorithmic and learned indexes[C]//Proceedings of the 22nd International Conference on Extending Database Technology, Lisbon, Mar 26, 2019. Berlin, Heidelberg: Springer, 2019: 710-713.
[18] Kraska T, Beutel A, Chi E H, et al. The case for learned index structures[C]//Proceedings of the 2018 International Conference on Management of Data, Houston, Jun 10-15, 2018. New York: ACM, 2018: 489-504.
[19] Mitzenmacher M. Optimizing learned bloom filters by sandwiching[J]. arXiv:1803.01474, 2018.
[20] Oosterhuis H, Culpepper J S, De Rijke M. The potential of learned index structures for index compression[J]. arXiv:1811.06678, 2018.
[21] Hadian A, Heinis T. Considerations for handling updates in learned index structures[C]//Proceedings of the 2nd International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, Amsterdam, Jul 5, 2019. New York: ACM, 2019: 3.
[22] Marcus R, Papaemmanouil O. Towards a hands-free query optimizer through deep learning[J]. arXiv:1809.10212, 2018.
[23] Wang Z, Bapst V, Heess N, et al. Sample efficient actor-critic with experience replay[J]. arXiv:1611.01224, 2016.
[24] The PostgreSQL Global Development Group. PostgreSQL: the world??s most advanced open source database[EB/OL]. [2020-03-16]. http://www.postgresql.org/.
[25] Microsoft. SQL server[EB/OL]. [2020-03-16]. https://microsoft.com/sql-server/.
[26] Marcus R, Papaemmanouil O. Deep reinforcement learning for join order enumeration[C]//Proceedings of the 1st International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, Houston, Jun 10, 2018. New York: ACM, 2018: 3.
[27] Krishnan S, Yang Z, Goldberg K, et al. Learning to optimize join queries with deep reinforcement learning[J]. arXiv: 1808.03196, 2018.
[28] Trummer I, Wang J, Maram D, et al. SkinnerDB: regret-bounded query evaluation via reinforcement learning[C]//Proceedings of the 2019 International Conference on Management of Data, Amsterdam, Jun 30-Jul 5, 2019. New York: ACM, 2019: 1153-1170.
[29] Kocsis L, Szepesvári C. Bandit based Monte-Carlo planning[C]//Proceedings of the 2006 European Conference on Machine Learning, Berlin, Sep 18-22, 2006. Berlin, Heidelberg: Springer, 2006: 282-293.
[30] Leis V, Gubichev A, Mirchev A, et al. How good are query optimizers, really?[J]. Proceedings of the VLDB Endowment, 2015, 9(3): 204-215.
[31] Akdere M, Çetintemel U, Riondato M, et al. Learning-based query performance modeling and prediction[C]//Proceedings of the 28th IEEE International Conference on Data Engineering, Washington, Apr 1-5, 2012. Piscataway: IEEE, 2012: 390-401.
[32] Marcus R, Papaemmanouil O. Plan-structured deep neural network models for query performance prediction[J]. arXiv:1902.00132, 2019.
[33] Idreos S, Zoumpatianos K, Hentschel B, et al. The data calculator: data structure design and cost synthesis from first principles and learned cost models[C]//Proceedings of the 2018 International Conference on Management of Data, Houston, Jun 10-15, 2018. New York: ACM, 2018: 535-550.
[34] Sun J, Li G. An end-to-end learning-based cost estimator[J]. arXiv:1906.02560, 2019.
[35] Karnagel T, Habich D, Lehner W. Local vs. global optimization: operator placement strategies in heterogeneous environments[C]//Proceedings of the EDBT/ICDT Workshops, Brussels, Mar 27, 2015. Berlin, Heidelberg: Springer, 2015: 48-55.
[36] Leis V, Radke B, Gubichev A, et al. Cardinality estimation done right: index-based join sampling[C]//Proceedings of the 8th Biennial Conference on Innovative Data Systems Research, Chaminade, Jan 8-11, 2017: 8.
[37] Kipf A, Vorona D, Müller J, et al. Estimating cardinalities with deep sketches[J]. arXiv:1904.08223, 2019.
[38] Lakshmi S, Zhou S. Selectivity estimation in extensible databases—a neural network approach[C]//Proceedings of the 24th International Conference on Very Large Data Bases, New York, Aug 24-27, 1998, New York: ACM, 1998: 623-627.
[39] Malik T, Burns R C, Chawla N V. A black-box approach to query cardinality estimation[C]//Proceedings of the 3rd Biennial Conference on Innovative Data Systems Research, Asilomar, Jan 7-10, 2007: 56-67.
[40] Liu H, Xu M, Yu Z, et al. Cardinality estimation using neural networks[C]//Proceedings of the 25th Annual International Conference on Computer Science and Software Engineering, Markham, Nov 2-4, 2015. New York: ACM, 2015: 53-59.
[41] Hasan S, Thirumuruganathan S, Augustine J, et al. Multi-attribute selectivity estimation using deep learning[J]. arXiv: 1903.09999, 2019.
[42] Kipf A, Kipf T, Radke B, et al. Learned cardinalities: estimating correlated joins with deep learning[J]. arXiv:1809. 00677, 2018.
[43] Woltmann L, Hartmann C, Thiele M, et al. Cardinality estimation with local deep learning models[C]//Proceedings of the 2nd International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, Amsterdam, Jul 5, 2019. New York: ACM, 2019: 5.
[44] Ortiz J, Balazinska M, Gehrke J, et al. Learning state representations for query optimization with deep reinforcement learning[J]. arXiv:1803.08604, 2018.
[45] Germain M, Gregor K, Murray I, et al. Made: masked autoencoder for distribution estimation[C]//Proceedings of the 2015 International Conference on Machine Learning, Lille, Jul 6-11, 2015. New York: ACM, 2015: 881-889.
[46] Ma Q, Triantafillou P. DBEst: revisiting approximate query processing engines with machine learning models[C]//Proceedings of the 2019 International Conference on Management of Data, Amsterdam, Jun 30 - Jul 5, 2019. New York: ACM, 2019: 1553-1570.
[47] Thirumuruganathan S, Hasan S, Koudas N, et al. Approximate query processing using deep generative models[J]. arXiv:1903.10000, 2019.
[48] Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python[J]. Journal of Machine Learning Research, 2011, 12: 2825-2830.
[49] Chen T, Guestrin C. Xgboost: a scalable tree Boosting system[C]//Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, Aug 13-17, 2016. New York: ACM, 2016: 785-794.
[50] Prokhorenkova L, Gusev G, Vorobev A, et al. CatBoost: unbiased Boosting with categorical features[C]//Proceedings of the 2018 Annual Conference on Neural Information Processing Systems, Montréal, Dec 3-8, 2018. Cambridge: MIT Press, 2018: 6638-6648.
[51] Ke G, Meng Q, Finley T, et al. Lightgbm: a highly efficient gradient Boosting decision tree[C]//Proceedings of the 2017 Annual Conference on Neural Information Processing Systems, Long Beach, Dec 4-9, 2017. Cambridge: MIT Press, 2017: 3146-3154.
[52] Friedman J H. Stochastic gradient Boosting[J]. Computational Statistics & Data Analysis, 2002, 38(4): 367-378.
[53] Mozafari B. Approximate query engines: commercial challenges and research opportunities[C]//Proceedings of the 2017 ACM International Conference on Management of Data, Chicago, May 14-19, 2017. New York: ACM, 2017: 521-524.
[54] Mozafari B, Niu N. A handbook for building an approximate query engine[J]. IEEE Data Engineering Bulletin, 2015, 38(3): 3-29.
[55] Grover A, Gummadi R, Lazaro-Gredilla M, et al. Variational rejection sampling[J]. arXiv:1804.01712, 2018.
[56] Park Y, Mozafari B, Sorenson J, et al. Verdictdb: universalizing approximate query processing[C]//Proceedings of the 2018 International Conference on Management of Data, Houston, Jun 10-15, 2018. New York: ACM, 2018: 1461-1476.
[57] Agrawal S, Chaudhuri S, Kollar L, et al. Database tuning advisor for microsoft SQL server 2005[C]//Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, Baltimore, Jun 14-16, 2005. New York: ACM, 2005: 930-932.
[58] Bruno N, Chaudhuri S. An online approach to physical design tuning[C]//Proceedings of the 23rd IEEE International Conference on Data Engineering, Istanbul, Apr 15-20, 2007. Piscataway: IEEE, 2007: 826-835.
[59] Chaudhuri S, Narasayya V R. An efficient, cost-driven index selection tool for Microsoft SQL server[C]//Proceedings of the 23rd International Conference on Very Large Data Bases, Athens, Aug 25-29, 1997. New York: ACM, 1997: 146-155.
[60] Chaudhuri S, Narasayya V. AutoAdmin “what-if” index analysis utility[C]//Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data. New York: ACM, 1998: 367-378.
[61] Chaudhuri S, Narasayya V. Self-tuning database systems: a decade of progress[C]//Proceedings of the 33rd International Conference on Very Large Data Bases, Austria, Sep 23-27, 2007. New York: ACM, 2007: 3-14.
[62] Ding B, Das S, Marcus R, et al. Ai meets ai: leveraging query executions to improve index recommendations[C]//Proceedings of the 2019 International Conference on Management of Data, Amsterdam, Jun 30-Jul 5, 2019. New York: ACM, 2019: 1241-1258.
[63] Vu T. Deep query optimization[C]//Proceedings of the 2019 International Conference on Management of Data, Amsterdam, Jun 30-Jul 5, 2019. New York: ACM, 2019: 1856-1858.
[64] Qiu T, Wang B, Shu Z W, et al. Intelligent index tuning approach for relational databases[J]. Journal of Software, 2020, 31(3): 634-647. 邱涛, 王斌, 舒昭维, 等. 一种面向关系数据库的智能索引调优方法[J]. 软件学报, 2020, 31(3): 634-647.
[65] Sun J, Li G L. An end-to-end query optimization engine based on deep learning[J]. Journal of Chifeng University(Natural Science Edition), 2019, 35(1): 1-5. 孙佶, 李国良. 一个端到端的基于深度学习的查询优化引擎[J]. 赤峰学院学报(自然科学版), 2019, 35(1): 1-5.
[66] Marcus R, Negi P, Mao H, et al. Neo: a learned query optimizer[J]. arXiv:1904.03711, 2019.
[67] Mou L, Li G, Zhang L, et al. Convolutional neural networks over tree structures for programming language processing[C]//Proceedings of the 30th AAAI Conference on Artificial Intelligence, Phoenix, Feb 12-17, 2016. Menlo Park: AAAI, 2016: 1287-1293.
[68] Dechter R, Pearl J. Generalized best-first search strategies and the optimality of A[J]. Journal of the ACM, 1985, 32(3): 505-536.
[69] Bellman R. A Markovian decision process[J]. Indiana University Mathematics Journal, 1957, 6(4): 15.
[70] Xu L, Cole R L, Ting D. Learning to optimize federated queries[C]//Proceedings of the 2nd International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, Amsterdam, Jul 5, 2019. New York: ACM, 2019: 1-7.
[71] Cao W. Survey on automatic physical database design[J]. Application Research of Computers, 2012(5): 12-18. 曹巍. 数据库物理自调优研究技术综述[J]. 计算机应用研究, 2012(5): 12-18.
[72] Lu J, Chen Y, Herodotou H, et al. Speedup your analytics: automatic parameter tuning for databases and big data systems[J]. Proceedings of the VLDB Endowment, 2019, 12(12): 1970-1973.
[73] Rodd S F, Kulkarni U P. Adaptive tuning algorithm for performance tuning of database management system[J]. arXiv:1005.0972, 2010.
[74] Zheng C, Ding Z, Hu J. Self-tuning performance of database systems with neural network[C]//Proceedings of the 2014 International Conference on Intelligent Computing, Taiyuan, Aug 3-6, 2014. Berlin, Heidelberg: Springer, 2014: 1-12.
[75] Van Aken D, Pavlo A, Gordon G J, et al. Automatic database management system tuning through large-scale machine learning[C]//Proceedings of the 2017 ACM International Conference on Management of Data, Chicago, May 14-19, 2017. New York: ACM, 2017: 1009-1024.
[76] Scikit-learn documentation—factor analysis[EB/OL]. [2019-11-28]. http://scikit-learn.org/stable/modules/generated/sklearn. decomposition.FactorAnalysis.html.
[77] Scikit-learn documentation—KMeans[EB/OL]. [2019-11-28]. http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html.
[78] Tibshirani R. Regression shrinkage and selection via the lasso[J]. Journal of the Royal Statistical Society: Series B (Methodological), 1996, 58(1): 267-288.
[79] Rasmussen C E. Gaussian processes in machine learning[C]//LNCS 3176: Advanced Lectures on Machine Learning. Berlin, Heidelberg: Springer, 2003: 63-71.
[80] Zhang J, Liu Y, Zhou K, et al. An end-to-end automatic cloud database tuning system using deep reinforcement learning[C]//Proceedings of the 2019 International Conference on Management of Data, Amsterdam, Jun 30-Jul 5, 2019. New York: ACM, 2019: 415-432.
[81] Lillicrap T P, Hunt J J, Pritzel A, et al. Continuous control with deep reinforcement learning[J]. arXiv:1509.02971, 2015.
[82] Li G, Zhou X, Li S, et al. QTune: a query-aware database tuning system with deep reinforcement learning[J]. Procee-dings of the VLDB Endowment, 2019, 12(12): 2118-2130.
[83] Ma L, Van Aken D, Hefny A, et al. Query-based workload forecasting for self-driving database management systems[C]//Proceedings of the 2018 International Conference on Management of Data, Houston, Jun 10-15, 2018. New York: ACM, 2018: 631-645.
[84] Li G L, Zhou X H. Xuan Yuan: an AI-native database systems[J]. Journal of Software, 2020, 31(3): 831-844. 李国良, 周煊赫. 轩辕: AI原生数据库系统[J]. 软件学报, 2020, 31(3): 831-844.
[85] Oracle. Oracle??s autonomous database[EB/OL]. [2020-02-20]. https://www.oracle.com/database/autonomous-database. html.
[86] Huawei. Huawei database & storage product launch[EB/OL]. [2020-02-20]. http://e.huawei.com/topic/database-storage-launch2019/cn/.
[87] Carnegie Mellon University Database Group. Peloton database management system[EB/OL]. [2020-02-20]. http://pelotondb.org.

编辑推荐 0

Metrics

阅读次数

全文

296

HTML			PDF

最新录用	在线预览	正式出版	最新录用	在线预览	正式出版
0	0	0	0	0	296

来源	本网站	其他网站

次数	261	35
比例	88%	12%

摘要

405

最新录用	在线预览	正式出版

0	0	405

	来源	本网站

	次数	405
	比例	100%