[1] ABADI M, BARHAM P, CHEN J, et al. TensorFlow: a system for large-scale machine learning[C]//Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, Savannah, Nov 2-4, 2016: 265-283.
[2] PASZKE A, GROSS S, MASSA F, et al. PyTorch: an imperative style, high-performance deep learning library[C]//Advances in Neural Information Processing Systems 32, Vancouver, Dec 8-14, 2019: 8024-8035.
[3] CHEN T, LI M, LI Y, et al. MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems[J]. arXiv:1512.01274, 2015.
[4] Baidu. PaddlePaddle, GitHub[EB/OL]. [2022-06-27]. https://github.com/PaddlePaddle/Paddle.
[5] LI M, LIU Y, LIU X, et al. The deep learning compiler: a comprehensive survey[J]. IEEE Transactions on Parallel and Distributed Systems, 2021, 32(3): 708-727.
[6] YANG B, ZHANG J, LI J, et al. PipeMare: asynchronous pipeline parallel DNN training[C]//Proceedings of Machine Learning and Systems 2021, Apr 5-9, 2021: 269-296.
[7] WANG G, WANG K, JIANG K, et al. Wavelet: efficient DNN training with tick-tock scheduling[C]//Proceedings of Machine Learning and Systems 2021, Apr 5-9, 2021: 696-710.
[8] LI S, HOEFLER T. Chimera: efficiently training large-scale neural networks with bidirectional pipelines[C]//Proceedings of the 2021 International Conference for High Performance Computing, Networking, Storage and Analysis, St. Louis, Nov 14-19, 2021: 27.
[9] HUANG T, LIN D L, LIN C X, et al. Taskflow: a general-purpose parallel and heterogeneous task programming system[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2022, 41(5): 1448-1452.
[10] NARAYANAN D, SANTHANAM K, KAZHAMIAKA F, et al. Heterogeneity-aware cluster scheduling policies for deep learning workloads[C]//Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation, Nov 4-6, 2020: 481-498.
[11] Google XLA Team. XLA: TensorFlow, compiled[EB/OL]. Google Developers Blog. [2022-06-24]. https://developers.googleblog.com/2017/03/xla-tensorflow-compiled.html.
[12] CHEN T, MOREAU T, JIANG Z, et al. TVM: an automated end-to-end optimizing compiler for deep learning[C]//Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation, Carlsbad, Oct 8-10, 2018: 578-594.
[13] ROTEM N, FIX J, ABDULRASOOL S, et al. Glow: graph lowering compiler techniques for neural networks[J]. arXiv:1805.00907, 2018.
[14] CYPHERS S, BANSAL A K, BHIWANDIWALLA A, et al. Intel nGraph: an intermediate representation, compiler, and executor for deep learning[J]. arXiv:1801.08058, 2018.
[15] Intel. OpenVINO toolkit[EB/OL]. [2022-06-24]. https://software.intel.com/content/www/us/en/develop/tools/openvino-toolkit.html.
[16] ROESCH J, LYUBOMIRSKY S, WEBER L, et al. Relay: a new IR for machine learning frameworks[C]//Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, Philadelphia, Jun 18-22, 2018: 58-68.
[17] VASILACHE N, ZINENKO O, THEODORIDIS T, et al. Tensor comprehensions: framework-agnostic high performance machine learning abstractions[J]. arXiv:1802.04730, 2018.
[18] MA L, XIE Z, YANG Z, et al. Rammer: enabling holistic deep learning compiler optimizations with rTasks[C]//Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation, Nov 4-6, 2020: 881-897.
[19] NIU W, GUAN J, WANG Y, et al. DNNFusion: accelerating deep neural networks execution with advanced operator fusion[C]//Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, Canada, Jun 20-25, 2021: 883-898.
[20] UNGER C, JIA Z H, WU W, et al. Unity: accelerating DNN training through joint optimization of algebraic transformations and parallelization[C]//Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, Carlsbad, Jul 11-13, 2022: 267-284.
[21] ZHAO T, HALL M, JOHANSEN H, et al. Improving communication by optimizing on-node data movement with data layout[C]//Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Feb 27-Mar 3, 2021: 304-317.
[22] YUAN J, LI X Q, CHENG C, et al. OneFlow: redesign the distributed deep learning framework from scratch[J]. arXiv:2110.15032, 2021.
[23] XIAO W, BHARDWAJ R, RAMJEE R, et al. Gandiva: introspective cluster scheduling for deep learning[C]//Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation, Carlsbad, Oct 8-10, 2018: 595-610.
[24] WANG S, GONZALEZ O J, ZHOU X, et al. An efficient and non-intrusive GPU scheduling framework for deep learning training systems[C]//Proceedings of the 2020 International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, Nov 9-19, 2020: 90.
[25] HUANG Y, CHENG Y, BAPNA A, et al. GPipe: efficient training of giant neural networks using pipeline parallelism[C]//Advances in Neural Information Processing Systems 32, Vancouver, Dec 8-14, 2019: 103-112.
[26] FAN S, RONG Y, MENG C, et al. DAPPLE: a pipelined data parallel approach for training large models[C]//Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Feb 27-Mar 3, 2021: 431-445.
[27] 常爽爽, 赵栩锋, 刘震宇, 等. 基于异构多核的多类型DAG任务的响应时间分析[J]. 计算机学报, 2020, 43(6): 1052-1068.
CHANG S S, ZHAO X F, LIU Z Y, et al. Response time analysis of typed DAG tasks on heterogeneous multi-cores[J]. Chinese Journal of Computers, 2020, 43(6): 1052-1068.
[28] JIA Z, PADON O, THOMAS J, et al. TASO: optimizing deep learning computation with automatic generation of graph substitutions[C]//Proceedings of the 27th ACM Symposium on Operating Systems Principles, Huntsville, Oct 27-30, 2019: 47-62.