Journal of Frontiers of Computer Science and Technology ›› 2023, Vol. 17 ›› Issue (1): 127-139.DOI: 10.3778/j.issn.1673-9418.2107046

• Theory·Algorithm •

Loop Invariant Code Motion Algorithm for Deep Learning Operators

LIANG Jiali, HUA Baojian, LYU Yashuai, SU Zhenyu   

  1. School of Software Engineering, University of Science and Technology of China, Hefei 230000, China
  2. Cambricon Technologies Corporation Limited, Beijing 100000, China
  • Online: 2023-01-01  Published: 2023-01-01




Abstract: TVM (tensor virtual machine) is a deep learning compiler that translates deep learning operators described in tensor expression into TVM IR (TVM intermediate representation) programs. After a series of operator-level optimizations on TVM IR, TVM generates target code for diverse hardware back-ends. Tensor expression, a domain-specific language for tensor computation, applies loop transformations to operators. These transformations produce complicated expressions inside nested loop statements, many of which are loop-invariant. However, in the context of deep learning applications, the traditional loop invariant code motion algorithm has severe limitations. First, it is difficult to determine whether hoisting a given invariant expression out of a loop yields any benefit. Second, it is difficult to detect loop-invariant expressions whose operands appear in different orders. Third, it cannot handle nested condition expressions well. Furthermore, it conflicts with optimizations performed by the target hardware compiler. These problems constrain the application of the loop invariant code motion technique. This paper proposes a new loop invariant code motion algorithm that heuristically takes the characteristics of deep learning applications into account. The algorithm normalizes the program by reordering expression operands and simplifying nested condition expressions. This paper also introduces a new cost model that evaluates the benefit of hoisting a given loop-invariant expression while fully considering the characteristics of TVM IR and the target hardware back-ends. The algorithm is implemented as a registered pass on the open-source compiler TVM version 0.7. To verify the effectiveness and correctness of the algorithm, experiments are conducted on a Tesla P4 GPU with the TVM TOPI benchmark, covering 27 operators and 511 test cases with different input sizes. Experimental results show that the algorithm improves the performance of 47.6% of the operators, with speedups of up to 40.0%.
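The two normalization ideas described above — canonicalizing the order of commutative operands so that syntactically different but semantically equal expressions match, and then testing whether an expression mentions any loop iteration variable — can be illustrated with a minimal sketch. This is a toy expression IR in Python, not TVM's actual API; all class and function names here are hypothetical and chosen for illustration only.

```python
# Sketch (NOT TVM's actual API): operand-order normalization plus
# loop-invariance detection on a toy expression IR.
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Const:
    value: int

@dataclass(frozen=True)
class BinOp:
    op: str      # '+', '*', ...
    lhs: object
    rhs: object

COMMUTATIVE = {'+', '*'}

def normalize(e):
    """Canonicalize operand order under commutative operators, so that
    e.g. (a * 4) and (4 * a) become the same tree."""
    if isinstance(e, BinOp):
        lhs, rhs = normalize(e.lhs), normalize(e.rhs)
        if e.op in COMMUTATIVE and repr(lhs) > repr(rhs):
            lhs, rhs = rhs, lhs   # deterministic ordering key
        return BinOp(e.op, lhs, rhs)
    return e

def is_invariant(e, loop_vars):
    """An expression is loop-invariant if it mentions no loop variable."""
    if isinstance(e, Var):
        return e.name not in loop_vars
    if isinstance(e, BinOp):
        return is_invariant(e.lhs, loop_vars) and is_invariant(e.rhs, loop_vars)
    return True   # constants are always invariant

# (a * 4) and (4 * a) normalize to the identical tree, and both are
# invariant with respect to the loop variable i:
e1 = normalize(BinOp('*', Var('a'), Const(4)))
e2 = normalize(BinOp('*', Const(4), Var('a')))
```

A real implementation on TVM IR would additionally consult the paper's cost model before hoisting, since (as the abstract notes) hoisting is not always profitable on the target back-end.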

Key words: deep learning compiler, domain specific language, loop invariant code motion, intermediate representation

