[1] CHEN T, MOREAU T, JIANG Z, et al. TVM: an automated end-to-end optimizing compiler for deep learning[C]//Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation, Carlsbad, Oct 8-10, 2018. Berkeley: USENIX Association, 2018: 578-594.
[2] WEI R, SCHWARTZ L, ADVE V. DLVM: a modern compiler infrastructure for deep learning systems[J]. arXiv:1711.03016, 2017.
[3] VASILACHE N, ZINENKO O, THEODORIDIS T, et al. Tensor comprehensions: framework-agnostic high-performance machine learning abstractions[J]. arXiv:1802.04730, 2018.
[4] ROTEM N, FIX J, ABDULRASOOL S, et al. Glow: graph lowering compiler techniques for neural networks[J]. arXiv:1805.00907, 2018.
[5] CYPHERS S, BANSAL A K, BHIWANDIWALLA A, et al. Intel nGraph: an intermediate representation, compiler, and executor for deep learning[J]. arXiv:1801.08058, 2018.
[6] ROESCH J, LYUBOMIRSKY S, WEBER L, et al. Relay: a new IR for machine learning frameworks[C]//Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, Philadelphia, Jun 18-22, 2018. New York: ACM, 2018: 58-68.
[7] RAGAN-KELLEY J M. Decoupling algorithms from the organization of computation for high performance image processing[D]. Cambridge: Massachusetts Institute of Technology, 2014.
[8] RAGAN-KELLEY J, ADAMS A, PARIS S, et al. Decoupling algorithms from schedules for easy optimization of image processing pipelines[J]. ACM Transactions on Graphics, 2012, 31(4): 1-12.
[9] HUANG G, LIU Z, VAN DER MAATEN L, et al. Densely connected convolutional networks[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Washington: IEEE Computer Society, 2017: 2261-2269.
[10] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[C]//Advances in Neural Information Processing Systems 25, Lake Tahoe, Dec 3-6, 2012: 1097-1105.
[11] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[J]. arXiv:1409.1556, 2014.
[12] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun 27-30, 2016. Washington: IEEE Computer Society, 2016: 770-778.
[13] HOLEWINSKI J, POUCHET L N, SADAYAPPAN P. High-performance code generation for stencil computations on GPU architectures[C]//Proceedings of the 26th ACM International Conference on Supercomputing, Phoenix, Jun 25-29, 2012. New York: ACM, 2012: 311-320.
[14] AHO A V, SETHI R, ULLMAN J D. Compilers: principles, techniques, and tools[M]. Reading: Addison-Wesley, 1986.
[15] RAWAT P S, RASTELLO F, SUKUMARAN-RAJAM A, et al. Register optimizations for stencils on GPUs[C]//Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Vienna, Feb 24-28, 2018. New York: ACM, 2018: 168-182.
[16] BRIGGS P, COOPER K D. Effective partial redundancy elimination[J]. ACM SIGPLAN Notices, 1994, 29(6): 159-170.
[17] WU J, BELEVICH A, BENDERSKY E, et al. gpucc: an open-source GPGPU compiler[C]//Proceedings of the 2016 International Symposium on Code Generation and Optimization, Barcelona, 2016. New York: ACM, 2016: 105-116.
[18] LEE J, HUR C K, JUNG R, et al. Reconciling high-level optimizations and low-level code in LLVM[J]. Proceedings of the ACM on Programming Languages, 2018, 2: 1-28.
[19] RAGAN-KELLEY J, BARNES C, ADAMS A, et al. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines[C]//Proceedings of the 2013 ACM SIGPLAN Conference on Programming Language Design and Implementation, Seattle, Jun 16-19, 2013. New York: ACM, 2013: 519-530.
[20] DURSUN H, NOMURA K, WANG W, et al. In-core optimization of high-order stencil computations[C]//Proceedings of the 2009 International Conference on Parallel and Distributed Processing Techniques and Applications, Las Vegas, Jul 13-17, 2009: 533-538.
[21] DATTA K, MURPHY M, VOLKOV V, et al. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures[C]//Proceedings of the 2008 ACM/IEEE Conference on High Performance Computing, Austin, Nov 15-21, 2008. Piscataway: IEEE, 2008: 4.
[22] HAMMOUDA A, SIEGEL A R, SIEGEL S F. Dynamic barrier relaxations for explicit stencil computations: UDEL-CIS 2013/002[R]. 2013.
[23] CHEN T, DU Z, SUN N, et al. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning[J]. ACM SIGARCH Computer Architecture News, 2014, 42(1): 269-284.
[24] LIN D C. Compiler support for predicated execution in superscalar processors[D]. Urbana: University of Illinois at Urbana-Champaign, 1992.
[25] MAHLKE S A, LIN D C, CHEN W Y, et al. Effective compiler support for predicated execution using the hyperblock[C]//Proceedings of the 25th Annual International Symposium on Microarchitecture, Portland, Nov 1992. New York: ACM, 1992: 45-54.
[26] LATTNER C, ADVE V S. LLVM: a compilation framework for lifelong program analysis & transformation[C]//Proceedings of the 2nd IEEE/ACM International Symposium on Code Generation and Optimization, San Jose, Mar 20-24, 2004. Washington: IEEE Computer Society, 2004: 75-86.
[27] STEFFEN B, KNOOP J, RÜTHING O. Efficient code motion and an adaption to strength reduction[C]//LNCS 494: Proceedings of the 1991 International Joint Conference on Theory and Practice of Software Development, Brighton, Apr 8-12, 1991. Berlin, Heidelberg: Springer, 1991: 394-415.
[28] ROSEN B K, WEGMAN M N, ZADECK F K. Global value numbers and redundant computations[C]//Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, San Diego, Jan 10-13, 1988. New York: ACM, 1988: 12-27.
[29] ZHAO J, LI B J, NIE W, et al. AKG: automatic kernel generation for neural processing units using polyhedral transformations[C]//Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, Jun 20-25, 2021. New York: ACM, 2021: 1233-1248.
[30] VASILEV V S, LEGALOV A I. Loop-invariant optimization in the Pifagor language[J]. Automatic Control and Computer Sciences, 2018, 52(7): 843-849.
[31] SONG L T, KAVI K, CYTRON R. An unfolding-based loop optimization technique[C]//LNCS 2826: Proceedings of the 7th International Workshop on Software and Compilers for Embedded Systems, Vienna, Sep 24-26, 2003. Berlin, Heidelberg: Springer, 2003: 117-132.
[32] BACON D F, GRAHAM S L, SHARP O J. Compiler transformations for high-performance computing[J]. ACM Computing Surveys, 1994, 26(4): 345-420.