Journal of Frontiers of Computer Science and Technology ›› 2023, Vol. 17 ›› Issue (1): 127-139. DOI: 10.3778/j.issn.1673-9418.2107046

• Theory · Algorithm •


Loop Invariant Code Motion Algorithm for Deep Learning Operators

LIANG Jiali, HUA Baojian, LYU Yashuai, SU Zhenyu   

  1. School of Software Engineering, University of Science and Technology of China, Hefei 230000, China
  2. Cambricon Technologies Corporation Limited, Beijing 100000, China
  • Online: 2023-01-01  Published: 2023-01-01


Abstract: TVM (tensor virtual machine) is a deep learning compiler that translates deep learning operators defined in tensor expression, its domain-specific language for tensor computation, into TVM IR (TVM intermediate representation) programs, performs a series of operator-level optimizations on TVM IR, and generates code for diverse hardware back-ends. Tensor expression applies loop transformations to operators, and these transformations leave complicated expressions over loop iteration variables inside deeply nested loops; such expressions contain a large amount of loop-invariant code. However, in the context of deep learning applications, traditional loop invariant code motion algorithms have severe limitations. First, they cannot determine whether hoisting a given piece of invariant code out of a loop yields a net benefit. Second, they fail to detect loop-invariant expressions whose operands appear in different orders. Third, they cannot handle nested condition expressions well. Furthermore, they may conflict with the optimizations performed by the target hardware compiler. These problems prevent the traditional technique from being applied directly in a deep learning compiler. This paper proposes a new loop invariant code motion algorithm that combines a deep-learning-oriented cost function with heuristic strategies. Operating on the high-level intermediate representation of the deep learning compiler, the algorithm normalizes programs by reordering expression operands and simplifying nested condition expressions. To measure the benefit of the optimization, a new cost metric for invariant hoisting is introduced that takes the characteristics of both TVM IR and the target hardware back-ends into account. The algorithm and the cost function are implemented as a new optimization pass on the open-source compiler TVM, version 0.7. To evaluate the effectiveness of the algorithm, experiments are conducted on a Tesla P4 GPU (graphics processing unit) platform with 27 typical operators from the TVM TOPI test suite and 511 test cases of different input sizes. Experimental results show that the algorithm improves the performance of 47.6% of the operators, with a maximum speedup of more than 40.0%.
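To make the problem concrete, the following minimal sketch (assuming TVM 0.7's Python tensor-expression API; the elementwise operator and the split factor 32 are invented for illustration) shows how a schedule-level loop split introduces the kind of loop-invariant index arithmetic the paper targets:

import tvm
from tvm import te

n = 1024
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] * 2.0, name="B")

s = te.create_schedule(B.op)
# Splitting the loop axis makes the lowered IR index A and B with
# expressions of the form (i.outer * 32 + i.inner); the subexpression
# "i.outer * 32" is invariant across iterations of the inner loop.
xo, xi = s[B].split(B.op.axis[0], factor=32)

# Inspect the lowered TVM IR to see the loop-invariant subexpressions.
print(tvm.lower(s, [A, B], simple_mode=True))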
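For context on the implementation route, the sketch below shows only TVM 0.7's generic mechanism for injecting a custom TIR pass into the lowering pipeline; the pass body is an identity placeholder standing in for the paper's normalization, cost evaluation, and hoisting logic, not the authors' actual algorithm:

import tvm
from tvm import te

@tvm.tir.transform.prim_func_pass(opt_level=0)
def licm_pass(func, mod, ctx):
    # Placeholder: return the PrimFunc unchanged. A real implementation
    # would normalize operand order, simplify nested tir.if_then_else
    # expressions, evaluate the cost function, and hoist profitable
    # invariants out of the loop nest.
    return func

n = 1024
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] + 1.0, name="B")
s = te.create_schedule(B.op)

# "tir.add_lower_pass" takes (phase, pass) pairs; each pass runs after
# the given phase of the lowering pipeline.
with tvm.transform.PassContext(config={"tir.add_lower_pass": [(1, licm_pass)]}):
    print(tvm.lower(s, [A, B], simple_mode=True))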

Key words: deep learning compiler, domain-specific language, loop invariant code motion, intermediate representation