Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (2): 538-552. DOI: 10.3778/j.issn.1673-9418.2211075

• Computer Architecture •

Hardware Acceleration of Number Theoretic Transform in zk-SNARK

ZHAO Haixu, CHAI Zhilei, HUA Pengcheng, WANG Feng, DING Dong   

  1. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu 214122, China
  2. School of Internet of Things Engineering, Jiangnan University, Wuxi, Jiangsu 214122, China
  3. Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Wuxi, Jiangsu 214122, China
  • Online: 2024-02-01    Published: 2024-02-01

Abstract: A zk-SNARK proof has a fixed length and can be verified quickly, which has promoted the application of zero-knowledge proofs in areas such as digital signatures, blockchain, distributed storage, and outsourced computing. However, proof generation is extremely time-consuming and is invoked frequently, so the number theoretic transform (NTT), one of the most time-consuming parts of proof generation, needs to be accelerated significantly. Existing general-purpose NTT hardware acceleration methods cannot meet the large-bitwidth, large-scale requirements of zk-SNARK. To address this issue, this paper proposes a highly pipelined NTT architecture. First, large-bitwidth modular arithmetic is optimized and a low-latency Montgomery modular multiplication unit is designed. Then, large-scale NTT tasks are divided into smaller sub-tasks through two-dimensional partitioning, which improves the parallelism of the NTT computation and eliminates data dependence among sub-tasks, enabling pipelining across sub-tasks. Finally, a data-reordering technique is applied between the rounds of butterfly operations within a sub-task, which effectively alleviates memory-access pressure and enables a bottom-level pipeline within each sub-task, across butterfly operations with different strides. The architecture scales flexibly to FPGAs of different sizes. The accelerator is prototyped on an AMD-Xilinx Alveo U50 card (UltraScale+ XCU50 FPGA); to balance computing efficiency and flexibility, the system is implemented with high-level synthesis (HLS) under the OpenCL framework. Evaluation shows that the NTT module is 1.95 times faster than the NTT module in PipeZK. When the accelerator is integrated into the well-known open-source ZKP project bellman, it achieves 27.98 and 1.74 times speedup and 6.9 and 6 times better energy efficiency than a single core and 12 cores of an AMD Ryzen 9 5900X, respectively.
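
To make the modular-arithmetic building block concrete, here is a minimal software sketch of Montgomery multiplication, the operation that the low-latency modular multiplication unit implements in hardware. It is our own illustration rather than the paper's pipelined design: the hardware unit splits the operands into words and overlaps partial reductions, while this one-shot REDC does not. The modulus shown is the BLS12-381 scalar-field prime that bellman's NTTs operate over, and the names redc, mont_mul, to_mont and from_mont are ours.

# BLS12-381 scalar-field modulus, the ~255-bit prime bellman's NTTs work over.
N = 0x73EDA753299D7D483339D80809A1D80553BDA402FFFE5BFEFFFFFFFF00000001
R = 1 << 256                     # Montgomery radix: a power of two above N
N_PRIME = (-pow(N, -1, R)) % R   # N' with N * N' == -1 (mod R); needs Python 3.8+
R2 = R * R % N                   # constant used to convert into Montgomery form

def redc(t):
    """Montgomery reduction: return t * R^-1 mod N, assuming 0 <= t < R * N."""
    m = (t * N_PRIME) & (R - 1)  # multiplication modulo R is just a mask
    u = (t + m * N) >> 256       # t + m*N is divisible by R, so this shift is exact
    return u - N if u >= N else u

def to_mont(x):
    """Map x into Montgomery form x * R mod N."""
    return redc(x * R2)

def from_mont(x_bar):
    """Map back from Montgomery form."""
    return redc(x_bar)

def mont_mul(a_bar, b_bar):
    """Multiply two values that are already in Montgomery form."""
    return redc(a_bar * b_bar)

if __name__ == "__main__":
    a, b = 12345678901234567890, 98765432109876543210
    assert from_mont(mont_mul(to_mont(a), to_mont(b))) == a * b % N
    print("Montgomery product matches a * b mod N")

The point of the Montgomery form is that each reduction needs only a bit-mask and a shift instead of a division by the 255-bit modulus, which is what makes a low-latency, deeply pipelined hardware multiplier practical.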
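
The two-dimensional partitioning can be sketched in the same spirit. The code below is an illustration under our own assumptions, not the paper's implementation: it uses the 64-bit NTT-friendly prime 2^64 - 2^32 + 1 and naive O(n^2) sub-NTTs in place of the large-bitwidth zk-SNARK field and the Montgomery-multiplier butterfly pipeline, and the function names are ours. It shows why the sub-tasks are independent: the column NTTs touch disjoint data, the row NTTs touch disjoint data, and only an element-wise twiddle multiplication sits between the two phases, so sub-tasks can be pipelined without data dependence.

P = 0xFFFFFFFF00000001           # 2^64 - 2^32 + 1, an NTT-friendly prime
G = 7                            # generator of the multiplicative group mod P

def root_of_unity(n):
    """Return a primitive n-th root of unity modulo P (n must divide P - 1)."""
    assert (P - 1) % n == 0
    return pow(G, (P - 1) // n, P)

def ntt(a, w):
    """Reference O(n^2) NTT of a, using the primitive len(a)-th root w."""
    n = len(a)
    return [sum(a[i] * pow(w, i * k, P) for i in range(n)) % P for k in range(n)]

def ntt_four_step(a, n1, n2):
    """Size n1*n2 NTT computed via the two-dimensional (four-step) decomposition."""
    n = n1 * n2
    assert len(a) == n
    w, w1, w2 = root_of_unity(n), root_of_unity(n1), root_of_unity(n2)

    # Phase 1: n2 independent column NTTs of size n1 (the independent sub-tasks).
    cols = [ntt([a[n2 * i1 + i2] for i1 in range(n1)], w1) for i2 in range(n2)]

    # Phase 2: element-wise twiddle-factor multiplication.
    for i2 in range(n2):
        for k1 in range(n1):
            cols[i2][k1] = cols[i2][k1] * pow(w, i2 * k1, P) % P

    # Phase 3: n1 independent row NTTs of size n2.
    rows = [ntt([cols[i2][k1] for i2 in range(n2)], w2) for k1 in range(n1)]

    # Phase 4: write the result out with stride n1 (an implicit transpose).
    out = [0] * n
    for k1 in range(n1):
        for k2 in range(n2):
            out[n1 * k2 + k1] = rows[k1][k2]
    return out

if __name__ == "__main__":
    import random
    a = [random.randrange(P) for _ in range(16)]
    assert ntt_four_step(a, 4, 4) == ntt(a, root_of_unity(16))
    print("four-step NTT matches the direct NTT")

In the hardware architecture, each size-n1 or size-n2 sub-NTT corresponds to one sub-task of the pipeline, and every modular multiplication in the sketch is the kind of operation handled by the Montgomery unit above.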

Key words: field programmable gate array (FPGA), zero-knowledge succinct non-interactive arguments of knowledge (zk-SNARK), modular multiplication, number theoretic transform, hardware acceleration
