Hardware Acceleration of Number Theoretic Transform in zk-SNARK

doi:10.3778/j.issn.1673-9418.2211075

Abstract

Abstract: Zero-knowledge succinct non-interactive arguments of knowledge (zk-SNARKs) produce fixed-length proofs that can be verified quickly, which has greatly promoted the application of zero-knowledge proofs in areas such as digital signatures, blockchain, distributed storage, and outsourced computing. Proof generation, however, is extremely time-consuming and is invoked frequently, and the number theoretic transform (NTT) is one of its dominant operations. Existing general-purpose NTT hardware acceleration methods cannot meet the large-scale, large-bitwidth requirements of zk-SNARK. To address this problem, this paper proposes a multi-level pipelined hardware architecture for NTT. First, large-bitwidth modular arithmetic is optimized and a low-latency Montgomery modular multiplication unit is designed. Second, a large-scale NTT task is divided into small independent sub-tasks through two-dimensional partitioning, which improves the parallelism of the NTT computation; eliminating the data dependence among sub-tasks allows the sub-tasks to be pipelined. Third, a data reordering mechanism is applied between the successive rounds of butterfly operations within a sub-task, which effectively alleviates memory access requirements and enables pipelining across butterfly operations with different step sizes. The architecture scales flexibly with the on-chip resources of a field programmable gate array (FPGA), so it can be deployed on FPGAs of different sizes to obtain the maximum speedup. To balance computing efficiency and flexibility, the architecture is developed with high-level synthesis (HLS), and a complete heterogeneous acceleration system based on the OpenCL framework is implemented on the AMD Xilinx Alveo U50 card (UltraScale+ XCU50 FPGA). Experimental results show that the NTT module is 1.95 times faster than the one in PipeZK. When running bellman, a mainstream open-source zk-SNARK project, the accelerator achieves speedups of 27.98 times and 1.74 times over a single core and 12 cores of the AMD Ryzen 9 5900X, respectively, with energy-efficiency improvements of 6.9 times and 6 times, respectively.

Key words: field programmable gate array (FPGA), zero-knowledge succinct non-interactive arguments of knowledge (zk-SNARK), modular multiplication, number theoretic transform, hardware acceleration
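
The two arithmetic ideas named in the abstract, Montgomery modular multiplication and two-dimensional partitioning of a large NTT into independent sub-tasks, can be illustrated in software. The following is a minimal sketch under stated assumptions, not the hardware design of the paper: the toy modulus P = 257, the primitive root G = 3, the radix exponent K = 9, and all function names (mont_mul, ntt_naive, ntt_2d, and so on) are hypothetical choices made for illustration, and each sub-task is computed with a naive O(n^2) transform rather than with pipelined butterfly stages. A zk-SNARK field would use a modulus that is hundreds of bits wide.

```python
# Illustrative sketch only: toy parameters, not the paper's hardware design.
P = 257                      # toy prime modulus (assumption); P - 1 = 2**8
G = 3                        # 3 is a primitive root modulo 257
K = 9                        # Montgomery radix exponent, R = 2**K > P
MONT_MASK = (1 << K) - 1
N_PRIME = (-pow(P, -1, 1 << K)) % (1 << K)   # -P^{-1} mod R (Python 3.8+)


def to_mont(x):
    """Map x to Montgomery form x * R mod P."""
    return (x << K) % P


def mont_mul(x, y):
    """Montgomery product: for x, y < P returns x * y * R^{-1} mod P."""
    t = x * y
    m = ((t & MONT_MASK) * N_PRIME) & MONT_MASK   # makes t + m * P divisible by R
    u = (t + m * P) >> K                          # exact division by R, u < 2 * P
    return u - P if u >= P else u


def from_mont(x):
    """Leave Montgomery form: returns x * R^{-1} mod P."""
    return mont_mul(x, 1)


def root_of_unity(n):
    """Return an n-th root of unity modulo P (n must divide P - 1)."""
    assert (P - 1) % n == 0
    return pow(G, (P - 1) // n, P)


def ntt_naive(a, w):
    """Reference O(n^2) transform: A[k] = sum_j a[j] * w**(j * k) mod P."""
    n = len(a)
    return [sum(a[j] * pow(w, j * k, P) for j in range(n)) % P for k in range(n)]


def ntt_2d(a, rows, cols):
    """Same transform as ntt_naive, split into rows * cols independent sub-tasks."""
    n = rows * cols
    assert len(a) == n
    w = root_of_unity(n)
    w_col = pow(w, cols, P)   # root used by the size-`rows` column sub-tasks
    w_row = pow(w, rows, P)   # root used by the size-`cols` row sub-tasks

    # Step 1: `cols` independent column transforms of the row-major matrix.
    # Step 2: multiply entry (k1, j2) by the twiddle factor w**(j2 * k1).
    t = [[0] * cols for _ in range(rows)]
    for j2 in range(cols):
        column = [a[cols * j1 + j2] for j1 in range(rows)]
        col_out = ntt_naive(column, w_col)
        for k1 in range(rows):
            twiddle = to_mont(pow(w, j2 * k1, P))      # twiddle in Montgomery form
            t[k1][j2] = mont_mul(col_out[k1], twiddle)  # result is in normal form

    # Step 3: `rows` independent row transforms, written out transposed.
    out = [0] * n
    for k1 in range(rows):
        row_out = ntt_naive(t[k1], w_row)
        for k2 in range(cols):
            out[k1 + rows * k2] = row_out[k2]
    return out
```

In this sketch, ntt_2d(a, 4, 4) and ntt_2d(a, 2, 8) return the same list as ntt_naive(a, root_of_unity(16)) for any 16-element input with entries below P. No column transform depends on another column, and no row transform on another row, which is the property that lets sub-tasks be streamed through a pipeline. Multiplying a normal-form value by a twiddle factor stored in Montgomery form yields a normal-form product, so only the twiddle table needs converting.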

ZHAO Haixu, CHAI Zhilei, HUA Pengcheng, WANG Feng, DING Dong. Hardware Acceleration of Number Theoretic Transform in zk-SNARK[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(2): 538-552.

References

[1] GOLDWASSER S, MICALI S, RACKOFF C. The knowledge complexity of interactive proof-systems[M]//Providing Sound Foundations for Cryptography: on the Work of Shafi Goldwasser and Silvio Micali. New York: ACM, 2019: 203-225.
[2] BITANSKY N, CANETTI R, CHIESA A, et al. From extractable collision resistance to succinct non-interactive arguments of knowledge, and back again[C]//Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, Cambridge, Jan 8-10, 2012. New York: ACM, 2012: 326-349.
[3] BITANSKY N, CANETTI R, CHIESA A, et al. The hunting of the SNARK[J]. Journal of Cryptology, 2017, 30(4): 989-1066.
[4] BLUM M, FELDMAN P, MICALI S. Non-interactive zero-knowledge and its applications[M]//Providing Sound Foundations for Cryptography: on the Work of Shafi Goldwasser and Silvio Micali. New York: ACM, 2019: 329-349.
[5] LI W H, ZHANG Z Y, ZHOU Z B, et al. An overview on succinct non-interactive zero-knowledge proofs[J]. Journal of Cryptologic Research, 2022, 9(3): 379-447. (in Chinese)
[6] DELIGNAT-LAVAUD A, FOURNET C, KOHLWEISS M, et al. Cinderella: turning shabby X.509 certificates into elegant anonymous credentials with the magic of verifiable computation[C]//Proceedings of the 2016 IEEE Symposium on Security and Privacy, San Jose, May 23-25, 2016. Piscataway: IEEE, 2016: 235-254.
[7] SHAN J Y, GAO S. Research progress on theory of blockchains[J]. Journal of Cryptologic Research, 2018, 5(5): 484-500. (in Chinese)
[8] DANEZIS G, FOURNET C, KOHLWEISS M, et al. Pinocchio coin: building zerocoin from a succinct pairing-based proof system[C]//Proceedings of the 1st ACM Workshop on Language Support for Privacy-Enhancing Technologies, Berlin, Nov 4, 2013. New York: ACM, 2013: 27-30.
[9] SASSON E B, CHIESA A, GARMAN C, et al. Zerocash: decentralized anonymous payments from Bitcoin[C]//Proceedings of the 2014 IEEE Symposium on Security and Privacy, San Jose, May 18-21, 2014. Piscataway: IEEE, 2014: 459-474.
[10] HUANG H S, CHANG T S, WU J Y. A secure file sharing system based on IPFS and blockchain[C]//Proceedings of the 2020 2nd International Electronics Communication Conference, Kuala Lumpur, Aug 12-14, 2020. New York: ACM, 2020: 96-100.
[11] ZHANG Y, GENKIN D, KATZ J, et al. vSQL: verifying arbitrary SQL queries over dynamic outsourced databases[C]//Proceedings of the 2017 IEEE Symposium on Security and Privacy, San Jose, May 22-24, 2017. Piscataway: IEEE, 2017: 863-880.
[12] BENET J. IPFS-content addressed, versioned, P2P file system[J]. arXiv:1407.3561, 2014.
[13] ZHANG Y, WANG S, ZHANG X, et al. PipeZK: accelerating zero-knowledge proof with a pipelined architecture[C]//2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture, Spain, Jun 14-19, 2021. Piscataway: IEEE, 2021: 416-428.
[14] GROTH J, MALLER M. Snarky signatures: minimal signatures of knowledge from simulation-extractable SNARKs[C]//Proceedings of the 37th Annual International Cryptology Conference, Santa Barbara, Aug 20-24, 2017. Cham: Springer, 2017: 581-612.
[15] BEN-SASSON E, BENTOV I, HORESH Y, et al. Scalable, transparent, and post-quantum secure computational integrity[J]. Cryptology ePrint Archive, 2018.
[16] BOWE S, GABIZON A. Making Groth's zk-SNARK simulation extractable in the random oracle model[J]. Cryptology ePrint Archive, 2018.
[17] HUANG P, LIANG W J. A new ZK-SNARK protocol based on QAP[J]. Journal of South China University of Technology (Natural Science Edition), 2021, 49(1): 1-9. (in Chinese)
[18] XIE T, ZHANG J, ZHANG Y, et al. Libra: succinct zero-knowledge proofs with optimal prover computation[C]//Proceedings of the 39th Annual International Cryptology Conference, Santa Barbara, Aug 18-22, 2019. Cham: Springer, 2019: 733-764.
[19] MALLER M, BOWE S, KOHLWEISS M, et al. Sonic: zero-knowledge SNARKs from linear-size universal and updatable structured reference strings[C]//Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, London, Nov 11-15, 2019. New York: ACM, 2019: 2111-2128.
[20] HALEPLIDIS E, TSAKOULIS T, EL-KADY A, et al. Studying OpenCL-based number theoretic transform for heterogeneous platforms[C]//Proceedings of the 2021 24th Euromicro Conference on Digital System Design, Palermo, Sep 1-3, 2021. Piscataway: IEEE, 2021: 339-346.
[21] KIM S, LEE K, CHO W, et al. Hardware architecture of a number theoretic transform for a bootstrappable RNS-based homomorphic encryption scheme[C]//Proceedings of the 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines, Fayetteville, May 3-6, 2020. Piscataway: IEEE, 2020: 56-64.
[22] ÖZTÜRK E, DORÖZ Y, SAVAŞ E, et al. A custom accelerator for homomorphic encryption applications[J]. IEEE Transactions on Computers, 2016, 66(1): 3-16.
[23] ZHOU H K. Homomorphic encryption offloading and its application in privacy-preserving computing[J]. Journal of Chinese Computer Systems, 2021, 42(3): 595-600. (in Chinese)
[24] CHEN D D, MENTENS N, VERCAUTEREN F, et al. High-speed polynomial multiplication architecture for ring-LWE and SHE cryptosystems[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2014, 62(1): 157-166.
[25] KALES D, RAMACHER S, RECHBERGER C, et al. Efficient FPGA implementations of LowMC and picnic[C]//Proceedings of the Cryptographers’ Track at the RSA Conference, San Francisco, Feb 24-28, 2020. Cham: Springer, 2020: 417-441.
[26] AGRAWAL R, BU L, EHRET A, et al. Open-source FPGA implementation of post-quantum cryptographic hardware primitives[C]//Proceedings of the 2019 29th International Conference on Field Programmable Logic and Applications, Barcelona, Sep 8-12, 2019. Piscataway: IEEE, 2019: 211-217.
[27] MERT A C, ÖZTÜRK E, SAVAŞ E. Design and implementation of a fast and scalable NTT-based polynomial multiplier architecture[C]//Proceedings of the 2019 22nd Euromicro Conference on Digital System Design, Kallithea, Aug 28-30, 2019. Piscataway: IEEE, 2019: 253-260.
[28] RIAZI M S, LAINE K, PELTON B, et al. HEAX: an architecture for computing on encrypted data[C]//Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Mar 16-20, 2020. New York: ACM, 2020: 1295-1309.
[29] SHEN Y P, LIANG Y, ZHANG W. Hardware efficient fast Fourier transform architecture[J]. Journal of Xidian University, 2018, 45(3): 63-67. (in Chinese)
[30] XIE X, HUANG X M, SUN L, et al. FPGA design and implementation of large integer multiplier[J]. Journal of Electronics & Information Technology, 2019, 41(8): 1855-1860. (in Chinese)
[31] FILECOIN. Bellperson: GPU parallel acceleration for zk-SNARK[EB/OL]. (2020) [2022-10-20]. https://github.com/filecoin-project/bellperson.
[32] GROTH J. On the size of pairing-based non-interactive arguments[C]//Proceedings of the 35th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Vienna, May 8-12, 2016. Cham: Springer, 2016: 305-326.
[33] FILECOIN. Bellman: zk-SNARK library[EB/OL]. (2018) [2022-10-20]. https://github.com/zkcrypto/bellman.
[34] CUI X N, YANG J W, YE H, et al. Optimized design method on elliptic curve cryptography[J]. Journal of Xidian University, 2015, 42(1): 69-74. (in Chinese)
[35] MONTGOMERY P L. Modular multiplication without trial division[J]. Mathematics of Computation, 1985, 44(170): 519-521.
[36] ÖZTÜRK E. Modular multiplication algorithm suitable for low-latency circuit implementations[J]. Cryptology ePrint Archive, 2019.
[37] KARATSUBA A. Multiplication of multidigit numbers on automata[J]. Soviet Physics Doklady, 1963, 7: 595-596.
[38] CHOW G C T, EGURO K, LUK W, et al. A Karatsuba-based Montgomery multiplier[C]//Proceedings of the 2010 International Conference on Field Programmable Logic and Applications, Milano, Aug 31-Sep 2, 2010. Piscataway: IEEE, 2010: 434-437.
[39] SZE T W. Schönhage-Strassen algorithm with MapReduce for multiplying terabit integers[C]//Proceedings of the 2011 International Workshop on Symbolic-Numeric Computation, San Jose, Jun 7-9, 2012. New York: ACM, 2012: 54-62.
[40] KAWAMURA K, YANAGISAWA M, TOGAWA N. A loop structure optimization targeting high-level synthesis of fast number theoretic transform[C]//Proceedings of the 2018 19th International Symposium on Quality Electronic Design, Santa Clara, Mar 13-14, 2018. Piscataway: IEEE, 2018: 106-111.
[41] OZCAN E, AYSU A. High-level synthesis of number-theoretic transform: a case study for future cryptosystems[J]. IEEE Embedded Systems Letters, 2019, 12(4): 133-136.