Journal of Frontiers of Computer Science and Technology ›› 2025, Vol. 19 ›› Issue (9): 2520-2531. DOI: 10.3778/j.issn.1673-9418.2411020

• Practice · Applications •

Design of Energy-Efficient CNN Accelerator

LA Chao, LI Miao, ZHANG Feng, ZHANG Cuiting   

  1. Beijing GL-Microelectronics Technology Co., Ltd., Beijing 100190, China
  2. National ASIC Design Engineering Technology Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
  • Online: 2025-09-01 Published: 2025-09-01

高能效CNN加速器设计

喇超,李淼,张峰,张翠婷   

  1. 北京中科格励微科技有限公司,北京 100190
  2. 中国科学院 自动化研究所 国家专用集成电路设计工程技术研究中心,北京 100190

Abstract: Convolutional neural networks (CNNs) are now widely used in image classification, object detection and recognition, and natural language processing. As CNNs grow in complexity and scale, deploying them in hardware becomes increasingly difficult. Embedded applications in particular demand low power and low latency, yet most existing platforms suffer from high power consumption and complex control. This paper therefore targets accelerator energy efficiency and analyzes the key factors that determine it. Taking precision scaling and system frequency reduction as the starting points, it studies a unified quantization method that applies extremely low-bit weights to all layers of the network, and designs a high-energy-efficiency CNN accelerator called MSNAP (modified simple neural acceleration processor). The accelerator implements a 128×128 spatially parallel compute array built from lightweight neuron computation units (NCUs) with 1-bit weights and 4-bit activations. Because of this high spatial parallelism, the whole system can run at a low clock frequency. A weight-stationary, feature-map-broadcast dataflow further reduces the movement of weights and feature maps, which lowers power consumption and improves system energy efficiency. The design is validated with a chip fabricated in 22 nm CMOS (complementary metal-oxide-semiconductor). Experimental results show that at 20 MHz the chip delivers a peak throughput of 10.54 TOPS (tera operations per second) and an energy efficiency of 64.317 TOPS/W. On a CIFAR-10 classification network, the accelerator achieves a 5× improvement in energy efficiency over comparable accelerators. A deployed YOLO object detection network runs at 60 FPS, which fully meets the needs of embedded applications.
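The computation described in the abstract (1-bit weights, 4-bit activations, weights held in place while feature-map values are broadcast) can be illustrated with a small functional model. This is a minimal sketch under stated assumptions, not the MSNAP design: the abstract does not specify how the 1-bit weights are encoded, so the mapping of bit 1 to +1 and bit 0 to -1 is assumed here, and the names `ncu_dot` and `broadcast_layer` are hypothetical.

```python
from typing import List


def ncu_dot(weight_bits: List[int], activations: List[int]) -> int:
    """Dot product of 1-bit weights with unsigned 4-bit activations.

    Assumption (not stated in the abstract): weight bit 1 means +1 and
    weight bit 0 means -1, so each multiply reduces to an add or subtract.
    """
    if len(weight_bits) != len(activations):
        raise ValueError("weights and activations must have equal length")
    acc = 0
    for w, a in zip(weight_bits, activations):
        if w not in (0, 1):
            raise ValueError("weights must be single bits")
        if not 0 <= a <= 15:
            raise ValueError("activations must fit in 4 unsigned bits")
        acc += a if w else -a
    return acc


def broadcast_layer(weights: List[List[int]],
                    feature_stream: List[List[int]]) -> List[List[int]]:
    """Weight-stationary model: weights[n] stays with neuron n for the
    whole stream, and each activation vector is broadcast to every neuron."""
    outputs = []
    for x in feature_stream:  # one broadcast step per activation vector
        outputs.append([ncu_dot(w, x) for w in weights])
    return outputs


if __name__ == "__main__":
    resident_weights = [[1, 0, 1], [0, 0, 0]]  # loaded once, then reused
    stream = [[3, 5, 15], [0, 1, 2]]           # activation vectors
    print(broadcast_layer(resident_weights, stream))
```

In this model the weight lists are read once and reused for every activation vector, which is the property that a weight-stationary dataflow exploits to cut weight movement. The model says nothing about how the real array is partitioned or timed.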

Key words: accelerator, convolutional neural network (CNN), lightweight neuron computation unit (NCU), modified simple neural acceleration processor (MSNAP), branch convolution quantization (BCQ)
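The headline figures in the abstract can be related to one another with simple arithmetic. The sketch below is illustrative only. It assumes that the peak throughput and the energy efficiency refer to the same operating point, which the abstract does not state, and the function names are hypothetical. The per-unit figure is a plain division over the 128×128 array and makes no claim about the internal structure of an NCU.

```python
# Figures quoted in the abstract.
PEAK_TOPS = 10.54               # peak throughput, tera operations per second
EFFICIENCY_TOPS_PER_W = 64.317  # energy efficiency
CLOCK_HZ = 20e6                 # 20 MHz operating frequency
ARRAY_UNITS = 128 * 128         # size of the compute array


def implied_power_watts(tops: float, tops_per_watt: float) -> float:
    """Power implied by throughput / efficiency (assumes same operating point)."""
    return tops / tops_per_watt


def ops_per_cycle(tops: float, clock_hz: float) -> float:
    """Operations completed per clock cycle at the given throughput."""
    return tops * 1e12 / clock_hz


def ops_per_cycle_per_unit(tops: float, clock_hz: float, units: int) -> float:
    """Average operations per cycle attributed to each array position."""
    return ops_per_cycle(tops, clock_hz) / units


if __name__ == "__main__":
    print(f"implied power: {implied_power_watts(PEAK_TOPS, EFFICIENCY_TOPS_PER_W):.4f} W")
    print(f"ops per cycle: {ops_per_cycle(PEAK_TOPS, CLOCK_HZ):.0f}")
    print(f"ops per cycle per unit: "
          f"{ops_per_cycle_per_unit(PEAK_TOPS, CLOCK_HZ, ARRAY_UNITS):.2f}")
```

Under these assumptions the implied power is roughly 164 mW and the chip completes about 527,000 operations per cycle. This shows how a 20 MHz clock can still reach multi-TOPS throughput when the work is spread spatially rather than over a fast clock.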

摘要: 当前,卷积神经网络(CNN)被广泛应用于图片分类、目标检测与识别以及自然语言理解等领域。随着卷积神经网络的复杂度和规模不断增加,对硬件部署带来了极大的挑战,尤其是面对嵌入式应用领域的低功耗、低时延需求,大多数现有平台存在高功耗、控制复杂的问题。为此,以优化加速器能效为目标,对决定系统能效的关键因素进行分析,以缩放计算精度和降低系统频率为主要出发点,研究极低比特下全网络统一量化方法,设计一种高能效CNN加速器MSNAP。该加速器以1比特权重和4比特激活值的轻量化计算单元为基础,构建了128×128空间并行加速阵列结构,由于空间并行度高,整个系统采用低运行频率。同时,采用权重固定、特征图广播的数据传播方式,有效减少权重、特征图的数据搬移次数,达到降低功耗、提高系统能效比的目的。通过22 nm工艺流片验证,结果表明,在20 MHz频率下,峰值算力达到10.54 TOPS,能效比达到64.317 TOPS/W,相较同类型加速器在采用CIFAR-10数据集的分类网络中,该加速器能效比有5倍的提升。部署的目标检测网络YOLO能够达到60 FPS的检测速率,完全满足嵌入式应用需求。

关键词: 加速器, 卷积神经网络(CNN), 轻量化神经元计算单元(NCU), MSNAP, 分支卷积量化(BCQ)