计算机科学与探索

• Academic Research •

Design of a High-Energy-Efficiency CNN Accelerator

LA Chao, LI Miao, ZHANG Feng, ZHANG Cuiting

  1. GL-Microelectronics Technology Co., Ltd., Beijing 100190, China
    2. National ASIC Design Engineering Technology Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China

MSNAP: A High-Energy-Efficiency CNN Accelerator

LA Chao, LI Miao, ZHANG Feng, ZHANG Cuiting   

  1. GL-Microelectronics Technology Co., Ltd., Beijing 100190, China
    2. National ASIC Design Engineering Technology Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China

Abstract: Convolutional Neural Networks (CNN) are now widely used in image classification, object detection and recognition, and natural language understanding. As the complexity and scale of CNNs keep growing, hardware deployment faces great challenges, especially under the low-power, low-latency requirements of embedded applications; most existing platforms suffer from high power consumption and complex control. Targeting accelerator energy efficiency, this paper analyzes the key factors that determine system energy efficiency and, starting from scaling down computation precision and lowering the system frequency, studies a unified whole-network quantization method at extremely low bit-widths and designs a high-energy-efficiency CNN accelerator. Built on lightweight computation units with 1-bit weights and 4-bit activations, the accelerator forms a 128×128 spatially parallel acceleration array; the high spatial parallelism allows the whole system to run at a low clock frequency. Meanwhile, a weight-stationary, feature-map-broadcast dataflow effectively reduces the number of weight and feature-map data movements, lowering power consumption and improving the system's energy-efficiency ratio. The design was verified through a 22 nm tape-out. Results show that at 20 MHz the peak throughput reaches 10.54 TOPS (Tera Operations Per Second) and the energy efficiency reaches 64.317 TOPS/W, a 5× improvement over accelerators of the same type. Moreover, the deployed object-detection network achieves a detection rate of 60 FPS (Frames Per Second), fully meeting the needs of embedded applications.
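With weights quantized to a single bit, the multiply in each multiply-accumulate degenerates to a select-and-negate, i.e. a multiplexer in hardware. The following sketch illustrates this idea; the paper does not give its exact number encoding, so the assumption here that weights take values in {-1, +1} and activations are unsigned 4-bit integers is ours.

```python
def mac_1b4b(weights, activations):
    """Multiply-accumulate with 1-bit weights and 4-bit activations.

    With weights restricted to {-1, +1}, each 'multiplication' reduces
    to selecting the activation or its negation (a multiplexer in
    hardware); only the accumulation still needs an adder tree.
    """
    acc = 0
    for w, a in zip(weights, activations):
        assert w in (-1, 1) and 0 <= a < 16  # 1-bit weight, 4-bit activation
        acc += a if w == 1 else -a           # mux: select +a or -a
    return acc

# Example: dot product of a 4-element vector
print(mac_1b4b([1, -1, 1, 1], [3, 7, 2, 15]))  # 3 - 7 + 2 + 15 = 13
```

Because no multiplier array is needed per cell, a compute unit of this kind is small enough to replicate into a large spatially parallel array.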

Keywords: accelerator, Convolutional Neural Network (CNN), lightweight neuron computation unit (NCU), spatially parallel acceleration array (MSNAP), branch convolution quantization

Abstract: Recently, Convolutional Neural Networks (CNN) have been widely used in image classification, object detection and recognition, and natural language processing. As the complexity and scale of CNNs increase, hardware realization faces growing challenges, particularly for embedded systems with low power and low latency requirements. Foremost among these is the high throughput required to process hundreds of filters in high-dimensional convolutions. Although highly parallel compute arrays can meet this throughput requirement, the energy consumed by the huge amount of data movement and the resources consumed by such arrays remain unacceptable for embedded application scenarios, not to mention the added control complexity they bring. This paper aims to optimize accelerator energy efficiency by analyzing the key factors that determine it. Focusing on reducing the system's energy and resource consumption, this paper proposes a high-performance CNN accelerator, called MSNAP. It is realized with a 128×128 highly parallel compute array and is based on a unified quantization method applied to the entire network at extremely low bit-widths. To ensure that the input images and the last layer use the same bit-width as the middle layers, we adopt thermometer codes and a Branch Convolution Quantization method, which allows all memory to be integrated on-chip and makes a large-scale array easier to implement. MSNAP features an efficient, lightweight computation neuron composed of 1152 multiplicative cells. In addition to compressing memory storage, the unified quantization method simplifies multiplications to multiplexers, which drastically reduces resource consumption. A weight-stationary, data-parallel dataflow and an optimized pooling layer improve data utilization, minimizing the energy MSNAP spends on data movement.
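A thermometer code represents an integer as a unary run of ones, so input pixels can be fed to the same 1-bit datapath as the internal layers. The sketch below shows the generic encoding only; the paper's exact input-encoding scheme is not spelled out in the abstract, so the code width and layout here are illustrative assumptions.

```python
def thermometer_encode(x, width):
    """Encode integer x in [0, width] as a thermometer code:
    the lowest x positions are 1, the rest 0 (e.g. 3 -> [1,1,1,0] for width=4)."""
    assert 0 <= x <= width
    return [1 if i < x else 0 for i in range(width)]

def thermometer_decode(code):
    """Decode by counting ones; a valid code is a run of 1s followed by 0s."""
    return sum(code)

print(thermometer_encode(3, 4))          # [1, 1, 1, 0]
print(thermometer_decode([1, 1, 1, 0]))  # 3
```

The appeal for a binary-weight datapath is that every bit of a thermometer-coded input is itself a 1-bit value, so the first layer needs no special higher-precision hardware.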
To evaluate the efficiency of this design and demonstrate the highly parallel compute array, a CNN chip was fabricated in 22 nm CMOS. Experiments show that at a frequency of 20 MHz, the chip offers a peak throughput of 10.54 Tera Operations Per Second (TOPS) and an energy efficiency of up to 64.317 TOPS/W; the latter amounts to a 5x improvement over the previous benchmark on CIFAR-10. Meanwhile, the design runs YOLO at 60 frames per second (FPS), fully meeting the needs of embedded applications.
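As a quick consistency check on the reported figures, the peak throughput and energy efficiency imply a per-cycle operation count and a peak power draw. These derived values are ours, computed from the numbers in the abstract, not stated in the paper:

```python
peak_tops = 10.54        # reported peak throughput, TOPS
freq_hz = 20e6           # reported system clock, 20 MHz
eff_tops_per_w = 64.317  # reported energy efficiency, TOPS/W

# Operations completed per clock cycle implied by the peak throughput
ops_per_cycle = peak_tops * 1e12 / freq_hz
print(f"{ops_per_cycle:,.0f} ops/cycle")  # ~527,000

# Power implied at peak throughput
power_w = peak_tops / eff_tops_per_w
print(f"{power_w * 1000:.0f} mW")         # ~164 mW
```

Roughly 527,000 operations per cycle is consistent with a massively parallel spatial array running at a low clock, and an implied peak power well under a watt fits the stated embedded target.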

Key words: accelerator, Convolutional Neural Network (CNN), lightweight neuron computation unit (NCU), spatially parallel acceleration array (MSNAP), branch convolution quantization