Journal of Frontiers of Computer Science and Technology (计算机科学与探索) ›› 2019, Vol. 13 ›› Issue (10): 1677-1693. DOI: 10.3778/j.issn.1673-9418.1903027

• High Performance Computing •

Design and Implementation of YOLOv2 Accelerator Based on Zynq7000 FPGA Heterogeneous Platform

CHEN Chen (陈辰), CHAI Zhilei (柴志雷), XIA Jun (夏珺)

  1. School of Internet of Things Engineering, Jiangnan University, Wuxi, Jiangsu 214122, China
  2. State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, Jiangsu 214125, China
  • Online: 2019-10-01  Published: 2019-10-15

Abstract: Convolutional neural networks (CNNs) are now widely used in computer vision tasks such as image classification and object detection. In the forward-inference stage, however, many practical applications impose low-latency requirements and strict power constraints. To address this, an FPGA (field-programmable gate array) CNN accelerator architecture with a single instruction, multiple data (SIMD) structure is designed and implemented, using optimization strategies such as parameter reordering and multi-channel data transmission. Taking the YOLOv2 object detection algorithm as an example, the complete flow for mapping a CNN model onto an FPGA is described. The performance and resource consumption of the accelerator are analyzed and modeled in depth; because the model accounts for the actual transfer delay, the gap between the accelerator's theoretical and measured latency is reduced. The input and output modules of the accelerator architecture are also improved, which effectively raises the actual utilization of bus bandwidth. Experimental results show that the design achieves 30.15 GOP/s on the Zedboard. Compared with a Xeon E5-2620 v4 CPU, it delivers 120.4 times the energy efficiency and 7.3 times the performance; compared with a dual-core ARM-A9 CPU, it delivers 86 times the energy efficiency and 112.9 times the performance.
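
The reported ratios can be unpacked with a little arithmetic. The sketch below is not taken from the paper. It assumes that energy efficiency is defined as throughput divided by power (GOP/s per watt) and that both performance ratios are measured against the same 30.15 GOP/s figure. The function name `implied_baseline` and the dictionary layout are illustrative only. Under those assumptions, dividing the FPGA throughput by the performance ratio gives each baseline's implied throughput, and dividing the energy-efficiency ratio by the performance ratio gives the baseline's power draw relative to the accelerator.

```python
# Figures quoted in the abstract.
FPGA_GOPS = 30.15
RATIOS = {
    "Xeon E5-2620 v4": {"energy_efficiency": 120.4, "performance": 7.3},
    "dual-core ARM-A9": {"energy_efficiency": 86.0, "performance": 112.9},
}


def implied_baseline(fpga_gops, perf_ratio, eff_ratio):
    """Return (baseline GOP/s, baseline power / FPGA power).

    Assumption (not stated in the abstract): energy efficiency is
    throughput / power, so
        eff_ratio = perf_ratio * (baseline_power / fpga_power).
    """
    baseline_gops = fpga_gops / perf_ratio
    power_ratio = eff_ratio / perf_ratio
    return baseline_gops, power_ratio


if __name__ == "__main__":
    for name, r in RATIOS.items():
        gops, power = implied_baseline(
            FPGA_GOPS, r["performance"], r["energy_efficiency"]
        )
        print(f"{name}: ~{gops:.2f} GOP/s, ~{power:.2f}x the FPGA power")
```

Under these assumptions the Xeon baseline works out to roughly 4.13 GOP/s at about 16.5 times the accelerator's power, while the ARM baseline works out to roughly 0.27 GOP/s at about 0.76 times the accelerator's power. These are derived estimates, not measurements reported by the authors.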

Key words: hardware accelerator, field-programmable gate array (FPGA), convolutional neural network (CNN), high-level synthesis