Journal of Frontiers of Computer Science and Technology ›› 2021, Vol. 15 ›› Issue (2): 315-326. DOI: 10.3778/j.issn.1673-9418.1912029

• Artificial Intelligence •

FPGA-Based Hardware Accelerator Design and Implementation of Oil Palm Detection

YUAN Ming, CHAI Zhilei, GAN Lin

  1. Engineering Research Center of Internet of Things Technology Applications, Ministry of Education, School of Internet of Things Engineering, Jiangnan University, Wuxi, Jiangsu 214122, China
  2. National Supercomputing Center in Wuxi, Wuxi, Jiangsu 214122, China
  3. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  • Online: 2021-02-01 Published: 2021-02-01

Abstract:

Deep-learning-based detection of oil palms in high-resolution remote sensing images suffers from limited accuracy and low detection efficiency. This paper addresses both problems from two directions: algorithm optimization and acceleration on a heterogeneous hardware platform. Taking the YOLOv3 object detection algorithm as an example, it expands feature selection and strengthens multi-scale feature fusion to improve the detection accuracy for oil palms in high-resolution images. During forward inference, many application scenarios demand high model performance under strict power consumption limits. To meet this requirement, the paper quantizes the weights to 8-bit integers and reuses the computational core, and on that basis designs an efficient SIMD-based convolution engine. The input module is accelerated as well: input images are reshaped and vectorized, then sent to the input module through a write queue, which improves bus bandwidth utilization. Experimental results show that the optimized model reaches an accuracy of 97.84% and achieves 1.4 TOPS on a heterogeneous hardware platform based on the Intel Arria 10 FPGA. Compared with the i9-9980XE CPU, it delivers 7.51 times the performance and 33.02 times the energy efficiency. Compared with the P40, Nvidia's dedicated inference accelerator, its energy efficiency is 1.2 times as high.

Key words: field-programmable gate array (FPGA), improved YOLOv3, oil palm, hardware accelerator
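
The abstract names 8-bit integer weight quantization but does not describe the scheme. As a rough illustration only, the sketch below assumes symmetric per-tensor quantization, which may differ from what the authors actually used. The function names `quantize_int8`, `dot_int8`, and `dequantize` are hypothetical, and quantizing the inputs in the same way as the weights is also an assumption made so that the multiply-accumulate stays in integers.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to signed 8-bit integers (assumed scheme)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Guard against an all-zero tensor, where any positive scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric int8 range.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dot_int8(q_weights: List[int], q_inputs: List[int]) -> int:
    """Integer multiply-accumulate, the basic operation inside a convolution."""
    return sum(w * x for w, x in zip(q_weights, q_inputs))


def dequantize(acc: int, w_scale: float, x_scale: float) -> float:
    """Map an integer accumulator back to an approximate real value."""
    return acc * w_scale * x_scale


if __name__ == "__main__":
    weights = [0.50, -1.00, 0.25, 0.75]   # illustrative values, not from the paper
    inputs = [1.00, 0.50, -0.50, 0.25]    # illustrative values, not from the paper
    qw, sw = quantize_int8(weights)
    qx, sx = quantize_int8(inputs)
    approx = dequantize(dot_int8(qw, qx), sw, sx)
    exact = sum(w * x for w, x in zip(weights, inputs))
    print(f"int8 result: {approx:.4f}, float result: {exact:.4f}")
```

The loop in `dot_int8` is scalar for readability. A SIMD-style engine such as the one the abstract describes would presumably evaluate many of these 8-bit multiply-accumulates in parallel, which is where the narrower operands help.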