🤖 AI Summary
This work addresses the challenge of efficiently deploying convolutional neural networks in resource-constrained embedded environments, where high computational and memory demands hinder practical application. The authors present a lightweight object detection system built on an FPGA implementation of YOLOv3-Tiny, incorporating algorithm-hardware co-optimization techniques including low-bit quantization, batch normalization folding, and lookup-table-based activation mapping. A pipelined architecture and an on-chip caching mechanism further minimize off-chip memory access. Evaluated on the ZYNQ-XC7Z035 platform, the system achieves an inference latency of 0.211 seconds, 75.58% faster than comparable designs, while delivering a power efficiency of 10.11 GOPS/W (at least 29.45% higher than comparable designs) and reducing hardware resource utilization by up to 51.94%.
📝 Abstract
Computational complexity and storage requirements are crucial factors influencing the performance and efficiency of convolutional neural networks (CNNs) in resource-constrained environments. This paper presents a high-performance embedded target detection system based on FPGA and YOLOv3-Tiny, specifically designed for embedded artificial intelligence applications. By integrating lightweight CNN optimization techniques with hardware accelerator design, it achieves significant improvements in both computational efficiency and resource utilization. Key optimizations, including low-bit quantization, batch normalization fusion, and table lookup mapping, reduce model parameters and computational complexity. Additionally, an FPGA hardware accelerator with a pipelined architecture is developed to enhance the efficiency of convolution operations, while modular design and on-chip cache optimization minimize off-chip data transmission. On the ZYNQ-XC7Z035 platform, the system achieves an inference latency of 0.211 seconds, outperforming comparable designs by 75.58% in speed. It achieves a power efficiency of 10.11 GOPS/W, surpassing comparable designs by at least 29.45%. Furthermore, hardware resource utilization is reduced by up to 51.94% compared to similar systems. This study offers design methodologies and practical application examples for the efficient deployment of deep learning models on embedded platforms.
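To make two of the optimizations named above more concrete, the following is a minimal Python sketch of batch normalization fusion and of a lookup-table activation. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the bit width, the activation function, or the table layout. The sketch assumes signed 8-bit activations, a leaky ReLU with slope 0.1 (the activation used in the reference YOLOv3-Tiny configuration), and identical input and output quantization scales. The function names `fold_batch_norm`, `build_leaky_relu_lut`, and `lut_activation` are hypothetical.

```python
import math


def fold_batch_norm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold per-channel batch-norm parameters into the preceding layer.

    weights: one flattened kernel (list of floats) per output channel.
    bias, gamma, beta, mean, var: one float per output channel.
    Returns (folded_weights, folded_bias) such that
    conv(x, folded_weights) + folded_bias == batch_norm(conv(x, weights) + bias).
    """
    folded_w, folded_b = [], []
    for w_c, b_c, g, bt, mu, v in zip(weights, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)              # per-channel BN scale
        folded_w.append([w * scale for w in w_c])   # scale every kernel weight
        folded_b.append((b_c - mu) * scale + bt)    # absorb mean and shift
    return folded_w, folded_b


def build_leaky_relu_lut(slope=0.1):
    """Precompute leaky ReLU for every signed 8-bit input (assumed format).

    Entry i holds the output for input value q = i - 128.
    """
    lut = []
    for q in range(-128, 128):
        y = q if q >= 0 else int(round(slope * q))
        lut.append(max(-128, min(127, y)))          # saturate to int8 range
    return lut


def lut_activation(q, lut):
    """Apply the activation with a single table read instead of a multiply."""
    return lut[q + 128]
```

After folding, inference needs no separate normalization stage, and the activation costs one memory read per value. For example, `lut_activation(-100, build_leaky_relu_lut())` reads the precomputed entry for the input -100 rather than multiplying by the slope at run time.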