🤖 AI Summary
To address the low inference efficiency and high system overhead of deep learning on edge devices, this paper proposes a hardware-software co-optimised SoC architecture: a tightly coupled RISC-V processor (Codasip uRISC_V, 4-stage pipeline) integrated with an NVDLA hardware accelerator, running bare-metal on an AMD ZCU102 FPGA. Custom assembly code generation and OS-free execution eliminate scheduling and memory-management overhead. The architecture supports deployment of LeNet-5, ResNet-18, and ResNet-50, achieving inference latencies of 4.8 ms, 16.2 ms, and 1.1 s, respectively, at 100 MHz. Key contributions are: (i) the first FPGA implementation of a low-overhead, tightly coupled RISC-V–NVDLA architecture; and (ii) experimental validation, under bare-metal execution, that energy-efficient, low-latency edge inference is feasible, offering a scalable, open-source acceleration solution for resource-constrained scenarios.
📝 Abstract
This paper presents a novel System-on-Chip (SoC) architecture for accelerating complex deep learning models in edge computing applications through a combination of hardware and software optimisations. The hardware architecture tightly couples the open-source NVIDIA Deep Learning Accelerator (NVDLA) to uRISC_V, a 32-bit, 4-stage pipelined RISC-V core from Codasip. On the software side, our toolflow generates bare-metal application code (in assembly) to drive the accelerator, avoiding the complex OS overheads of previous works that explored similar architectures. This tightly coupled architecture and bare-metal flow improves execution speed and storage efficiency, making it suitable for edge computing solutions. We evaluate the architecture on AMD's ZCU102 FPGA board using the NVDLA-small configuration, and test the flow with the LeNet-5, ResNet-18 and ResNet-50 models. Our results show that these models can perform inference in 4.8 ms, 16.2 ms and 1.1 s respectively, at a system clock frequency of 100 MHz.