Bare-Metal RISC-V + NVDLA SoC for Efficient Deep Learning Inference

📅 2025-08-22
🤖 AI Summary
To address the low inference efficiency and high system overhead of deep learning on edge devices, this paper proposes a software-hardware co-optimized SoC architecture: a tightly coupled RISC-V processor (Codasip uRISC_V, 4-stage pipeline) integrated with an NVDLA hardware accelerator, implemented bare-metal on an AMD ZCU102 FPGA. Custom assembly code generation and OS-free execution eliminate scheduling and memory management overhead. The architecture supports deployment of LeNet-5, ResNet-18, and ResNet-50, achieving inference latencies of 4.8 ms, 16.2 ms, and 1.1 s, respectively, at 100 MHz. Key contributions are: (i) the first FPGA implementation of a low-overhead, tightly coupled RISC-V–NVDLA architecture; and (ii) experimental validation—under bare-metal execution—that high-energy-efficiency, low-latency edge inference is feasible, offering a scalable, open-source acceleration solution for resource-constrained scenarios.

📝 Abstract
This paper presents a novel System-on-Chip (SoC) architecture for accelerating complex deep learning models for edge computing applications through a combination of hardware and software optimisations. The hardware architecture tightly couples the open-source NVIDIA Deep Learning Accelerator (NVDLA) to a 32-bit, 4-stage pipelined RISC-V core from Codasip called uRISC_V. To offload model acceleration in software, our toolflow generates bare-metal application code (in assembly), overcoming the complex OS overheads of previous works that have explored similar architectures. This tightly coupled architecture and bare-metal flow leads to improvements in execution speed and storage efficiency, making it suitable for edge computing solutions. We evaluate the architecture on AMD's ZCU102 FPGA board using the NVDLA-small configuration and test the flow using the LeNet-5, ResNet-18 and ResNet-50 models. Our results show that these models can perform inference in 4.8 ms, 16.2 ms and 1.1 s respectively, at a system clock frequency of 100 MHz.
Problem

Research questions and friction points this paper is trying to address.

Accelerating deep learning models for edge computing
Overcoming OS overheads with bare-metal code
Improving execution speed and storage efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bare-metal RISC-V NVDLA SoC architecture
Hardware-software co-design optimizations
Assembly code generation eliminating OS overhead