e-GPU: An Open-Source and Configurable RISC-V Graphic Processing Unit for TinyAI Applications

📅 2025-05-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
GPU parallelism remains largely untapped in ultra-low-power edge AI (TinyAI) devices. Method: this paper proposes e-GPU, an open-source, configurable RISC-V embedded GPU architecture, together with Tiny-OpenCL, a lightweight programming framework. It introduces the first parameterized RISC-V GPU microarchitecture tailored to TinyAI, featuring custom instruction extensions; designs a resource-efficient Tiny-OpenCL runtime and scheduling mechanism; and integrates the e-GPU with the X-HEEP heterogeneous platform to form an accelerated processing unit (APU), implemented in TSMC 16 nm. Results: at 300 MHz and 0.8 V, the APU operates within a 28 mW power budget at only a 2.5× area overhead over the baseline host, while achieving up to a 15.1× speed-up and a 3.1× reduction in energy consumption, demonstrating holistic optimization across area, power, and performance.

📝 Abstract
Graphics processing units (GPUs) excel at parallel processing, but remain largely unexplored in ultra-low-power edge devices (TinyAI) due to their power and area limitations, as well as the lack of suitable programming frameworks. To address these challenges, this work introduces embedded GPU (e-GPU), an open-source and configurable RISC-V GPU platform designed for TinyAI devices. Its extensive configurability enables area and power optimization, while a dedicated Tiny-OpenCL implementation provides a lightweight programming framework tailored to resource-constrained environments. To demonstrate its adaptability in real-world scenarios, we integrate the e-GPU with the eXtendible Heterogeneous Energy-Efficient Platform (X-HEEP) to realize an accelerated processing unit (APU) for TinyAI applications. Multiple instances of the proposed system, featuring varying e-GPU configurations, are implemented in TSMC's 16 nm SVT CMOS technology and are operated at 300 MHz and 0.8 V. Their area and leakage characteristics are analyzed to ensure alignment with TinyAI constraints. To assess both runtime overheads and application-level efficiency, we employ two benchmarks: General Matrix Multiply (GeMM) and bio-signal processing (TinyBio) workloads. The GeMM benchmark is used to quantify the scheduling overhead introduced by the Tiny-OpenCL framework. The results show that the delay becomes negligible for matrix sizes larger than 256×256 (or equivalent problem sizes). The TinyBio benchmark is then used to evaluate the performance and energy improvements over the baseline host. The results demonstrate that the high-range e-GPU configuration with 16 threads achieves up to a 15.1× speed-up and reduces energy consumption by up to 3.1×, while incurring only a 2.5× area overhead and operating within a 28 mW power budget.
Problem

Research questions and friction points this paper is trying to address.

Addressing power and area limitations for GPUs in TinyAI devices
Providing a configurable RISC-V GPU platform for edge applications
Enabling lightweight programming frameworks for resource-constrained environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source configurable RISC-V GPU platform
Lightweight Tiny-OpenCL programming framework
Energy-efficient TSMC 16 nm CMOS implementation