Towards Zero-Stall Matrix Multiplication on Energy-Efficient RISC-V Clusters for Machine Learning Acceleration

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the matrix multiplication efficiency bottleneck in RISC-V processor clusters under ML workloads, caused by control-flow overhead and L1 memory bank conflicts, this paper proposes a stall-free microarchitecture. The method introduces two key innovations on top of specialized RISC-V instruction extensions: (1) "zero-overhead loop nests", a hardware mechanism that removes loop-handling instructions from nested loops; and (2) a "zero-conflict memory subsystem", built around a novel double-buffering-aware interconnect, that eliminates bank conflicts in the shared L1 memory. Crucially, the design preserves full programmability while achieving near-ideal compute utilization. Experimental evaluation demonstrates sustained utilization of 96.1%–99.4%, delivering 11% higher performance and 8% better energy efficiency than a state-of-the-art RISC-V cluster. Its energy efficiency is only 12% lower than that of a specialized domain-specific accelerator, while supporting a significantly wider range of general-purpose ML workloads.

📝 Abstract
The growing computational demands of machine learning (ML) workloads have driven the design of ML accelerators aiming at an optimal tradeoff between efficiency and flexibility. A widely explored architecture for flexible ML accelerators is based on clusters of lightweight instruction processors sharing multi-banked L1 memory, augmented with specialized instruction extensions for key ML-related computations, such as matrix multiplication (matmul). However, instruction extensions should be coupled with microarchitectural optimizations that remove inefficiencies due to control flow (loop handling) and memory access, without drastically increasing processor complexity. Moving from a state-of-the-art (SoA) ML accelerator cluster based on RISC-V processors, we propose a low-overhead optimized microarchitecture that eliminates these inefficiencies almost entirely while retaining programmability. We introduce "zero-overhead loop nests" to remove control overheads, and a "zero-conflict memory subsystem", leveraging a novel double-buffering-aware interconnect, to eliminate bank conflicts in L1 memory. With these enhancements, we attain near-ideal utilizations between 96.1% and 99.4%, achieving 11% performance and 8% energy efficiency improvements over the baseline SoA RISC-V cluster. We demonstrate comparable utilizations and performance to a specialized SoA accelerator, with only 12% difference in energy efficiency, while providing a fully-programmable general-purpose solution supporting a significantly wider range of workloads.
Problem

Research questions and friction points this paper is trying to address.

Eliminate control and memory inefficiencies in RISC-V ML accelerators
Achieve near-ideal utilization for matrix multiplication workloads
Balance energy efficiency with programmable flexibility for ML acceleration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-overhead loop nests for control
Zero-conflict memory subsystem design
Double-buffering-aware interconnect optimization
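To make the two hardware ideas concrete, the sketch below is a plain-C software analogue (not the paper's actual ISA extensions or microarchitecture): a tiled matmul in which each K-tile is prefetched into one of two ping-pong buffers while the other is consumed. This is the access pattern a double-buffering-aware interconnect is designed to serve without bank conflicts, and the innermost triple loop is the region a zero-overhead loop-nest mechanism would execute without branch or index-update instructions in the issue stream. All names and tile sizes here are illustrative assumptions.

```c
#include <string.h>

/* Illustrative sketch only -- a software analogue of the paper's
 * double-buffered tiled matmul, not its hardware implementation. */

#define N 8   /* matrix dimension (N x N) */
#define T 4   /* K-tile edge; N must be a multiple of T */

/* Copy a T x T tile of an N x N matrix, starting at (row0, col0),
 * into a linear buffer (stands in for a DMA transfer into L1). */
static void fetch_tile(const double *src, double *dst, int row0, int col0) {
    for (int i = 0; i < T; i++)
        for (int j = 0; j < T; j++)
            dst[i * T + j] = src[(row0 + i) * N + (col0 + j)];
}

void matmul_double_buffered(const double *A, const double *B, double *C) {
    double bufA[2][T * T], bufB[2][T * T];  /* ping-pong buffers */
    memset(C, 0, N * N * sizeof(double));

    for (int bi = 0; bi < N; bi += T) {
        for (int bj = 0; bj < N; bj += T) {
            int cur = 0;
            /* Prologue: fill buffer 0 with the first K-tiles. */
            fetch_tile(A, bufA[cur], bi, 0);
            fetch_tile(B, bufB[cur], 0, bj);
            for (int bk = 0; bk < N; bk += T) {
                int nxt = cur ^ 1;
                if (bk + T < N) {  /* prefetch the next K-tiles */
                    fetch_tile(A, bufA[nxt], bi, bk + T);
                    fetch_tile(B, bufB[nxt], bk + T, bj);
                }
                /* Innermost loop nest: the part a zero-overhead
                 * hardware loop would run with no control-flow
                 * instructions issued. */
                for (int i = 0; i < T; i++)
                    for (int j = 0; j < T; j++)
                        for (int k = 0; k < T; k++)
                            C[(bi + i) * N + (bj + j)] +=
                                bufA[cur][i * T + k] * bufB[cur][k * T + j];
                cur = nxt;  /* swap ping-pong buffers */
            }
        }
    }
}
```

In hardware, the prefetch and the compute loop overlap in time; the interconnect must then route the prefetch stream and the compute stream to disjoint L1 banks, which is the conflict the paper's memory subsystem is built to avoid.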
Luca Colagrande
PhD student, ETH Zurich
Computer Architecture · High-Performance Computing · Machine Learning · Integrated Circuits
Lorenzo Leone
Integrated Systems Laboratory (IIS), ETH Zurich, Zurich, Switzerland
Maximilian Coco
D-ITET, ETH Zurich, Zurich, Switzerland
Andrei Deaconeasa
D-ITET, ETH Zurich, Zurich, Switzerland
Luca Benini
ETH Zürich, Università di Bologna
Integrated Circuits · Computer Architecture · Embedded Systems · VLSI · Machine Learning