A Dynamic Allocation Scheme for Adaptive Shared-Memory Mapping on Kilo-core RV Clusters for Attention-Based Model Deployment

📅 2025-08-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address multi-bank L1 memory access contention and NUMA effects induced by hierarchical interconnects in large-scale RISC-V clusters executing attention models, this paper proposes the Dynamic Allocation Scheme (DAS)—a synergistic approach combining runtime-configurable address remapping with NUMA-aware task scheduling and a lightweight unified memory allocator. Its core innovation is a hardware-implemented, programmable address remapping unit that adaptively optimizes data locality without modifying the software stack. Implemented in 12 nm FinFET technology, DAS achieves 5.67 ms per encoder layer for ViT-L/16 inference on a 1024-core cluster—1.94× faster than the fixed word-level interleaved baseline—with 0.81 PE utilization and <0.1% hardware overhead. This work presents the first data-layout–access co-optimization for attention computation at the thousand-core scale.
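To make the remapping idea concrete: under fixed word-level interleaving, accesses with a power-of-two stride equal to the bank count all land in the same L1 bank and serialize. A common remedy, sketched below, is to XOR-fold upper address bits into the bank index. This is an illustrative model only, not the paper's actual DAS hardware; the bank count and shuffle function are hypothetical parameters.

```python
def remap(addr: int, num_banks: int = 16) -> tuple[int, int]:
    """Map a word address to (bank, offset) in a multi-banked L1.

    Plain interleaving would use bank = addr % num_banks; XOR-folding
    the next group of address bits into the bank index breaks the
    pathological case where a stride of num_banks hits one bank.
    Hypothetical sketch, not the paper's remapping unit.
    """
    bank = addr % num_banks
    upper = (addr // num_banks) & (num_banks - 1)  # next log2(num_banks) bits
    offset = addr // num_banks                     # position within the bank
    return bank ^ upper, offset
```

With this shuffle, a stride-16 access pattern that would serialize on bank 0 under plain interleaving is spread across all 16 banks, while unit-stride accesses remain conflict-free.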

📝 Abstract
Attention-based models demand flexible hardware to manage diverse kernels with varying arithmetic intensities and memory access patterns. Large clusters with shared L1 memory, a common architectural pattern, struggle to fully utilize their processing elements (PEs) when scaled up due to reduced throughput in the hierarchical PE-to-L1 intra-cluster interconnect. This paper presents Dynamic Allocation Scheme (DAS), a runtime programmable address remapping hardware unit coupled with a unified memory allocator, designed to minimize data access contention of PEs onto the multi-banked L1. We evaluated DAS on an aggressively scaled-up 1024-PE RISC-V cluster with Non-Uniform Memory Access (NUMA) PE-to-L1 interconnect to demonstrate its potential for improving data locality in large parallel machine learning workloads. For a Vision Transformer (ViT)-L/16 model, each encoder layer executes in 5.67 ms, achieving a 1.94× speedup over the fixed word-level interleaved baseline with 0.81 PE utilization. Implemented in 12 nm FinFET technology, DAS incurs <0.1% area overhead.
Problem

Research questions and friction points this paper is trying to address.

Addresses PE-to-L1 memory access contention in kilo-core clusters
Optimizes data locality for large parallel machine learning workloads
Improves throughput in attention-based model deployment on NUMA systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Allocation Scheme for shared-memory mapping
Runtime programmable address remapping hardware
Unified memory allocator minimizes data contention
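The unified-allocator idea can be illustrated with a toy locality-aware bump allocator: each tile owns a slice of the shared L1, allocations default to the requesting PE's local slice, and spill to the nearest remote slice when it fills. All names, sizes, and the fallback policy here are hypothetical; this is a sketch of the general technique, not the paper's allocator.

```python
class TileAllocator:
    """Toy NUMA-aware bump allocator over a tiled shared L1.

    Hypothetical sketch: each of num_tiles tiles holds tile_bytes of
    L1; alloc() prefers the caller's local tile and falls back to
    increasingly remote tiles when the local slice is exhausted.
    """

    def __init__(self, num_tiles: int = 4, tile_bytes: int = 1024):
        self.num_tiles = num_tiles
        self.tile_bytes = tile_bytes
        self.offsets = [0] * num_tiles  # bump pointer per tile

    def alloc(self, pe_tile: int, size: int) -> int:
        # Try the PE's local tile first, then successively remote ones.
        for d in range(self.num_tiles):
            t = (pe_tile + d) % self.num_tiles
            if self.offsets[t] + size <= self.tile_bytes:
                base = t * self.tile_bytes + self.offsets[t]
                self.offsets[t] += size
                return base  # byte address in the flat L1 space
        raise MemoryError("shared L1 exhausted")
```

For example, a PE on tile 2 gets its first buffer at the base of tile 2's slice; once that slice is full, subsequent requests spill into tile 3's slice rather than failing.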
Bowen Wang
ETH Zürich, Zürich, Switzerland
Marco Bertuletti
PhD student, ETH Zurich
computer architectures, parallel programming, wireless communications
Yichao Zhang
ETH Zürich, Zürich, Switzerland
Victor J. B. Jung
ETH Zürich, Zürich, Switzerland
Luca Benini
ETH Zürich, Università di Bologna
Integrated Circuits, Computer Architecture, Embedded Systems, VLSI, Machine Learning