TeraPool: A Physical Design Aware, 1024 RISC-V Cores Shared-L1-Memory Scaled-up Cluster Design with High Bandwidth Main Memory Link

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a physically realizable, ultra-scale shared-L1-memory cluster architecture that integrates 1024 floating-point RISC-V processing elements (PEs) to overcome the power and performance bottlenecks of conventional massively parallel architectures, which are constrained by data partitioning, data movement across memory hierarchies, and high-latency interconnects. The design features a low-latency hierarchical interconnect (1–11 cycles) enabling access to a multi-megabyte, 4000+-bank L1 memory system and incorporates an HBM2E-compatible high-bandwidth main memory interface. For the first time, it achieves a thousand-core shared L1 memory while circumventing the area and power limitations of full-crossbar interconnects, quadrupling the PE count reported in prior literature. Implemented in 12 nm FinFET technology, the system delivers a peak single-precision performance of 1.89 TFLOP/s at 910 MHz with an energy efficiency of 200 GFLOP/s/W, and memory-access energy as low as 9–13.5 pJ, approaching the cost of an FP32 FMA operation.
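The summary's energy figures can be cross-checked with a line of arithmetic: dividing each endpoint of the 9–13.5 pJ access-energy range by the corresponding endpoint of the 0.74–1.1x FMA ratio implies the same FP32 FMA cost of roughly 12 pJ. This is a derived consistency check, not a number stated in the summary:

```python
# Cross-check of the reported figures (derived, not stated in the summary):
# access energy divided by the FMA ratio should give a consistent FMA cost.
access_pj = (9.0, 13.5)     # memory-access energy range, pJ
ratio_to_fma = (0.74, 1.1)  # reported cost relative to an FP32 FMA

implied_fma_pj = [e / r for e, r in zip(access_pj, ratio_to_fma)]

# Both endpoints imply an FP32 FMA energy of about 12.2 pJ.
assert all(abs(v - 12.2) < 0.1 for v in implied_fma_pj)
```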

📝 Abstract
Shared-L1-memory clusters of streamlined instruction processors (processing elements, PEs) are commonly used as building blocks in modern, massively parallel computing architectures (e.g., GP-GPUs). Scaling out these architectures by increasing the number of clusters incurs computational and power overhead, caused by the need to split and merge large data structures into chunks and move the chunks across memory hierarchies via the high-latency global interconnect. Scaling up the cluster instead reduces buffering, copy, and synchronization overheads. However, the complexity of a fully connected cores-to-L1-memory crossbar grows quadratically with the PE count, posing a major physical implementation challenge. We present TeraPool, a physically implementable scaled-up cluster design with more than 1000 floating-point-capable RISC-V PEs sharing a multi-megabyte, >4000-banked L1 memory via a low-latency hierarchical interconnect (1-7/9/11 cycles, depending on target frequency). Implemented in 12 nm FinFET technology, TeraPool achieves a near-gigahertz clock frequency (910 MHz at the typical corner, 0.80 V, 25 °C). The energy-efficient hierarchical PE-to-L1-memory interconnect consumes only 9-13.5 pJ per memory bank access, just 0.74-1.1x the cost of an FP32 FMA. A high-bandwidth main memory link manages data transfers in and out of the shared L1, sustaining transfers at the full bandwidth of an HBM2E main memory. At 910 MHz, the cluster delivers up to 1.89 single-precision TFLOP/s peak performance and up to 200 GFLOP/s/W energy efficiency (at a high average IPC/PE of 0.8) on benchmark kernels, demonstrating the feasibility of scaling a shared-L1 cluster to a thousand PEs, four times the PE count of the largest clusters reported in the literature.
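The quadratic-crossbar argument in the abstract can be made concrete with a back-of-the-envelope cross-point count. The two-level topology, tile size, and cost model below are illustrative assumptions for the sketch, not the paper's actual interconnect:

```python
def crossbar_crosspoints(n_pes: int, n_banks: int) -> int:
    """Cross-points in a fully connected PE-to-bank crossbar: one
    switch per (PE, bank) pair, so cost grows quadratically when
    banks scale proportionally with the PE count."""
    return n_pes * n_banks


def hierarchical_crosspoints(n_pes: int, n_banks: int,
                             tile_size: int = 16) -> int:
    """Hypothetical two-level hierarchy: each tile of `tile_size` PEs
    keeps a small local crossbar to its slice of banks, and tiles
    communicate through a single tile-to-tile stage."""
    n_tiles = n_pes // tile_size
    banks_per_tile = n_banks // n_tiles
    local = n_tiles * (tile_size * banks_per_tile)  # per-tile crossbars
    global_stage = n_tiles * n_tiles                # tile-level stage
    return local + global_stage


# Quadrupling the cluster (256 -> 1024 PEs, banks scaled alongside)
# grows the flat crossbar 16x, while a hierarchical design stays far
# smaller at the 1024-PE / 4096-bank scale TeraPool targets.
flat_small = crossbar_crosspoints(256, 1024)
flat_large = crossbar_crosspoints(1024, 4096)
assert flat_large == 16 * flat_small
assert hierarchical_crosspoints(1024, 4096) < flat_large // 50
```

The point of the sketch is only the growth law: any fixed-radix hierarchy trades a constant extra latency (the 1-11 cycles above) for cross-point cost that scales far below the flat crossbar's product of PEs and banks.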
Problem

Research questions and friction points this paper is trying to address.

shared-L1-memory
scalability
crossbar complexity
massively parallel architectures
physical design
Innovation

Methods, ideas, or system contributions that make the work stand out.

shared-L1-memory
RISC-V
hierarchical interconnect
massively parallel architecture
energy-efficient computing
Yichao Zhang
Ph.D. Student, ETH Zurich
IC Design, RISC-V, Many-core, Vector Processing, B5G/6G
Marco Bertuletti
PhD student, ETH Zurich
computer architectures, parallel programming, wireless communications
Chi Zhang
Integrated Systems Laboratory (IIS), ETH Zurich, 8092 Zurich, Switzerland
Samuel Riedel
Integrated Systems Laboratory (IIS), ETH Zurich, 8092 Zurich, Switzerland
Diyou Shen
Integrated Systems Laboratory (IIS), ETH Zurich, 8092 Zurich, Switzerland
Bowen Wang
Integrated Systems Laboratory (IIS), ETH Zurich, 8092 Zurich, Switzerland
Alessandro Vanelli-Coralli
Full Professor of Telecommunications, University of Bologna, Italy
Telecommunication Systems, Wireless Communications, Satellite Communications
Luca Benini
ETH Zürich, Università di Bologna
Integrated Circuits, Computer Architecture, Embedded Systems, VLSI, Machine Learning