🤖 AI Summary
This work addresses the conflicting demands of high computational complexity, sub-millisecond real-time processing, and ultra-low power consumption (≤100 W) in 6G AI-Native radio access networks by proposing TensorPool, a domain-specific multicore processor architecture implemented in TSMC N7 technology. The design integrates 256 RISC-V (RV32IMAF) cores and 16 FP16 tensor engines with low-latency access to 4 MiB of L1 scratchpad memory for maximal data reuse. On AI-RAN tensor operations, the architecture sustains 3643 MACs per cycle (89% tensor-engine utilization), a 6× performance improvement and a 9.1× gain in energy-area efficiency (GOPS/W/mm²) over a core-only baseline without tensor acceleration. Additionally, 3D-stacking the computing blocks shortens the tensor-engine-to-L1-memory routing, reducing the footprint by 2.32× with no frequency degradation relative to a 2D implementation.
📝 Abstract
The upcoming integration of AI into the physical layer (PHY) of 6G radio access networks (RANs) will enable a higher quality of service in challenging transmission scenarios. However, deeply optimized AI-Native PHY models impose higher computational complexity than conventional baseband processing, challenging deployment under the sub-millisecond real-time constraints typical of modern PHYs. Additionally, following the extension to terahertz carriers, the upcoming densification of 6G cell sites further limits the power consumption of base stations, constraining the budget available for compute ($\leq$100 W). The flexibility required for long-term sustainability and the imperative energy-efficiency gains on the high-throughput tensor computations dominating AI-Native PHYs can both be achieved by domain specialization of many-core programmable baseband processors. Following this domain-specialization strategy, we present TensorPool, a cluster of 256 RISC-V (RV32IMAF) programmable cores accelerated by 16 tensor engines, each delivering 256 MACs/cycle (FP16), with low-latency access to 4 MiB of L1 scratchpad for maximal data reuse. Implemented in TSMC's N7 technology, TensorPool achieves 3643 MACs/cycle (89% tensor-unit utilization) on tensor operations for AI-RAN, 6$\times$ more than a core-only cluster without tensor acceleration, while simultaneously improving GOPS/W/mm$^2$ efficiency by 9.1$\times$. Further, we show that 3D-stacking the computing blocks of TensorPool to better unfold the tensor-engine-to-L1-memory routing provides a 2.32$\times$ footprint improvement with no frequency degradation compared to a 2D implementation.
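As a quick sanity check on the throughput figures above, the reported 89% tensor-unit utilization follows directly from the peak rate of the 16 engines at 256 MACs/cycle each. A minimal sketch (variable names are illustrative, not from the paper):

```python
# Back-of-envelope check of TensorPool's reported tensor throughput.
# All input figures are taken from the abstract.

NUM_TENSOR_ENGINES = 16          # FP16 tensor engines in the cluster
MACS_PER_ENGINE_PER_CYCLE = 256  # MACs/cycle delivered by each engine
ACHIEVED_MACS_PER_CYCLE = 3643   # measured on AI-RAN tensor operations

# Peak throughput of the cluster, ignoring memory stalls.
peak_macs_per_cycle = NUM_TENSOR_ENGINES * MACS_PER_ENGINE_PER_CYCLE  # 4096

# Fraction of the peak actually sustained.
utilization = ACHIEVED_MACS_PER_CYCLE / peak_macs_per_cycle

print(f"Peak: {peak_macs_per_cycle} MACs/cycle, utilization: {utilization:.0%}")
```

Running this reproduces the abstract's numbers: a 4096 MACs/cycle peak, of which 3643 sustained MACs/cycle is roughly 89%.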