Exploring the Efficiency of 3D-Stacked AI Chip Architecture for LLM Inference with Voxel

📅 2026-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the difficulty of evaluating the efficiency of 3D-stacked AI chips for large language model (LLM) inference, where the distributed architecture makes accurate performance assessment hard. To this end, we propose Voxel, a fast, compiler-aware, end-to-end simulation framework for software/hardware co-exploration. Voxel supports joint analysis of multiple design dimensions, including compute paradigms, tile-to-core and tensor-to-bank mappings, NoC topologies and link bandwidth, DRAM bank bandwidth, per-core SRAM capacity, and energy/thermal constraints. It does so through a programming interface that lets ML compilers customize model execution plans. Voxel's results are validated against an emulator on real silicon, and the study finds that end-to-end efficiency depends significantly on how tiles are mapped to AI cores and DRAM banks. We will open-source Voxel and the study results.

📝 Abstract

To overcome the well-known memory bottleneck of AI chips, 3D-stacked architectures that employ advanced packaging technology with high-density through-silicon via (TSV) pins have proven to be a promising solution. The 3D-stacked AI chip enables ultra-high memory bandwidth between compute and memory by stacking numerous DRAM banks atop many AI cores in a distributed manner. However, its unique distributed nature makes it hard to explore the efficiency of the 3D-stacked AI chip: we need to carefully consider multiple intertwined factors, ranging from the upper-level computing paradigm to machine learning (ML) compiler optimizations and the underlying hardware architecture. In this paper, we develop Voxel, a fast and compiler-aware end-to-end simulation framework to facilitate exploring the efficiency of 3D-stacked AI chips for large language model (LLM) inference. Voxel enables software/hardware co-exploration by employing a programming interface that allows ML compilers to customize the model execution plans. After validating the results of Voxel with an emulator on real silicon, we thoroughly examine the impact and correlation of different aspects of 3D-stacked AI chips, including state-of-the-art compute paradigms, tile-to-core mapping, tensor-to-bank mapping, NoC topologies and link bandwidth, DRAM bank bandwidth, per-core SRAM capacity, and energy/thermal constraints. Our findings show that the end-to-end efficiency of a 3D-stacked AI chip is determined not only by the interplay of these factors but also, to a significant degree, by the mappings from tiles to AI cores and DRAM banks. We report our findings throughout the paper, with the expectation that they will shed light on the development of the 3D-stacked AI chip ecosystem. We will open source Voxel and our study results for public research.
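To make the abstract's point about mappings concrete, here is a minimal toy sketch. It is not Voxel's interface or model. It assumes a hypothetical 4x4 mesh of AI cores with one DRAM bank stacked directly above each core, and counts the NoC hops between the core running a tile and the bank holding that tile's data. All names (`MESH`, `hops`, `total_hops`, and the two mappings) are invented for illustration.

```python
# Toy illustration only: none of these names or numbers come from Voxel.
from itertools import product

MESH = 4  # hypothetical 4x4 grid of AI cores, one DRAM bank stacked above each


def hops(a, b):
    """Manhattan distance between two (row, col) positions on a 2D mesh NoC."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def total_hops(tile_to_core, tile_to_bank):
    """Sum of mesh hops from each tile's core to the bank holding its data."""
    return sum(hops(tile_to_core[t], tile_to_bank[t]) for t in tile_to_core)


cores = list(product(range(MESH), range(MESH)))
tiles = range(len(cores))

# Tile-to-core mapping: tile i runs on core i in both scenarios.
tile_to_core = {t: cores[t] for t in tiles}

# Tensor-to-bank mapping A: each tile's data sits in the bank above its own core.
local_banks = {t: cores[t] for t in tiles}

# Tensor-to-bank mapping B: data is laid out in reversed order across the banks.
reversed_banks = {t: cores[len(cores) - 1 - t] for t in tiles}

local_cost = total_hops(tile_to_core, local_banks)
reversed_cost = total_hops(tile_to_core, reversed_banks)

print("local mapping hops:   ", local_cost)
print("reversed mapping hops:", reversed_cost)
```

With identical hardware and an identical tile-to-core mapping, changing only where the data lives moves this toy cost from zero hops to a nonzero total. A real simulator would also account for link bandwidth, contention, and bank bandwidth, which this sketch ignores.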

Problem

Research questions and friction points this paper is trying to address.

3D-stacked AI chip

LLM inference

memory bottleneck

hardware-software co-exploration

efficiency evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

3D-stacked AI chip

Voxel

LLM inference