From Buffers to Registers: Unlocking Fine-Grained FlashAttention with Hybrid-Bonded 3D NPU Co-Design

📅 2026-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high on-chip SRAM access energy that becomes a performance bottleneck for Transformer models when processing long sequences. To overcome this limitation, the authors propose the 3D-Flow architecture together with the 3D-FlashAttention scheduling method, which uniquely integrates register-level vertical interconnects with fine-grained attention scheduling. Leveraging hybrid-bonded 3D stacking with sub-10-micron through-silicon vias (TSVs), the design enables direct cross-layer communication between processing element (PE) registers, establishing a bubble-free vertical dataflow that completely eliminates on-chip cache round-trips. By breaking the reliance on on-chip memory inherent in conventional 2D and 3D accelerators, the approach achieves 46%–93% energy reduction and delivers speedups of 1.4× to 7.6× on OPT and QWEN models.

📝 Abstract
Transformer-based models dominate modern AI workloads but exacerbate memory bottlenecks due to their quadratic attention complexity and ever-growing model sizes. Existing accelerators, such as Groq and Cerebras, mitigate off-chip traffic with large on-chip caches, while algorithmic innovations such as FlashAttention fuse operators to avoid materializing large attention matrices. However, as off-chip traffic decreases, our measurements show that on-chip SRAM accesses account for over 60% of energy in long-sequence workloads, making cache access the new bottleneck. We propose 3D-Flow, a hybrid-bonded, 3D-stacked spatial accelerator that enables register-to-register communication across vertically partitioned PE tiers. Unlike 2D multi-array architectures limited by NoC-based router-to-router transfers, 3D-Flow leverages sub-10 µm vertical TSVs to sustain cycle-level operator pipelining with minimal overhead. On top of this architecture, we design 3D-FlashAttention, a fine-grained scheduling method that balances latency across tiers, forming a bubble-free vertical dataflow without on-chip SRAM round-trips. Evaluations on Transformer workloads (OPT and QWEN models) show that our 3D spatial accelerator reduces energy consumption by 46%–93% and achieves 1.4×–7.6× speedups compared to state-of-the-art 2D and 3D designs.
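The abstract builds on FlashAttention's operator fusion: attention is computed over K/V tiles with an online softmax so the full N×N score matrix is never materialized. Below is a minimal NumPy sketch of that tiled computation (the baseline the paper's 3D-FlashAttention scheduling refines); the function name, tile size, and running-max/denominator bookkeeping are illustrative, not taken from the paper.

```python
import numpy as np

def tiled_attention(Q, K, V, tile=64):
    """FlashAttention-style attention: stream K/V in tiles with an online
    softmax, so only O(N * tile) scores are live at any time."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((N, V.shape[1]))
    m = np.full(N, -np.inf)   # running row-wise max of the scores
    l = np.zeros(N)           # running softmax denominator
    for s in range(0, K.shape[0], tile):
        Kt, Vt = K[s:s + tile], V[s:s + tile]
        scores = (Q @ Kt.T) * scale                 # (N, tile) partial scores
        m_new = np.maximum(m, scores.max(axis=1))
        p = np.exp(scores - m_new[:, None])         # tile probabilities, rescaled
        alpha = np.exp(m - m_new)                   # correction for prior partials
        l = l * alpha + p.sum(axis=1)
        out = out * alpha[:, None] + p @ Vt
        m = m_new
    return out / l[:, None]
```

The per-tile rescaling (`alpha`) is what lets the partial outputs be accumulated incrementally; in the paper's setting, each such tile step can be pipelined across PE tiers through register-to-register links instead of being staged in SRAM.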
Problem

Research questions and friction points this paper is trying to address.

memory bottleneck
on-chip SRAM access
Transformer models
attention mechanism
energy efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D-stacked NPU
Hybrid bonding
Register-to-register communication
FlashAttention
Vertical dataflow
Jinxin Yu
SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Yudong Pan
SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Mengdi Wang
Institute of Computing Technology, Chinese Academy of Sciences
accelerator architecture design; multi-core system
Huawei Li
Institute of Computing Technology, Chinese Academy of Sciences
computer engineering
Yinhe Han
SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Xiaowei Li
SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Ying Wang
Institute of Computing Technology, Chinese Academy of Sciences
Reliable Computer Architecture; VLSI design; Machine learning; Memory system