Mitigating the Bandwidth Wall via Data-Streaming System-Accelerator Co-Design

📅 2026-03-19
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Transformer inference on hardware accelerators is often bottlenecked not by computational capacity but by paged data movement and interconnect bandwidth. This work proposes a system-accelerator co-design that replaces large on-chip SRAM with small caches and a paged streaming scheduler, enabling explicit overlap of computation and data transfer through a DMA-compute-DMA-out pipeline and 4 KB-tiled matrix multiplication on the loosely coupled systolic array MatrixFlow. Evaluated using an extended Gem5-AcceSys full-system simulation framework, the proposed approach achieves up to 22× speedup over a CPU-only baseline and outperforms existing loosely and tightly coupled accelerators by 5–8×. Notably, it attains 80% of the performance achievable with on-chip HBM while operating under standard PCIe host memory constraints.
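The payoff of the DMA-compute-DMA-out pipeline mentioned above can be illustrated with a simple back-of-envelope timing model (the function names and stage times below are illustrative assumptions, not measurements or code from the paper):

```python
# Hedged sketch: three-stage pipeline (DMA-in, compute, DMA-out) with
# double buffering, versus a fully serial schedule over the same tiles.

def serial_time(n_tiles, t_in, t_compute, t_out):
    # No overlap: every tile pays transfer-in + compute + transfer-out.
    return n_tiles * (t_in + t_compute + t_out)

def pipelined_time(n_tiles, t_in, t_compute, t_out):
    # With overlap, steady-state throughput is bound by the slowest stage;
    # only the first tile pays the full fill latency of the pipeline.
    bottleneck = max(t_in, t_compute, t_out)
    return (t_in + t_compute + t_out) + (n_tiles - 1) * bottleneck

# Example: a bandwidth-bound workload where DMA-in dominates.
n = 100
print(serial_time(n, 3.0, 2.0, 1.0))     # 600.0
print(pipelined_time(n, 3.0, 2.0, 1.0))  # 6.0 + 99 * 3.0 = 303.0
```

In this toy setting, overlap roughly halves end-to-end time, and the pipelined run degrades gracefully toward the interconnect limit `n_tiles * t_in` as DMA dominates, which matches the paper's claim that bandwidth, not MAC count, is the lever that matters.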

๐Ÿ“ Abstract
Transformers have revolutionized AI in natural language processing and computer vision, but their large computation and memory demands pose major challenges for hardware acceleration. In practice, end-to-end throughput is often limited by paged data movement and interconnect bandwidth rather than raw MAC count. This work proposes a unified system-accelerator co-design approach for transformer inference that jointly optimizes a matrix accelerator and its system integration through paged streaming dataflows and explicit overlap of compute and transfer. On the hardware side, we introduce MatrixFlow, a loosely coupled 16x16 systolic-array accelerator with a page-aligned block matrix multiplication method using 4 KB tiles, a small on-chip buffer of about 20 KB, and a pipelined schedule of DMA, compute, and DMA-out to utilize interconnect bandwidth efficiently. On the system side, we develop Gem5-AcceSys, an extension of the gem5 full-system simulator that explores standard interconnects such as PCIe and configurable memory hierarchies including Direct Memory, Direct Cache, and Device Memory modes with SMMU/TLB effects. We evaluate the co-design using gem5 simulations on representative transformer models including BERT and ViT across multiple data types and system setups. Results show up to 22x end-to-end speedup over a CPU-only baseline and 5x to 8x gains over state-of-the-art loosely and tightly coupled accelerators. We further show that a standard PCIe-based host-memory design can achieve about 80 percent of the performance of on-device HBM. Overall, paged streaming and pipeline overlap, rather than large local SRAMs, are the most effective levers for efficient transformer inference under realistic system constraints.
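The page-aligned block matrix multiplication described in the abstract can be sketched as follows. This is a minimal NumPy model under stated assumptions: a 32×32 float32 tile occupies exactly 4 KB, matching the paper's page-aligned tiles, but the function and stage comments are illustrative, not MatrixFlow's actual API.

```python
import numpy as np

TILE = 32  # 32 * 32 * 4 bytes (float32) = 4096 bytes = one 4 KB page-aligned tile

def tiled_matmul(A, B, tile=TILE):
    """Block matrix multiply over 4 KB tiles (illustrative sketch).

    Each inner step mirrors the streaming schedule: bring in one tile of A
    and one tile of B, multiply on the array, accumulate, and write the
    finished output tile back out.
    """
    n, m = A.shape
    m2, p = B.shape
    assert m == m2 and n % tile == 0 and m % tile == 0 and p % tile == 0
    C = np.zeros((n, p), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, p, tile):
            # Accumulator tile: small enough to live in a ~20 KB on-chip buffer.
            acc = np.zeros((tile, tile), dtype=A.dtype)
            for k in range(0, m, tile):
                a = A[i:i + tile, k:k + tile]  # stage 1: DMA-in A tile
                b = B[k:k + tile, j:j + tile]  # stage 1: DMA-in B tile
                acc += a @ b                   # stage 2: systolic-array compute
            C[i:i + tile, j:j + tile] = acc    # stage 3: DMA-out result tile
    return C

A = np.random.rand(64, 64).astype(np.float32)
B = np.random.rand(64, 64).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-4)
```

Because every transfer is a whole 4 KB page, DMA requests align with host page boundaries, which is what lets the scheduler stream tiles over PCIe at full burst size instead of issuing small, fragmented reads.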
Problem

Research questions and friction points this paper is trying to address.

Transformers
hardware acceleration
bandwidth bottleneck
data movement
system constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

paged streaming
system-accelerator co-design
transformer inference
bandwidth wall
pipelined compute-transfer overlap