AI Summary
Transformer inference on hardware accelerators is often bottlenecked not by computational capacity but by paged data movement and interconnect bandwidth. This work proposes a system-accelerator co-design that replaces large on-chip SRAM with small caches and a paged streaming scheduler, enabling explicit overlap of computation and data transfer through a DMA-compute-DMA-out pipeline and 4 KB-tiled matrix multiplication on the loosely coupled systolic array MatrixFlow. Evaluated using an extended Gem5-AcceSys full-system simulation framework, the proposed approach achieves up to 22× speedup over a CPU-only baseline and outperforms existing loosely and tightly coupled accelerators by 5–8×. Notably, it attains 80% of the performance achievable with on-chip HBM while operating under standard PCIe host memory constraints.
Abstract
Transformers have revolutionized AI in natural language processing and computer vision, but their large computation and memory demands pose major challenges for hardware acceleration. In practice, end-to-end throughput is often limited by paged data movement and interconnect bandwidth rather than raw MAC count. This work proposes a unified system-accelerator co-design approach for transformer inference that jointly optimizes a matrix accelerator and its system integration through paged streaming dataflows and explicit overlap of compute and transfer. On the hardware side, we introduce MatrixFlow, a loosely coupled 16×16 systolic-array accelerator with a page-aligned block matrix multiplication method using 4 KB tiles, a small on-chip buffer of about 20 KB, and a pipelined schedule of DMA-in, compute, and DMA-out to utilize interconnect bandwidth efficiently. On the system side, we develop Gem5-AcceSys, an extension of the gem5 full-system simulator that explores standard interconnects such as PCIe and configurable memory hierarchies including Direct Memory, Direct Cache, and Device Memory modes with SMMU/TLB effects. We evaluate the co-design using gem5 simulations on representative transformer models including BERT and ViT across multiple data types and system setups. Results show up to 22× end-to-end speedup over a CPU-only baseline and 5–8× gains over state-of-the-art loosely and tightly coupled accelerators. We further show that a standard PCIe-based host-memory design can achieve about 80% of the performance of on-device HBM. Overall, paged streaming and pipeline overlap, rather than large local SRAMs, are the most effective levers for efficient transformer inference under realistic system constraints.
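To make the page-aligned blocking concrete, the sketch below shows one way a 4 KB tile size falls out of the data type: a 32×32 tile of fp32 values occupies exactly 4096 bytes, i.e. one memory page. The function and names here are illustrative assumptions, not the paper's actual MatrixFlow implementation; in hardware, the commented "DMA-in" and "DMA-out" phases would overlap with compute on neighboring tiles rather than run sequentially.

```python
TILE = 32  # 32 * 32 * 4 bytes (fp32) = 4096 bytes = one 4 KB page

def matmul_tiled(A, B, n):
    """Hypothetical page-aligned blocked matmul of two n x n matrices
    (lists of lists), iterating over 4 KB-sized tiles."""
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, TILE):
        for j0 in range(0, n, TILE):
            for k0 in range(0, n, TILE):
                # "DMA-in": in the accelerator, fetching the A and B tiles
                # would be a DMA transfer overlapped with compute on the
                # previously fetched tiles.
                for i in range(i0, min(i0 + TILE, n)):
                    for k in range(k0, min(k0 + TILE, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + TILE, n)):
                            C[i][j] += a * B[k][j]
            # "DMA-out": the finished C tile would be streamed back to host
            # memory while the next tile's inputs are being fetched.
    return C
```

Because each tile is a whole page, every DMA burst stays page-aligned, which is what lets a small (~20 KB) on-chip buffer holding a few in-flight tiles substitute for a large local SRAM.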