Sim-FA: A Simulator Frontend for Asynchronous Pipelines

📅 2026-05-01
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the limitations of existing academic simulation tools, which struggle to model emerging GPU features such as the Tensor Memory Accelerator (TMA) and can misestimate DRAM traffic, hindering AI architecture research. We present the first cycle-accurate simulation of FlashAttention-3, establishing an end-to-end pipeline that runs from kernel instrumentation to cycle-level simulation. By combining analytical modeling with cycle-level simulation, our approach captures the asynchronous pipeline characteristics of modern GPGPUs, including warp specialization and TMA. Validated against the H800, the simulator achieves a mean absolute percentage error (MAPE) of 5.7% and a maximum absolute percentage error of 12.7%. The work also explains why prior analytical models can produce inaccurate traffic estimates.
📝 Abstract
To efficiently support Large Language Models (LLMs), modern GPGPU architectures have introduced new features and programming paradigms, such as warp specialization. These features enable temporal overlap between the producer and consumer, as well as between matrix multiplication and activation function operations, substantially improving performance. Effective AI infrastructure and computer architecture research therefore requires cycle-accurate simulators that support these new features, together with analytical models that faithfully capture workload characteristics. However, existing academic tools provide limited support for these emerging requirements. Existing cycle-accurate simulators do not incorporate new NVIDIA GPU features, such as the Tensor Memory Accelerator (TMA), in a timely manner. Moreover, existing analytical models can misestimate DRAM traffic under certain configurations. In this paper, we build a simulation pipeline from FlashAttention-3 kernel instrumentation to cycle-accurate simulation. The simulator achieves a mean absolute percentage error (MAPE) of 5.7% and a maximum absolute percentage error of 12.7% against the H800. We also provide a theoretical analysis of FlashAttention-3 and explain why existing analytical models can produce inaccurate traffic estimates.
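As an illustration of the two accuracy metrics quoted in the abstract, the sketch below computes a mean absolute percentage error and a maximum absolute percentage error from paired simulated and measured values. The function names and the cycle counts are hypothetical and chosen only for illustration; they are not taken from the paper or from its evaluation data.

```python
def absolute_percentage_errors(simulated, measured):
    """Return |sim - ref| / |ref| * 100 for each paired sample."""
    if len(simulated) != len(measured):
        raise ValueError("simulated and measured must have the same length")
    if not measured:
        raise ValueError("at least one sample is required")
    errors = []
    for sim, ref in zip(simulated, measured):
        if ref == 0:
            raise ValueError("measured reference value must be non-zero")
        errors.append(abs(sim - ref) / abs(ref) * 100.0)
    return errors


def mape(simulated, measured):
    """Mean absolute percentage error over all samples."""
    errors = absolute_percentage_errors(simulated, measured)
    return sum(errors) / len(errors)


def max_ape(simulated, measured):
    """Largest absolute percentage error over all samples."""
    return max(absolute_percentage_errors(simulated, measured))


# Hypothetical cycle counts for three kernel configurations
# (illustrative numbers only, not results from the paper).
measured_cycles = [1000.0, 2000.0, 4000.0]
simulated_cycles = [1050.0, 1900.0, 4400.0]

print(f"MAPE: {mape(simulated_cycles, measured_cycles):.2f}%")
print(f"Max APE: {max_ape(simulated_cycles, measured_cycles):.2f}%")
```

Reporting the maximum alongside the mean shows how far the worst single configuration deviates, which an average can hide.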
Problem

Research questions and friction points this paper is trying to address.

cycle-accurate simulation
GPGPU architecture
DRAM traffic estimation
warp specialization
Tensor Memory Accelerator
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sim-FA
cycle-accurate simulation
FlashAttention-3
Tensor Memory Accelerator
warp specialization