AI Summary
This work addresses the limitations of existing academic simulation tools, which struggle to accurately model emerging GPU features, such as the Tensor Memory Accelerator (TMA), and suffer from inaccuracies in DRAM traffic estimation, thereby hindering AI architecture research. We present the first cycle-accurate simulation of FlashAttention-3, establishing an end-to-end pipeline that integrates kernel instrumentation with high-fidelity emulation. By combining analytical modeling with cycle-level simulation, our approach faithfully captures modern GPGPU asynchronous pipeline characteristics, including warp specialization and TMA. Validated on the H800 platform, our simulator achieves a mean absolute percentage error of 5.7% and a maximum error of 12.7%, significantly improving simulation fidelity and uncovering the root causes of inaccuracies in prior analytical models.
Abstract
To efficiently support Large Language Models (LLMs), modern GPGPU architectures have introduced new features and programming paradigms, such as warp specialization. These features enable temporal overlap between the producer and consumer, as well as between matrix multiplication and activation function operations, substantially improving performance. To conduct effective AI infrastructure and computer architecture research, cycle-accurate simulators that support these new features, together with analytical models that faithfully capture workload characteristics, are essential.
However, existing academic tools provide limited support for these emerging requirements. Existing cycle-accurate simulators do not incorporate new NVIDIA GPU features, such as the Tensor Memory Accelerator (TMA), in a timely manner. Moreover, existing analytical models can misestimate DRAM traffic under certain configurations.
In this paper, we build a simulation pipeline that runs from FlashAttention-3 kernel instrumentation to cycle-accurate simulation. The simulator achieves a mean absolute percentage error (MAPE) of 5.7% and a maximum absolute percentage error of 12.7% when measured against an H800 GPU. We also provide a theoretical analysis of FlashAttention-3 and explain why existing analytical models can produce inaccurate traffic estimates.
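For readers unfamiliar with the two accuracy metrics quoted above, the following is a minimal sketch of how MAPE and maximum absolute percentage error are conventionally computed from paired simulated and measured values. The function names and the cycle counts are hypothetical illustrations, not the paper's evaluation code or data.

```python
def absolute_percentage_errors(simulated, measured):
    """Return the per-sample error |simulated - measured| / |measured| in percent."""
    if not measured or len(simulated) != len(measured):
        raise ValueError("need equal-length, non-empty sequences")
    return [100.0 * abs(s - m) / abs(m) for s, m in zip(simulated, measured)]


def mape(simulated, measured):
    """Mean absolute percentage error over all samples."""
    errors = absolute_percentage_errors(simulated, measured)
    return sum(errors) / len(errors)


def max_ape(simulated, measured):
    """Worst-case absolute percentage error over all samples."""
    return max(absolute_percentage_errors(simulated, measured))


# Hypothetical cycle counts for three kernel configurations (not from the paper).
measured_cycles = [1000.0, 2000.0, 4000.0]
simulated_cycles = [1050.0, 1900.0, 4000.0]

print(f"MAPE: {mape(simulated_cycles, measured_cycles):.2f}%")
print(f"Max APE: {max_ape(simulated_cycles, measured_cycles):.2f}%")
```

Reporting both values is useful because the mean can hide a single poorly modeled configuration, which the maximum exposes.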