AI Summary
This work addresses the performance bottleneck of CuTile-based FlashAttention on NVIDIA's GB10 architecture, which stems primarily from L2 cache misses caused by non-local memory access patterns. The study identifies this issue as the dominant source of inefficiency and introduces a novel Sawtooth Wavefront Reordering technique that enhances cache locality by jointly optimizing wavefront scheduling and memory access ordering. Implemented within the CUDA and CuTile programming models, the proposed method reduces L2 cache misses by over 50% and achieves up to a 60% improvement in attention computation throughput on GB10 hardware. These results demonstrate a significant advance in efficient Transformer inference, offering a new direction for optimizing attention mechanisms on modern GPU architectures.
Abstract
High-performance attention kernels are essential for Large Language Models. This paper presents an analysis of the memory behavior of CuTile-based FlashAttention and a technique to improve its cache performance. In particular, our analysis on the NVIDIA GB10 (Grace Blackwell) identifies the main cause of L2 cache misses. Leveraging this insight, we introduce a new programming technique called Sawtooth Wavefront Reordering that reduces L2 misses. We validate it in both CUDA and CuTile, observing a 50% or greater reduction in L2 misses and up to a 60% increase in throughput on GB10.
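The abstract does not spell out how Sawtooth Wavefront Reordering maps work to the hardware, so the following is only a plausible illustration of the general idea behind sawtooth-style (serpentine) orderings: consecutive tiles in the schedule are kept spatially adjacent by reversing the traversal direction on alternate rows, so tiles issued back-to-back are more likely to share data resident in the L2 cache. The function name `sawtooth_order` and the row/column tile grid are assumptions for illustration, not the paper's actual scheduling code.

```python
def sawtooth_order(num_rows: int, num_cols: int) -> list[tuple[int, int]]:
    """Enumerate (row, col) tile indices in a serpentine ("sawtooth") order.

    Even-numbered rows are traversed left-to-right, odd-numbered rows
    right-to-left, so each tile in the schedule is adjacent to its
    predecessor -- unlike plain row-major order, where the jump from the
    end of one row to the start of the next breaks spatial locality.
    """
    order = []
    for r in range(num_rows):
        cols = range(num_cols) if r % 2 == 0 else range(num_cols - 1, -1, -1)
        for c in cols:
            order.append((r, c))
    return order


# Example: a 2x3 tile grid. Note the second row is walked in reverse,
# so (0, 2) is immediately followed by the neighboring tile (1, 2).
print(sawtooth_order(2, 3))  # [(0, 0), (0, 1), (0, 2), (1, 2), (1, 1), (1, 0)]
```

In a real kernel, a remapping like this would translate a linear block or wavefront index into the tile it processes; the paper's technique additionally co-optimizes memory access ordering within that schedule, which this sketch does not model.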