AI Summary
This work addresses the performance bottleneck of CuTile-based FlashAttention on NVIDIA's GB10 architecture, which stems primarily from L2 cache misses caused by non-local memory access patterns. The study identifies this issue as the dominant source of inefficiency and introduces a novel Sawtooth Wavefront Reordering technique that enhances cache locality by jointly optimizing wavefront scheduling and memory access ordering. Implemented within the CUDA and CuTile programming models, the proposed method reduces L2 cache misses by over 50% and achieves up to a 60% improvement in attention computation throughput on GB10 hardware. These results demonstrate a significant advance in efficient Transformer inference, offering a new direction for optimizing attention mechanisms on modern GPU architectures.
Abstract
High-performance attention kernels are essential for Large Language Models. This paper presents an analysis of the memory behavior of CuTile-based FlashAttention and a technique to improve its cache performance. In particular, our analysis on the NVIDIA GB10 (Grace Blackwell) identifies the main cause of L2 cache misses. Leveraging this insight, we introduce a new programming technique called Sawtooth Wavefront Reordering that reduces L2 misses. We validate it in both CUDA and CuTile, observing a 50% or greater reduction in L2 misses and up to a 60% increase in throughput on GB10.
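The abstract does not spell out how Sawtooth Wavefront Reordering maps work to the hardware, so the following is only a plausible illustration of the general idea behind sawtooth-style (serpentine) orderings: consecutive tiles in the schedule are kept spatially adjacent by reversing the traversal direction on alternate rows, so tiles issued back-to-back are more likely to share data resident in the L2 cache. The function name `sawtooth_order` and the row/column tile grid are assumptions for illustration, not the paper's actual scheduling code.

```python
def sawtooth_order(num_rows: int, num_cols: int) -> list[tuple[int, int]]:
    """Enumerate (row, col) tile indices in a serpentine ("sawtooth") order.

    Even-numbered rows are traversed left-to-right, odd-numbered rows
    right-to-left, so each tile in the schedule is adjacent to its
    predecessor -- unlike plain row-major order, where the jump from the
    end of one row to the start of the next breaks spatial locality.
    """
    order = []
    for r in range(num_rows):
        cols = range(num_cols) if r % 2 == 0 else range(num_cols - 1, -1, -1)
        for c in cols:
            order.append((r, c))
    return order


# Example: a 2x3 tile grid. Note the second row is walked in reverse,
# so (0, 2) is immediately followed by the neighboring tile (1, 2).
print(sawtooth_order(2, 3))  # [(0, 0), (0, 1), (0, 2), (1, 2), (1, 1), (1, 0)]
```

In a real kernel, a remapping like this would translate a linear block or wavefront index into the tile it processes; the paper's technique additionally co-optimizes memory access ordering within that schedule, which this sketch does not model.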