🤖 AI Summary
Existing FlashAttention relies on hand-tuned GPU optimizations and generalizes poorly across architectures; meanwhile, large language models (LLMs) struggle to generate high-performance attention kernels because they cannot model complex data flows and GPU-specific primitives. Method: We propose LLM-TL, a novel LLM-friendly thinking language that, for the first time, decouples attention kernel generation into high-level optimization logic and low-level implementation, enabling a two-stage TL-Code generation and translation framework. Contribution/Results: Our approach supports zero-shot hardware and data-type adaptation, automatically synthesizing optimized attention kernels across diverse GPUs (e.g., A100, RTX 8000, T4). Experiments show the generated kernels match FlashAttention's performance, achieve up to a 35.16× speedup over baseline implementations, and outperform cuDNN. Hardware adaptation time drops from months to minutes, establishing the first self-optimizing paradigm for attention operators.
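As a rough illustration of the two-stage framework, the sketch below separates the high-level TL reasoning step from the low-level translation step. The `llm.complete` interface, the prompt contents, and the helper name are placeholders assumed for exposition, not the paper's actual LLM-TL syntax or API.

```python
# Illustrative two-stage pipeline: (1) have the LLM reason in a high-level
# thinking language about how to tile attention, then (2) translate that
# plan into a concrete CUDA kernel for a specific GPU and data type.
# The prompts and the `llm.complete(prompt) -> str` interface are assumptions.

def generate_attention_kernel(llm, gpu_arch: str, dtype: str) -> str:
    # Stage 1: TL-Code generation -- optimization logic only (tiling,
    # online softmax, memory staging), no GPU-specific primitives yet.
    tl_code = llm.complete(
        "Describe, in the thinking language, a tiled attention kernel for "
        f"{dtype}: loop over K/V blocks, maintain a running row max and "
        "softmax denominator, and rescale the partial output at each step."
    )

    # Stage 2: translation -- map the TL program onto low-level primitives
    # (shared-memory sizes, warp shuffles, tensor-core ops) for the target GPU.
    cuda_src = llm.complete(
        "Translate this thinking-language program into CUDA tuned for "
        f"{gpu_arch}:\n{tl_code}"
    )
    return cuda_src
```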
📝 Abstract
The attention operator remains a critical performance bottleneck in large language models (LLMs), particularly in long-context scenarios. While FlashAttention is the most widely used and effective GPU-aware acceleration algorithm, it requires time-consuming, hardware-specific manual implementation, which limits its adaptability across GPU architectures. Existing LLMs have shown considerable promise in code generation, but they struggle to generate high-performance attention code. The key challenge is that they cannot comprehend the complex data flow and computation process of the attention operator, nor can they utilize low-level primitives to exploit GPU performance. To address this challenge, we propose an LLM-friendly Thinking Language (LLM-TL) that helps LLMs decouple the generation of high-level optimization logic from low-level GPU implementation and enhances their understanding of the attention operator. Together with a two-stage reasoning workflow, TL-Code generation and translation, LLMs can automatically generate FlashAttention implementations on diverse GPUs, establishing a self-optimizing paradigm for generating high-performance attention operators in attention-centric algorithms. Verified on A100, RTX 8000, and T4 GPUs, our method significantly outperforms vanilla LLMs, achieving speedups of up to 35.16x. Moreover, our method not only surpasses human-optimized libraries (cuDNN and the official library) in most scenarios but also extends support to previously unsupported hardware and data types, reducing development time from months to minutes compared with human experts.
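To make the "complex data flow" concrete, here is a minimal single-head NumPy sketch of the tiled online-softmax computation that FlashAttention performs in on-chip memory. The block size and variable names are illustrative, and the real kernels operate on shared memory and registers rather than NumPy arrays.

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Single-head attention computed block-by-block with an online softmax,
    mirroring the running-max/denominator data flow FlashAttention keeps on chip."""
    n_q, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((n_q, V.shape[1]))           # accumulated (unnormalized) output
    m = np.full(n_q, -np.inf)                 # running row-wise max of the logits
    l = np.zeros(n_q)                         # running row-wise softmax denominator

    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale                # logits for this K/V block only

        m_new = np.maximum(m, S.max(axis=1))  # updated row max
        alpha = np.exp(m - m_new)             # rescale factor for old partials
        P = np.exp(S - m_new[:, None])        # block-local exponentials

        l = alpha * l + P.sum(axis=1)
        O = alpha[:, None] * O + P @ Vb
        m = m_new

    return O / l[:, None]                     # normalize once at the end
```

The result matches the naive `softmax(Q @ K.T / sqrt(d)) @ V` up to floating-point error, but only one K/V block of logits is ever materialized at a time; reproducing this rescaling logic with GPU shared memory, warp-level primitives, and tensor cores is exactly the part that is hard to hand-write for each architecture.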