🤖 AI Summary
Existing FlashAttention relies on hand-tuned GPU optimizations and generalizes poorly across architectures; meanwhile, large language models (LLMs) struggle to generate high-performance attention kernels because they cannot model complex data flows and GPU-specific primitives. Method: We propose LLM-TL, a novel LLM-friendly thinking language that, for the first time, decouples attention kernel generation into high-level optimization logic and low-level implementation, enabling a two-stage TL-Code generation and translation framework. Contribution/Results: Our approach supports zero-shot hardware and data-type adaptation, automatically synthesizing optimized attention kernels across diverse GPUs (e.g., A100, RTX 8000, T4). Experiments show the generated kernels match FlashAttention's performance, achieve up to a 35.16× speedup over baseline implementations, and outperform cuDNN. Hardware adaptation time drops from months to minutes, establishing the first self-optimizing paradigm for attention operators.
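As a rough illustration of the two-stage framework, the sketch below separates the high-level TL reasoning step from the low-level translation step. The `llm.complete` interface, the prompt contents, and the helper name are placeholders assumed for exposition, not the paper's actual LLM-TL syntax or API.

```python
# Illustrative two-stage pipeline: (1) have the LLM reason in a high-level
# thinking language about how to tile attention, then (2) translate that
# plan into a concrete CUDA kernel for a specific GPU and data type.
# The prompts and the `llm.complete(prompt) -> str` interface are assumptions.

def generate_attention_kernel(llm, gpu_arch: str, dtype: str) -> str:
    # Stage 1: TL-Code generation -- optimization logic only (tiling,
    # online softmax, memory staging), no GPU-specific primitives yet.
    tl_code = llm.complete(
        "Describe, in the thinking language, a tiled attention kernel for "
        f"{dtype}: loop over K/V blocks, maintain a running row max and "
        "softmax denominator, and rescale the partial output at each step."
    )

    # Stage 2: translation -- map the TL program onto low-level primitives
    # (shared-memory sizes, warp shuffles, tensor-core ops) for the target GPU.
    cuda_src = llm.complete(
        "Translate this thinking-language program into CUDA tuned for "
        f"{gpu_arch}:\n{tl_code}"
    )
    return cuda_src
```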
📝 Abstract
The attention operator remains a critical performance bottleneck in large language models (LLMs), particularly in long-context scenarios. While FlashAttention is the most widely used and effective GPU-aware acceleration algorithm, it requires time-consuming, hardware-specific manual implementation, which limits its adaptability across GPU architectures. Existing LLMs have shown considerable promise in code generation, but they struggle to generate high-performance attention code. The key challenge is that they cannot comprehend the complex data flow and computation process of the attention operator, nor can they utilize low-level primitives to exploit GPU performance. To address this challenge, we propose an LLM-friendly Thinking Language (LLM-TL) that helps LLMs decouple the generation of high-level optimization logic from low-level GPU implementation and enhances their understanding of the attention operator. Together with a two-stage reasoning workflow, TL-Code generation and translation, LLMs can automatically generate FlashAttention implementations on diverse GPUs, establishing a self-optimizing paradigm for generating high-performance attention operators in attention-centric algorithms. Verified on A100, RTX 8000, and T4 GPUs, our method significantly outperforms vanilla LLMs, achieving speedups of up to 35.16x. Moreover, our method not only surpasses human-optimized libraries (cuDNN and the official library) in most scenarios but also extends support to previously unsupported hardware and data types, reducing development time from months to minutes compared with human experts.
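To make the "complex data flow" concrete, here is a minimal single-head NumPy sketch of the tiled online-softmax computation that FlashAttention performs in on-chip memory. The block size and variable names are illustrative, and the real kernels operate on shared memory and registers rather than NumPy arrays.

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Single-head attention computed block-by-block with an online softmax,
    mirroring the running-max/denominator data flow FlashAttention keeps on chip."""
    n_q, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((n_q, V.shape[1]))           # accumulated (unnormalized) output
    m = np.full(n_q, -np.inf)                 # running row-wise max of the logits
    l = np.zeros(n_q)                         # running row-wise softmax denominator

    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale                # logits for this K/V block only

        m_new = np.maximum(m, S.max(axis=1))  # updated row max
        alpha = np.exp(m - m_new)             # rescale factor for old partials
        P = np.exp(S - m_new[:, None])        # block-local exponentials

        l = alpha * l + P.sum(axis=1)
        O = alpha[:, None] * O + P @ Vb
        m = m_new

    return O / l[:, None]                     # normalize once at the end
```

The result matches the naive `softmax(Q @ K.T / sqrt(d)) @ V` up to floating-point error, but only one K/V block of logits is ever materialized at a time; reproducing this rescaling logic with GPU shared memory, warp-level primitives, and tensor cores is exactly the part that is hard to hand-write for each architecture.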