QiMeng-Attention: SOTA Attention Operator is generated by SOTA Attention Algorithm

📅 2025-06-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: FlashAttention relies on hand-tuned, hardware-specific GPU implementations and therefore generalizes poorly across architectures; meanwhile, large language models (LLMs) struggle to generate high-performance attention kernels because they cannot model the operator's complex data flow or make effective use of low-level GPU primitives. Method: We propose LLM-TL, an LLM-friendly Thinking Language that decouples the generation of high-level optimization logic from the low-level GPU implementation, together with a two-stage workflow of TL-Code generation followed by translation. Contribution/Results: Our approach automatically generates FlashAttention implementations across diverse GPUs (A100, RTX8000, and T4) and extends support to hardware and data types that existing libraries do not cover. The generated kernels achieve up to 35.16× speedup over code produced by vanilla LLMs and surpass human-optimized libraries (cuDNN and the official library) in most scenarios. Development time drops from months to minutes compared with human experts, establishing a self-optimizing paradigm for attention operators.

📝 Abstract
The attention operator remains a critical performance bottleneck in large language models (LLMs), particularly in long-context scenarios. While FlashAttention is the most widely used and effective GPU-aware acceleration algorithm, it requires time-consuming, hardware-specific manual implementation, which limits its adaptability across GPU architectures. Existing LLMs have shown considerable promise in code generation tasks but struggle to generate high-performance attention code. The key challenge is that they cannot comprehend the complex data flow and computation process of the attention operator, nor utilize low-level primitives to exploit GPU performance. To address this challenge, we propose an LLM-friendly Thinking Language (LLM-TL) that helps LLMs decouple the generation of high-level optimization logic from low-level implementation on the GPU and enhances their understanding of the attention operator. Combined with a two-stage reasoning workflow, TL-Code generation and translation, LLMs can automatically generate FlashAttention implementations on diverse GPUs, establishing a self-optimizing paradigm for generating high-performance attention operators in attention-centric algorithms. Verified on A100, RTX8000, and T4 GPUs, our method significantly outperforms vanilla LLMs, achieving a speedup of up to 35.16x. Moreover, our method not only surpasses human-optimized libraries (cuDNN and the official library) in most scenarios but also extends support to previously unsupported hardware and data types, reducing development time from months to minutes compared with human experts.
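To make the abstract's reference to FlashAttention's data flow more concrete, the following is a minimal sketch of the tiling plus online-softmax idea that FlashAttention-style kernels are built around. It is an illustration only: it is plain CPU Python, not LLM-TL, not TL-Code, and not a kernel generated by the method described here. The function names `naive_attention` and `tiled_attention` and the `block_size` parameter are assumptions introduced for this example.

```python
import math


def naive_attention(Q, K, V):
    """Reference: softmax(Q K^T / sqrt(d)) V, materializing each full score row."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)  # subtract the row max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block_size=2):
    """Same result, but K/V are visited tile by tile with an online softmax.

    Only one tile of scores exists at a time; a running max, running sum,
    and rescaled accumulator replace the full score row.
    """
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for q in Q:
        running_max = -math.inf
        running_sum = 0.0
        acc = [0.0] * dv
        for start in range(0, len(K), block_size):
            k_tile = K[start:start + block_size]
            v_tile = V[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new max
            # (exp(-inf) == 0.0 on the first tile, so nothing is carried over).
            correction = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(weights)
            acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_tile))
                   for j, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out
```

In a GPU kernel the tile loop is where hardware-specific choices enter (tile sizes, memory placement, low-level matrix primitives), which is the part the abstract describes as time-consuming to hand-tune for each architecture.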
Problem

Research questions and friction points this paper is trying to address.

Attention operator bottleneck in LLMs for long-context scenarios
Manual GPU-specific attention implementation limits adaptability
LLMs struggle to generate high-performance attention code
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-friendly Thinking Language for decoupling optimization
2-stage reasoning workflow for automatic FlashAttention generation
Supports diverse GPUs (A100, RTX8000, T4) and reduces development time from months to minutes
Qirui Zhou
SKL of Processors, Institute of Computing Technology, CAS, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Shaohui Peng
Institute of Software, Chinese Academy of Sciences
Embodied AI, Reinforcement Learning
Weiqiang Xiong
Intelligent Software Research Center, Institute of Software, CAS, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Haixin Chen
SKL of Processors, Institute of Computing Technology, CAS, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Yuanbo Wen
Institute of Computing Technology, Chinese Academy of Sciences
Machine Learning System
Haochen Li
Tsinghua University
cell-cell communication, single-cell genomics, spatial transcriptomics
Ling Li
Intelligent Software Research Center, Institute of Software, CAS, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Qi Guo
SKL of Processors, Institute of Computing Technology, CAS, Beijing, China
Yongwei Zhao
Institute of Computing Technology, Chinese Academy of Sciences
Computer Architecture
Ke Gao
Intelligent Software Research Center, Institute of Software, CAS, Beijing, China
Ruizhi Chen
Intelligent Software Research Center, Institute of Software, CAS, Beijing, China
Yanjun Wu
Institute of Software, Chinese Academy of Sciences
Computer Science
Chen Zhao
Intelligent Software Research Center, Institute of Software, CAS, Beijing, China
Yunji Chen
Institute of Computing Technology, Chinese Academy of Sciences
processor architecture, microarchitecture, machine learning