ARGUS: Agentic GPU Optimization Guided by Data-Flow Invariants

📅 2026-04-16
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the performance gap between GPU kernels generated by LLM-based coding agents and hand-optimized libraries on critical operators such as matrix multiplication, attention, and Mixture-of-Experts (MoE). The authors propose ARGUS, an agentic framework built on data-flow invariants: a tile-based, Pythonic domain-specific language lets the agent attach symbolic tags to data and assert relational constraints at use sites, and the compiler verifies these invariants through abstract interpretation and SMT solving, returning structured diagnostic feedback with zero runtime overhead. An in-context reinforcement learning planner selects and coordinates tightly coupled optimizations, including tiling, shared-memory staging, and software pipelining. Evaluated on the AMD MI300X GPU, kernels generated by ARGUS reach 99–104% of the throughput of state-of-the-art hand-optimized assembly, run 2× to 1543× faster than existing agentic systems, and solve all Level 1 and 90% of Level 2 tasks in KernelBench.

📝 Abstract
LLM-based coding agents can generate functionally correct GPU kernels, yet their performance remains far below hand-optimized libraries on critical computations such as matrix multiplication, attention, and Mixture-of-Experts (MoE). Peak GPU performance requires coordinated reasoning over tightly coupled optimizations, including tiling, shared-memory staging, software pipelining, and instruction scheduling, while existing agents rely on sparse pass/fail feedback, leaving them unable to diagnose global constraint violations. We present Argus, an agentic framework that addresses this through data-flow invariants: compile-time specifications encoding how data must be choreographed throughout kernel execution. Argus introduces a tile-based, Pythonic DSL exposing hardware instructions and compiler policies while hiding low-level representations. The DSL provides tag functions to propagate symbolic annotations through data and control flow, and tag assertions to enforce relational constraints at use sites. When violations occur, the compiler returns concrete counterexamples identifying the thread, data element, and program point, enabling dense, structured feedback for targeted fixes. Invariants are verified at compile time via abstract interpretation over a layout algebra and SMT solving, with zero runtime overhead. An in-context reinforcement learning planner learns to select optimizations and synthesize effective invariants, supported by a curated knowledge base of GPU optimization techniques. We evaluate Argus on the AMD MI300X GPU across GEMM, flash attention, and MoE kernels accounting for over 90% of GPU time in LLM inference. Generated kernels achieve 99-104% of state-of-the-art hand-optimized assembly throughput and are 2-1543x faster than existing agentic systems. Argus further generalizes to 200 KernelBench tasks, solving 100% of Level 1 and 90% of Level 2 problems.
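
The tag-and-assert idea described in the abstract can be illustrated with a small toy model. The sketch below is not the Argus DSL or its verifier: every name in it (`stage_to_shared`, `check_loads`, `Counterexample`, the tile shape) is hypothetical, and a brute-force enumeration over threads stands in for the abstract interpretation and SMT solving the paper describes. It only shows how a violated layout invariant can yield a concrete counterexample naming the thread, the data element, and the program point.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# Hypothetical tile shape; not taken from the paper.
TILE_M, TILE_K = 4, 8

Tag = Tuple[int, int]  # symbolic annotation: the global (row, col) an element came from


@dataclass(frozen=True)
class Counterexample:
    thread: int               # thread that observed the violation
    element: Optional[Tag]    # tag of the element it actually read
    expected: Tag             # tag the invariant required
    program_point: str        # label of the use site


def stage_to_shared(store_index: Callable[[int, int], int]) -> Dict[int, Tag]:
    """Toy 'tag function': record which (row, col) each shared-memory slot holds."""
    shared: Dict[int, Tag] = {}
    for row in range(TILE_M):
        for col in range(TILE_K):
            shared[store_index(row, col)] = (row, col)
    return shared


def check_loads(
    shared: Dict[int, Tag],
    load_index: Callable[[int], int],
    program_point: str = "mma_operand_a",
) -> Optional[Counterexample]:
    """Toy 'tag assertion': thread t must read the element tagged (t // TILE_K, t % TILE_K).

    Exhaustive enumeration replaces the solver here; it returns the first
    violating thread, or None if the invariant holds for all threads.
    """
    for thread in range(TILE_M * TILE_K):
        expected = (thread // TILE_K, thread % TILE_K)
        observed = shared.get(load_index(thread))
        if observed != expected:
            return Counterexample(thread, observed, expected, program_point)
    return None


def row_major(row: int, col: int) -> int:
    return row * TILE_K + col


def col_major(row: int, col: int) -> int:
    return col * TILE_M + row


def identity_load(thread: int) -> int:
    return thread


# Store layout and load layout agree: the invariant holds.
ok = check_loads(stage_to_shared(row_major), identity_load)

# Store layout was changed but the load was not: the invariant is violated.
bad = check_loads(stage_to_shared(col_major), identity_load)

print(ok)
print(bad)
```

In this toy setting, changing the store layout without updating the load produces a counterexample rather than a bare pass/fail signal, which is the kind of dense feedback the abstract contrasts with sparse test results.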
Problem

Research questions and friction points this paper is trying to address.

GPU kernel optimization
agentic code generation
data-flow invariants
performance gap
global constraint violation
Innovation

Methods, ideas, or system contributions that make the work stand out.

data-flow invariants
agentic GPU optimization
tile-based DSL
compile-time verification
in-context reinforcement learning