FACT: Compositional Kernel Synthesis with a Three-Stage Agentic Workflow

📅 2026-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing deep learning compilers rely on hand-optimized libraries; when no suitable kernel is available, experts must write CUDA or CUTLASS code by hand. Current large language model (LLM)-based approaches, meanwhile, generate raw CUDA and do not reuse the mature optimization patterns these libraries already encode. The authors propose FACT, a framework that combines agent-driven, graph-level pattern discovery with automated generation, tuning, and dynamic cataloging of CUTLASS kernels, synthesizing high-performance GPU kernels end to end from PyTorch computation graphs. Evaluated on an NVIDIA A100 GPU, FACT achieves a 1.06–1.18× speedup over the PyTorch cuBLAS baseline on three GEMM workloads and a 2.79× end-to-end speedup on a MiniGPT block by composing fused multi-head attention with a fused MLP GEMM+GELU kernel.
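To make the discover, realize, and compose flow concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the rule table, the op names, the `Match` type, and the string placeholders standing in for CUTLASS kernels are all hypothetical, and the "graph" is simplified to a flat op sequence with non-overlapping matches.

```python
from dataclasses import dataclass

# Hypothetical rule table: op sequence -> (pattern name, priority).
# Lower priority number means the pattern is handled first.
RULES = {
    ("qkv_proj", "softmax", "attn_out"): ("fused_mha", 1),
    ("linear", "gelu"): ("fused_gemm_gelu", 2),
}


@dataclass(frozen=True)
class Match:
    pattern: str
    start: int
    length: int
    priority: int


def discover_patterns(ops):
    """Stage 1 (toy): find rule matches in the op sequence, best priority first."""
    matches = []
    for seq, (name, priority) in RULES.items():
        n = len(seq)
        for i in range(len(ops) - n + 1):
            if tuple(ops[i:i + n]) == seq:
                matches.append(Match(name, i, n, priority))
    return sorted(matches, key=lambda m: (m.priority, m.start))


def realize(match, catalog):
    """Stage 2 (toy): reuse a cataloged kernel, or record a newly 'generated' one."""
    if match.pattern not in catalog:
        # Placeholder string; a real system would build and verify a kernel here.
        catalog[match.pattern] = f"<kernel for {match.pattern}>"
    return catalog[match.pattern]


def compose(ops, matches, catalog):
    """Stage 3 (toy): replace each matched span with its kernel, keep other ops."""
    by_start = {m.start: m for m in matches}
    out, i = [], 0
    while i < len(ops):
        m = by_start.get(i)
        if m is not None:
            out.append(realize(m, catalog))
            i += m.length
        else:
            out.append(ops[i])
            i += 1
    return out


ops = ["layernorm", "qkv_proj", "softmax", "attn_out",
       "layernorm", "linear", "gelu", "linear"]
catalog = {}
matches = discover_patterns(ops)
composed = compose(ops, matches, catalog)
print(composed)
```

The catalog dictionary here only illustrates the idea of a pattern table that grows as new patterns are realized; how FACT actually stores and retrieves patterns is not specified in this summary.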

📝 Abstract

Deep learning compilers and vendor libraries deliver strong baseline performance but are bounded by finite, engineer-curated catalogs. When these omit needed optimizations, practitioners substitute hand-written CUDA or CUTLASS, demanding expertise in GPU microarchitecture and C++ template metaprogramming. Recent LLM-based agents target kernel generation in raw CUDA, forcing rediscovery of optimizations already encoded in mature libraries. We present FACT (Framework for Agentic CUTLASS Transpilation), a framework that employs a three-stage, agent-driven workflow to optimize PyTorch modules through multi-pattern composition while grounding synthesis in CUTLASS C++. (1) Pattern discovery: an LLM agent inspects the traced graph, matches subgraphs to optimization rules, retrieves vetted examples from an architecture-specific index, and outputs prioritized patterns. (2) Pattern realization: each pattern is implemented as a CUTLASS kernel wrapped in a PyTorch extension, verified, and auto-tuned by sweeping parameters inferred from the CUTLASS hierarchy. (3) Pattern composition: extensions are loaded together into a single composed module for end-to-end benchmarking. We evaluate the workflow using KernelBench's evaluation framework and provided modules on an NVIDIA A100. On Level 1, we apply the workflow to three GEMM workloads (square matrix multiply, batched matrix multiply, and large-$K$ matrix multiply). Auto-tuned CUTLASS kernels improve over the PyTorch cuBLAS baseline by $1.06\times$--$1.18\times$. On the Level 3 MiniGPT block, composing fused multi-head attention with fused MLP GEMM+GELU yields a $2.79\times$ end-to-end speedup. Our work couples agentic graph-level pattern discovery with auto-tuning and a dynamic pattern table, offering a practical path from traced PyTorch modules to deployable kernels by automating CUTLASS kernel synthesis.
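The verify-then-tune loop in stage (2) can be illustrated with a small, CPU-only sketch. This is an assumption-laden stand-in, not FACT's tuner: a pure-Python tiled matrix multiply plays the role of a generated kernel, the three tile sizes play the role of swept CUTLASS parameters, and the function names (`blocked_matmul`, `autotune`, `time_once`) are invented for illustration.

```python
import itertools
import time


def blocked_matmul(a, b, tile_m, tile_n, tile_k):
    """Toy tiled matmul on nested lists; stands in for a generated kernel."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile_m):
        for j0 in range(0, n, tile_n):
            for k0 in range(0, k, tile_k):
                for i in range(i0, min(i0 + tile_m, m)):
                    for j in range(j0, min(j0 + tile_n, n)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile_k, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


def reference_matmul(a, b):
    """Untiled reference used to verify each candidate configuration."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def is_close(c, ref, tol=1e-9):
    return all(abs(x - y) <= tol for rc, rr in zip(c, ref) for x, y in zip(rc, rr))


def time_once(a, b, config):
    start = time.perf_counter()
    blocked_matmul(a, b, *config)
    return time.perf_counter() - start


def autotune(a, b, search_space, repeats=3):
    """Sweep all tile configs, drop any that fail verification, keep the fastest."""
    ref = reference_matmul(a, b)
    best = None
    for config in itertools.product(*search_space):
        if not is_close(blocked_matmul(a, b, *config), ref):
            continue  # reject configurations that produce wrong results
        elapsed = min(time_once(a, b, config) for _ in range(repeats))
        if best is None or elapsed < best[1]:
            best = (config, elapsed)
    return best


# Small integer-valued inputs so results are exact regardless of summation order.
a = [[float(i + j) for j in range(6)] for i in range(5)]
b = [[float((i * j) % 3) for j in range(4)] for i in range(6)]
space = ([2, 4], [2, 4], [2, 3])  # candidate (tile_m, tile_n, tile_k) values
best_config, best_time = autotune(a, b, space)
print("best tile config:", best_config)
```

Checking correctness before timing mirrors the order given in the abstract (implemented, verified, then auto-tuned); taking the minimum over a few repeats is one common way to reduce timing noise, chosen here for simplicity.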

Problem

Research questions and friction points this paper is trying to address.

kernel synthesis

CUTLASS

deep learning compilers

optimization patterns

GPU microarchitecture

Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic workflow

CUTLASS synthesis

pattern composition