FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems

📅 2026-01-01
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
This work addresses the challenge of effectively integrating AI-generated GPU kernels into large language model (LLM) inference systems, where the lack of standardized evaluation and deployment pipelines has hindered practical adoption. The authors propose the first end-to-end reproducible framework tailored to AI-generated GPU kernels, featuring a unified FlashInfer Trace schema, real-world serving traces, a comprehensive evaluation methodology that jointly assesses correctness and performance, and a dynamic injection interface via an apply() function. This enables seamless collaboration across kernel generation, benchmarking, and production deployment in systems such as SGLang and vLLM. The project also introduces a public benchmark dataset and leaderboard, offering systematic insights into the capabilities of LLM agents in GPU kernel programming and the trade-offs among GPU programming languages, thereby establishing a practical paradigm for future agent-driven high-performance kernel optimization.
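The summary names four components of a FlashInfer Trace record: kernel definitions, workloads, implementations, and evaluations. A minimal Python sketch of such a record is below; the class name `KernelTrace`, the field layout, and the `is_correct` helper are illustrative assumptions, not the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class KernelTrace:
    # Hypothetical trace record mirroring the four components named in the
    # summary; field names are assumptions, not FlashInfer-Bench's real API.
    definition: str                  # operator/kernel being implemented
    workload: dict                   # shapes, dtypes, batch sizes from serving traces
    implementation: str              # source of a candidate kernel (e.g. CUDA/Triton)
    evaluation: dict = field(default_factory=dict)  # correctness + performance metrics

    def is_correct(self, atol: float = 1e-3) -> bool:
        # Correctness gate: max error vs. a reference must stay within atol.
        return self.evaluation.get("max_abs_err", float("inf")) <= atol

trace = KernelTrace(
    definition="paged_attention_decode",
    workload={"batch": 32, "head_dim": 128, "dtype": "float16"},
    implementation="// candidate kernel source ...",
    evaluation={"max_abs_err": 5e-4, "latency_us": 42.0},
)
```

A schema like this is what lets agents, the benchmark harness, and the serving engine exchange kernels without ad-hoc glue: each side reads and writes the same record shape.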
📝 Abstract
Recent advances show that large language models (LLMs) can act as autonomous agents capable of generating GPU kernels, but integrating these AI-generated kernels into real-world inference systems remains challenging. FlashInfer-Bench addresses this gap by establishing a standardized, closed-loop framework that connects kernel generation, benchmarking, and deployment. At its core, FlashInfer Trace provides a unified schema describing kernel definitions, workloads, implementations, and evaluations, enabling consistent communication between agents and systems. Built on real serving traces, FlashInfer-Bench includes a curated dataset, a robust correctness- and performance-aware benchmarking framework, a public leaderboard to track LLM agents' GPU programming capabilities, and a dynamic substitution mechanism (apply()) that seamlessly injects the best-performing kernels into production LLM engines such as SGLang and vLLM. Using FlashInfer-Bench, we further evaluate the performance and limitations of LLM agents, compare the trade-offs among different GPU programming languages, and provide insights for future agent design. FlashInfer-Bench thus establishes a practical, reproducible pathway for continuously improving AI-generated kernels and deploying them into large-scale LLM inference.
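The abstract describes apply() as a dynamic substitution mechanism that injects the best-performing kernel into a production engine. A hedged sketch of how such a hook could work is below; the registry, function names, and fastest-correct-candidate selection policy are assumptions for illustration, not FlashInfer-Bench's real interface.

```python
import math
from typing import Callable, Dict, List, Tuple

# op name -> list of (latency_us, fn, passed_correctness); an illustrative
# stand-in for benchmark results, not the project's actual data model.
_CANDIDATES: Dict[str, List[Tuple[float, Callable, bool]]] = {}

def register_candidate(op: str, fn: Callable, latency_us: float, correct: bool) -> None:
    # Record one benchmarked candidate implementation for an operator.
    _CANDIDATES.setdefault(op, []).append((latency_us, fn, correct))

def apply(op: str, baseline: Callable) -> Callable:
    # Return the fastest candidate that passed correctness checks; fall back
    # to the engine's baseline kernel when no candidate qualifies.
    passing = [(lat, fn) for lat, fn, ok in _CANDIDATES.get(op, []) if ok]
    if not passing:
        return baseline
    _, best = min(passing, key=lambda t: t[0])
    return best

# Engine-side usage: a reference kernel stands in for the baseline, and
# apply() is called once at dispatch-setup time.
def baseline_softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

register_candidate("softmax", baseline_softmax, latency_us=10.0, correct=True)
kernel = apply("softmax", baseline_softmax)
```

The key design point the abstract implies is that substitution is gated on correctness as well as speed, so a fast but wrong agent-generated kernel can never displace the baseline.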
Problem

Research questions and friction points this paper is trying to address.

AI-generated kernels
LLM agents
GPU programming
inference systems
kernel deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

FlashInfer-Bench
AI-generated kernels
closed-loop framework
dynamic kernel substitution
LLM agents