Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs

📅 2025-12-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high inter-GPU communication overhead and low hardware utilization caused by scattering computation across many small kernels in multi-GPU large-model inference, this paper proposes the first fully automatic megakernel generation framework. It fuses distributed operators across GPUs into a single unified kernel, enabling cross-operator software pipelining and fine-grained kernel overlap via SM-level task-graph modeling. Key innovations include: (i) the first SM-level graph representation for operator orchestration; (ii) compiler-driven, end-to-end CUDA code generation; and (iii) a decentralized, intra-kernel runtime scheduling mechanism—all while preserving full compatibility with mainstream programming models (e.g., PyTorch). Experiments demonstrate up to 1.7× reduction in end-to-end inference latency over state-of-the-art LLM serving systems (e.g., vLLM, Triton), approaching the GPU's theoretical peak throughput.
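The SM-level task graph described above can be sketched in miniature: each operator is split into per-SM tasks, and dependency edges connect individual tasks rather than whole kernels, so downstream work can begin as soon as the partitions it reads are complete. The function names and the chain-partitioning scheme below are illustrative assumptions, not MPK's actual IR.

```python
# Hedged sketch of an SM-level task graph. An operator is lowered into
# per-partition tasks; edges link tasks, not kernels, so a downstream
# task becomes ready before its producer operator fully finishes.
# All names are hypothetical; this is not MPK's real representation.

def lower(operators, partitions=2):
    """Lower a linear chain of operators into per-partition tasks.
    Task (op, p) depends only on the matching partition (prev_op, p)."""
    deps = {}
    for i, op in enumerate(operators):
        for p in range(partitions):
            deps[(op, p)] = set() if i == 0 else {(operators[i - 1], p)}
    return deps

def ready_after(deps, completed):
    """Tasks not yet run whose dependencies are all completed."""
    return {t for t, d in deps.items()
            if t not in completed and d <= completed}

deps = lower(["matmul", "gelu"])
# With kernel-per-operator execution, no gelu work could start until the
# entire matmul kernel finished. With SM-level edges, gelu partition 0
# is ready as soon as matmul partition 0 completes:
print(ready_after(deps, {("matmul", 0)}))
```

Here the ready set after finishing `("matmul", 0)` contains both the remaining matmul partition and `("gelu", 0)`, which is exactly the cross-operator pipelining opportunity that kernel boundaries would otherwise block.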

📝 Abstract
We introduce Mirage Persistent Kernel (MPK), the first compiler and runtime system that automatically transforms multi-GPU model inference into a single high-performance megakernel. MPK introduces an SM-level graph representation that captures data dependencies at the granularity of individual streaming multiprocessors (SMs), enabling cross-operator software pipelining, fine-grained kernel overlap, and other previously infeasible GPU optimizations. The MPK compiler lowers tensor programs into highly optimized SM-level task graphs and generates optimized CUDA implementations for all tasks, while the MPK in-kernel parallel runtime executes these tasks within a single megakernel using decentralized scheduling across SMs. Together, these components provide end-to-end kernel fusion with minimal developer effort, while preserving the flexibility of existing programming models. Our evaluation shows that MPK significantly outperforms existing kernel-per-operator LLM serving systems by reducing end-to-end inference latency by up to 1.7×, pushing LLM inference performance close to hardware limits. MPK is publicly available at https://github.com/mirage-project/mirage.
Problem

Research questions and friction points this paper is trying to address.

How to automatically fuse multi-GPU model inference into a single megakernel
How to enable cross-operator pipelining and fine-grained GPU optimizations that per-operator kernel boundaries prevent
How to reduce end-to-end inference latency in LLM serving systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compiler and runtime for single megakernel generation
SM-level graph representation enabling cross-operator pipelining
Decentralized scheduling within a single persistent kernel
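The decentralized-scheduling idea can be illustrated with a small thread-based simulation: each worker thread stands in for one SM's scheduler, pulling ready tasks from a shared queue and, on completion, atomically decrementing its dependents' counters so newly ready tasks are enqueued without any central dispatcher. This is a minimal sketch under assumed names, not MPK's in-kernel runtime.

```python
import threading
from queue import Queue

def run_graph(deps, num_workers=4):
    """Execute a task graph with decentralized workers.
    deps maps each task to the set of tasks it depends on.
    Workers emulate per-SM schedulers inside a persistent kernel:
    no central scheduler hands out work after launch."""
    counters = {t: len(p) for t, p in deps.items()}   # unmet deps per task
    dependents = {t: [] for t in deps}                # reverse edges
    for t, prereqs in deps.items():
        for p in prereqs:
            dependents[p].append(t)

    ready = Queue()
    for t, c in counters.items():
        if c == 0:
            ready.put(t)

    lock = threading.Lock()   # stands in for atomic counter updates on GPU
    finished = []
    total = len(deps)

    def worker():
        while True:
            task = ready.get()
            if task is None:          # sentinel: all work is done
                break
            with lock:
                finished.append(task)
                for d in dependents[task]:
                    counters[d] -= 1
                    if counters[d] == 0:
                        ready.put(d)  # completion unblocks dependents
                if len(finished) == total:
                    for _ in range(num_workers):
                        ready.put(None)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return finished

# Toy three-stage pipeline: load -> attention -> MLP.
order = run_graph({"load": set(), "attn": {"load"}, "mlp": {"attn"}})
print(order)
```

The completion order always respects the dependency edges, while no thread ever waits on a global coordinator, which is the property the in-kernel runtime relies on to keep every SM busy.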