Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs

📅 2025-12-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high inter-GPU communication overhead and low hardware utilization caused by scattering computation across many small kernels in multi-GPU large-model inference, this paper proposes the first fully automatic megakernel generation framework. It fuses distributed operators across GPUs into a single unified kernel, enabling cross-operator software pipelining and fine-grained kernel overlap via SM-level task-graph modeling. Key innovations include: (i) the first SM-level graph representation for operator orchestration; (ii) compiler-driven, end-to-end CUDA code generation; and (iii) a decentralized, intra-kernel runtime scheduling mechanism—all while preserving full compatibility with mainstream programming models (e.g., PyTorch). Experiments demonstrate up to 1.7× reduction in end-to-end inference latency over state-of-the-art LLM serving systems (e.g., vLLM, Triton), approaching the GPU's theoretical peak throughput.
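The SM-level task graph described above can be sketched in miniature: each operator is split into per-SM tasks, and dependency edges connect individual tasks rather than whole kernels, so downstream work can begin as soon as the partitions it reads are complete. The function names and the chain-partitioning scheme below are illustrative assumptions, not MPK's actual IR.

```python
# Hedged sketch of an SM-level task graph. An operator is lowered into
# per-partition tasks; edges link tasks, not kernels, so a downstream
# task becomes ready before its producer operator fully finishes.
# All names are hypothetical; this is not MPK's real representation.

def lower(operators, partitions=2):
    """Lower a linear chain of operators into per-partition tasks.
    Task (op, p) depends only on the matching partition (prev_op, p)."""
    deps = {}
    for i, op in enumerate(operators):
        for p in range(partitions):
            deps[(op, p)] = set() if i == 0 else {(operators[i - 1], p)}
    return deps

def ready_after(deps, completed):
    """Tasks not yet run whose dependencies are all completed."""
    return {t for t, d in deps.items()
            if t not in completed and d <= completed}

deps = lower(["matmul", "gelu"])
# With kernel-per-operator execution, no gelu work could start until the
# entire matmul kernel finished. With SM-level edges, gelu partition 0
# is ready as soon as matmul partition 0 completes:
print(ready_after(deps, {("matmul", 0)}))
```

Here the ready set after finishing `("matmul", 0)` contains both the remaining matmul partition and `("gelu", 0)`, which is exactly the cross-operator pipelining opportunity that kernel boundaries would otherwise block.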

📝 Abstract
We introduce Mirage Persistent Kernel (MPK), the first compiler and runtime system that automatically transforms multi-GPU model inference into a single high-performance megakernel. MPK introduces an SM-level graph representation that captures data dependencies at the granularity of individual streaming multiprocessors (SMs), enabling cross-operator software pipelining, fine-grained kernel overlap, and other previously infeasible GPU optimizations. The MPK compiler lowers tensor programs into highly optimized SM-level task graphs and generates optimized CUDA implementations for all tasks, while the MPK in-kernel parallel runtime executes these tasks within a single megakernel using decentralized scheduling across SMs. Together, these components provide end-to-end kernel fusion with minimal developer effort, while preserving the flexibility of existing programming models. Our evaluation shows that MPK significantly outperforms existing kernel-per-operator LLM serving systems by reducing end-to-end inference latency by up to 1.7×, pushing LLM inference performance close to hardware limits. MPK is publicly available at https://github.com/mirage-project/mirage.
Problem

Research questions and friction points this paper is trying to address.

How to automatically fuse multi-GPU model inference into a single megakernel
How to enable cross-operator pipelining and fine-grained GPU optimizations that per-operator kernel boundaries prevent
How to reduce end-to-end inference latency in LLM serving systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compiler and runtime for single megakernel generation
SM-level graph representation enabling cross-operator pipelining
Decentralized scheduling within a single persistent kernel
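The decentralized-scheduling idea can be illustrated with a small thread-based simulation: each worker thread stands in for one SM's scheduler, pulling ready tasks from a shared queue and, on completion, atomically decrementing its dependents' counters so newly ready tasks are enqueued without any central dispatcher. This is a minimal sketch under assumed names, not MPK's in-kernel runtime.

```python
import threading
from queue import Queue

def run_graph(deps, num_workers=4):
    """Execute a task graph with decentralized workers.
    deps maps each task to the set of tasks it depends on.
    Workers emulate per-SM schedulers inside a persistent kernel:
    no central scheduler hands out work after launch."""
    counters = {t: len(p) for t, p in deps.items()}   # unmet deps per task
    dependents = {t: [] for t in deps}                # reverse edges
    for t, prereqs in deps.items():
        for p in prereqs:
            dependents[p].append(t)

    ready = Queue()
    for t, c in counters.items():
        if c == 0:
            ready.put(t)

    lock = threading.Lock()   # stands in for atomic counter updates on GPU
    finished = []
    total = len(deps)

    def worker():
        while True:
            task = ready.get()
            if task is None:          # sentinel: all work is done
                break
            with lock:
                finished.append(task)
                for d in dependents[task]:
                    counters[d] -= 1
                    if counters[d] == 0:
                        ready.put(d)  # completion unblocks dependents
                if len(finished) == total:
                    for _ in range(num_workers):
                        ready.put(None)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return finished

# Toy three-stage pipeline: load -> attention -> MLP.
order = run_graph({"load": set(), "attn": {"load"}, "mlp": {"attn"}})
print(order)
```

The completion order always respects the dependency edges, while no thread ever waits on a global coordinator, which is the property the in-kernel runtime relies on to keep every SM busy.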