RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts

📅 2026-04-28
📈 Citations: 0
Influential: 0

🤖 AI Summary
Production MoE inference systems dispatch kernels from batch size alone and ignore the expert routing distribution, leaving 10–70% of kernel throughput unrealized. This work proposes RaMP, a routing-aware dispatch framework with two components. First, a performance-region analysis derives from hardware constants alone when each optimization helps; it correctly predicts all 8 tested architectures, 3 of them unseen. Second, a four-parameter wave cost model, fitted from 10–24 minutes of one-time profiling per model, selects the fastest kernel configuration from the runtime expert histogram with 0.93% mean regret versus exhaustive search. Because the cost model depends only on CTA grid geometry, it is kernel-agnostic. At the kernel level, RaMP achieves a 1.22× speedup over static dispatch. Integrated into vLLM serving, it delivers end-to-end speedups of 1.30× over Triton, 1.41× over DeepGEMM, and 1.13× over FlashInfer CUTLASS.
📝 Abstract
The optimal kernel configuration for Mixture-of-Experts (MoE) inference depends on both batch size and the expert routing distribution, yet production systems dispatch from batch size alone, leaving 10-70% of kernel throughput unrealized. We present RaMP, a routing-aware dispatch framework. A performance-region analysis derives, from hardware constants alone, when each optimization helps, correctly predicting all 8 tested architectures, including 3 unseen. A four-parameter wave cost model selects the fastest configuration from the runtime expert histogram, achieving 0.93% mean regret versus exhaustive search, fitted from just 10-24 minutes of one-time profiling per model. Because the model depends only on CTA grid geometry, it is kernel-agnostic: applied to Alpha-MoE, it delivers 1.14x with no source modification. Paired with a co-designed CuTe DSL kernel exposing 134-268 polymorphic configurations, RaMP delivers 1.22x kernel speedup over static dispatch and 1.30x end-to-end speedup in vLLM serving over Triton, 1.41x over DeepGEMM, and 1.13x over FlashInfer CUTLASS.
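To make the idea of histogram-driven dispatch concrete, here is a minimal Python sketch. It is not the paper's model: the abstract does not name the four parameters, so the ones below (`c_launch`, `c_wave`, `c_cta`, `c_pad`) are hypothetical, as are the `KernelConfig` fields, the toy GPU size, and all coefficient values. The sketch only illustrates how a cost computed from CTA grid geometry can pick different configurations for two histograms with the same batch size.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class KernelConfig:
    """One candidate configuration (all fields are illustrative assumptions)."""
    name: str
    tile_m: int        # token rows covered by one CTA
    tile_n: int        # output columns covered by one CTA
    ctas_per_sm: int   # assumed resident CTAs per SM
    # Four hypothetical fitted coefficients, in microseconds.
    c_launch: float    # fixed launch overhead
    c_wave: float      # cost per wave of CTAs
    c_cta: float       # per-CTA scheduling cost
    c_pad: float       # cost per padded (wasted) token row


def predicted_cost(cfg, histogram, n_cols, num_sms):
    """Estimate kernel time from the CTA grid implied by the expert histogram."""
    col_tiles = ceil(n_cols / cfg.tile_n)
    # Experts with zero tokens contribute no row tiles and therefore no CTAs.
    row_tiles = [ceil(t / cfg.tile_m) for t in histogram]
    ctas = sum(row_tiles) * col_tiles
    waves = ceil(ctas / (num_sms * cfg.ctas_per_sm))
    padded_rows = sum(r * cfg.tile_m - t for r, t in zip(row_tiles, histogram))
    return (cfg.c_launch + cfg.c_wave * waves
            + cfg.c_cta * ctas + cfg.c_pad * padded_rows)


def select_config(configs, histogram, n_cols, num_sms):
    """Pick the configuration with the lowest predicted cost."""
    return min(configs, key=lambda c: predicted_cost(c, histogram, n_cols, num_sms))


def regret(selected_time, best_time):
    """Relative slowdown of the selected config versus the exhaustive-search best."""
    return (selected_time - best_time) / best_time


# Toy setup: 4 SMs, 256 output columns, two candidate configurations.
CONFIGS = [
    KernelConfig("small_tile", 16, 128, 2, 5.0, 10.0, 0.1, 0.05),
    KernelConfig("large_tile", 64, 128, 1, 5.0, 18.0, 0.2, 0.05),
]
balanced = [64, 64, 64, 64]   # 256 tokens spread evenly over 4 experts
skewed = [250, 2, 2, 2]       # the same 256 tokens, routed mostly to one expert

choice_balanced = select_config(CONFIGS, balanced, n_cols=256, num_sms=4)
choice_skewed = select_config(CONFIGS, skewed, n_cols=256, num_sms=4)
print(choice_balanced.name, choice_skewed.name)
```

With these made-up coefficients the balanced histogram selects `large_tile` and the skewed one selects `small_tile`, even though both batches contain 256 tokens. A dispatcher keyed on batch size alone would have to use one configuration for both cases. The `regret` helper mirrors the abstract's metric: the relative gap between the measured time of the selected configuration and the best time found by exhaustive search.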
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
kernel configuration
expert routing
inference optimization
runtime awareness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
runtime-aware scheduling
kernel polymorphism
performance modeling
GPU kernel optimization