FluxMoE: Decoupling Expert Residency for High-Performance MoE Serving

📅 2026-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency in Mixture-of-Experts (MoE) model inference, where expert parameters permanently reside in GPU memory, consuming space that could otherwise be allocated to KV caches and thereby degrading throughput and memory efficiency. To overcome this limitation, the authors propose FluxMoE, a novel system that introduces an expert paging abstraction, treating expert weights as transient, streamable resources dynamically loaded on demand and immediately released after use. Built upon vLLM, FluxMoE enables on-demand loading and eviction of experts while co-optimizing memory scheduling to prioritize critical states such as KV caches under strict memory constraints. Experimental results demonstrate that FluxMoE achieves up to a 3.0× throughput improvement over vLLM in memory-constrained scenarios without compromising model accuracy.
📝 Abstract
Mixture-of-Experts (MoE) models have become a dominant paradigm for scaling large language models, but their rapidly growing parameter sizes introduce a fundamental inefficiency during inference: most expert weights remain idle in GPU memory while competing with performance-critical runtime state such as the key-value (KV) cache. Since KV cache capacity directly determines serving throughput, this mismatch leads to underutilized memory and degraded performance. In this paper, we present FluxMoE, a new MoE inference system that decouples expert parameters from persistent GPU residency. FluxMoE introduces an expert paging abstraction that treats expert weights as streamed, transient resources, materializing them on demand and evicting them immediately after use, allowing GPU memory to be preferentially allocated to throughput-critical runtime state. We implement FluxMoE atop vLLM to enable efficient MoE inference under severe memory constraints. Experimental results demonstrate that FluxMoE achieves up to 3.0× throughput gains over vLLM in memory-intensive regimes, without compromising model fidelity.
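The expert paging abstraction described above, where expert weights are materialized on demand under a fixed memory budget and released immediately after use, can be sketched as follows. This is a hypothetical illustration, not FluxMoE's actual implementation: the class name `ExpertPager`, the `budget_bytes` parameter, and the byte-copy stand-in for host-to-device transfer are all assumptions made for the sketch.

```python
# Hypothetical sketch of expert paging: expert weights live in host memory,
# and only a bounded "device" budget (shared with the KV cache) may hold
# resident copies. Experts are paged in on demand and evicted after use.
# All names here are illustrative, not FluxMoE's API.

class ExpertPager:
    def __init__(self, host_experts, budget_bytes):
        self.host = host_experts      # expert_id -> weight bytes (simulated)
        self.budget = budget_bytes    # device bytes available for experts
        self.resident = {}            # expert_id -> "device" copy
        self.used = 0

    def acquire(self, expert_id):
        """Materialize an expert on demand, evicting others if needed."""
        if expert_id in self.resident:
            return self.resident[expert_id]
        size = len(self.host[expert_id])
        # Evict resident experts until the new one fits in the budget.
        while self.used + size > self.budget and self.resident:
            victim, w = self.resident.popitem()  # simple eviction for the sketch
            self.used -= len(w)
        if self.used + size > self.budget:
            raise MemoryError("expert does not fit in the device budget")
        copy = bytes(self.host[expert_id])       # stands in for an H2D transfer
        self.resident[expert_id] = copy
        self.used += size
        return copy

    def release(self, expert_id):
        """Evict immediately after use so the space can serve the KV cache."""
        w = self.resident.pop(expert_id, None)
        if w is not None:
            self.used -= len(w)
```

In a real serving system the transfer would be an asynchronous host-to-device copy overlapped with computation of other layers, and eviction policy and scheduling would be co-designed with the KV cache allocator, as the paper describes.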
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
GPU memory
inference efficiency
KV cache
model serving
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
expert paging
GPU memory management
KV cache optimization
inference throughput
Qingxiu Liu (The Chinese University of Hong Kong)
Cyril Y. He (SCITIX)
Hanser Jiang (SCITIX)
Zion Wang (SCITIX)
Alan Zhao (SCITIX)
Patrick P. C. Lee (The Chinese University of Hong Kong; storage systems, networks, distributed systems, dependability)