ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference

📅 2026-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the I/O overhead that degrades memory-constrained inference of Mixture-of-Experts (MoE) large language models, where experts missing from the cache must be loaded repeatedly from external storage. The authors propose a router fine-tuning method that adds no inference-time overhead and steers the router toward experts activated within a short recent window. This raises expert reuse, improves cache locality, and reduces external storage accesses. The approach targets fine-grained MoE architectures and integrates with mainstream inference frameworks such as vLLM and llama.cpp. Experiments on DeepSeek and Qwen models show a 26% increase in expert reuse, yielding 1.77–1.99× faster decoding under llama.cpp on Jetson Orin NX and an 8.4% output-throughput improvement under vLLM.
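To make the link between routing stability and storage traffic concrete, here is a minimal sketch of an expert cache. It is not the authors' implementation: the LRU eviction policy, the function names `count_expert_fetches` and `reuse_rate`, the reuse definition (overlap with the previous token's experts), and the toy routing traces are all assumptions chosen for illustration.

```python
from collections import OrderedDict


def count_expert_fetches(routing, cache_size):
    """Count cache misses (storage fetches) for per-token expert selections.

    Assumes a single LRU cache of `cache_size` experts (hypothetical policy).
    """
    cache = OrderedDict()  # keys are expert ids, ordered from least to most recently used
    fetches = 0
    for experts in routing:
        for expert in experts:
            if expert in cache:
                cache.move_to_end(expert)  # hit: mark as most recently used
            else:
                fetches += 1  # miss: expert must be loaded from storage
                cache[expert] = True
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict least recently used expert
    return fetches


def reuse_rate(routing):
    """Fraction of selected experts that the previous token also selected.

    This is an illustrative metric, not necessarily the one used in the paper.
    """
    reused = total = 0
    for prev, cur in zip(routing, routing[1:]):
        reused += len(set(prev) & set(cur))
        total += len(cur)
    return reused / total if total else 0.0


# Toy traces with top-2 routing: one scattered, one temporally stable.
scattered = [(0, 1), (2, 3), (4, 5), (0, 1), (2, 3), (4, 5)]
stable = [(0, 1), (0, 1), (0, 2), (0, 2), (3, 2), (3, 2)]

for name, trace in (("scattered", scattered), ("stable", stable)):
    print(name, reuse_rate(trace), count_expert_fetches(trace, cache_size=4))
```

With a cache of four experts, the scattered trace misses on every access (12 fetches, reuse 0.0), while the stable trace needs only 4 fetches at a reuse rate of 0.8. The numbers are artifacts of the toy traces; they only illustrate why steering the router toward recently used experts can cut storage reads.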
📝 Abstract
Fine-grained Mixture-of-Experts (MoE) models sparsely activate only a subset of experts per token, reducing activated computation while maintaining high model capacity. However, in memory-constrained inference scenarios, only a small set of experts can be cached. Experts not in the cache must be fetched from slow external storage (e.g., UFS), leading to frequent evictions and substantial I/O overhead. We propose ReMoE, a router fine-tuning framework designed to boost token-wise expert reuse. ReMoE biases the router toward recently selected experts, producing temporally stable routing that better matches cache locality constraints. By increasing short-horizon expert reuse, ReMoE reduces expert fetches from storage without adding inference-time computation. Experiments on DeepSeek and Qwen models show that ReMoE improves expert reuse by 26% while maintaining downstream task performance. Real-system evaluations further confirm these benefits, improving output throughput by 8.4% under vLLM GPU-CPU expert offloading and reducing TPOT by 43.6–49.8% under llama.cpp on Jetson Orin NX, corresponding to a 1.77–1.99× decode speedup across diverse workloads. Checkpoints and usage instructions are available at https://github.com/BUAA-OSCAR/ReMoE.
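The reported TPOT reduction and decode speedup are two views of the same measurement. Assuming TPOT means time per output token and that speedup is the ratio of baseline TPOT to improved TPOT, a reduction of r percent implies a speedup of 1 / (1 - r/100). The helper name below is hypothetical; the sketch only checks that the two ranges quoted in the abstract agree with each other.

```python
def speedup_from_tpot_reduction(reduction_pct):
    """Decode speedup implied by a percentage reduction in time per output token.

    Assumes speedup = baseline TPOT / improved TPOT.
    """
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


# The two endpoints of the TPOT reduction range quoted in the abstract.
for reduction in (43.6, 49.8):
    print(f"{reduction}% TPOT reduction -> {speedup_from_tpot_reduction(reduction):.2f}x")
# 43.6% TPOT reduction -> 1.77x
# 49.8% TPOT reduction -> 1.99x
```

Under this definition, the 43.6–49.8% TPOT reduction matches the stated 1.77–1.99× decode speedup.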
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
memory-constrained inference
expert caching
I/O overhead
LLM inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
router fine-tuning
expert reuse
memory-constrained inference
cache locality