Temporally Extended Mixture-of-Experts Models

📅 2026-04-21

🤖 AI Summary
This work addresses the inefficient use of GPU memory in conventional Mixture-of-Experts (MoE) models at inference time: experts are switched at nearly every token, which defeats memory optimizations such as offloading and pre-fetching and hinders deployment under limited memory budgets. The authors propose a temporally extended MoE layer built on the options framework from reinforcement learning, in which a lightweight controller added to each layer learns when to switch expert sets and which set to load. The approach combines the option-critic architecture with deliberation costs, low-rank adapters (LoRA), a self-distillation reward, and an expert preloading strategy; the deliberation cost lets model trainers trade off switching rate against capability. Applied to gpt-oss-20b, the method reduces switch rates from over 50% to below 5% while retaining up to 90% of base-model accuracy on the MATH, MMLU, and MMMLU benchmarks.
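The headline metric above is the expert switch rate. As a rough illustration only, the sketch below assumes the rate is the fraction of consecutive token pairs whose active expert set differs; the paper's exact definition is not given here, and `switch_rate` and the sample data are hypothetical names and values.

```python
def switch_rate(expert_sets):
    """Fraction of token-to-token transitions where the active expert set changes.

    Assumption: each entry holds the expert ids active for one token in one layer.
    """
    if len(expert_sets) < 2:
        return 0.0
    pairs = zip(expert_sets, expert_sets[1:])
    switches = sum(1 for prev, cur in pairs if set(prev) != set(cur))
    return switches / (len(expert_sets) - 1)


# Hypothetical routing traces for five tokens.
per_token = [{0, 3}, {1, 3}, {1, 3}, {2, 5}, {0, 3}]  # routing re-decided at every token
held = [{0, 3}, {0, 3}, {0, 3}, {0, 3}, {2, 5}]       # expert set held across tokens

print(switch_rate(per_token))  # 0.75
print(switch_rate(held))       # 0.25
```

A lower value means the loaded experts stay resident longer, which is what makes offloading and pre-fetching worthwhile.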

📝 Abstract
Mixture-of-Experts models, now popular for scaling capacity at fixed inference speed, switch experts at nearly every token. Once a model outgrows available GPU memory, this churn can render optimizations like offloading and pre-fetching ineffective. We make the case that the options framework in reinforcement learning is a perfect match to tackle this problem, and argue for temporally extended mixture-of-experts layers. Building on the option-critic framework with deliberation costs, we add a controller to each layer that learns when to switch expert sets and which to load. By applying this to gpt-oss-20b with low-rank adapters and a self-distillation reward, our method reduces switch rates from over 50% to below 5% while retaining up to 90% of base-model accuracy on MATH, MMLU, and MMMLU. This shows that even existing pre-trained models can be converted to temporally extended MoEs with lightweight training, with the deliberation cost allowing model trainers to trade off switching rates against capability. We hope this opens a principled path, grounded in the options framework, for memory-efficient serving and continual learning in ever-growing MoE models.
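The controller described in the abstract decides when to switch expert sets and which set to load, with a deliberation cost discouraging switches. The toy sketch below illustrates only that control flow, not the authors' implementation: in the paper the termination and selection behaviour is learned, whereas here `terminate_prob` and `option_scores` are passed in as plain numbers, and `OptionController` and `run` are hypothetical names.

```python
import random
from dataclasses import dataclass


@dataclass
class OptionController:
    num_options: int          # assumption: number of candidate expert sets ("options")
    deliberation_cost: float  # penalty charged each time the expert set is switched
    current: int = 0          # index of the expert set currently loaded

    def step(self, terminate_prob, option_scores, rng):
        """Advance one token; return (active option, whether a switch happened)."""
        # Keep the current expert set unless the termination draw fires.
        if rng.random() >= terminate_prob:
            return self.current, False
        # On termination, pick the highest-scoring expert set.
        best = max(range(self.num_options), key=lambda i: option_scores[i])
        switched = best != self.current
        self.current = best
        return best, switched


def run(controller, terminate_prob, scores_per_token, reward_per_token=1.0, seed=0):
    """Return (total reward net of deliberation costs, number of switches)."""
    rng = random.Random(seed)
    total, switches = 0.0, 0
    for scores in scores_per_token:
        _, switched = controller.step(terminate_prob, scores, rng)
        total += reward_per_token
        if switched:
            total -= controller.deliberation_cost
            switches += 1
    return total, switches


scores = [[0.1, 0.9], [0.9, 0.1], [0.1, 0.9]]
always = run(OptionController(2, 0.5), 1.0, scores)  # terminates at every token
never = run(OptionController(2, 0.5), 0.0, scores)   # never terminates
print(always)  # (1.5, 3)
print(never)   # (3.0, 0)
```

Raising `deliberation_cost` makes each switch more expensive relative to the per-token reward, which is the lever the abstract describes for trading switching rate against capability.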
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
expert switching
memory efficiency
temporal extension
GPU memory
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
options framework
temporal extension
deliberation cost
memory-efficient serving