Local Mixtures of Experts: Essentially Free Test-Time Training via Model Merging

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scalability bottleneck of Mixture-of-Experts (MoE) models, where high training and inference costs constrain expert count and hinder simultaneous capacity scaling and efficiency, this paper proposes Test-Time Model Merging (TTMM). At train time, many lightweight, composable expert modules are fine-tuned; at test time, the experts most relevant to the prompt are combined into a single model via a dynamic weighted merge, scaling the MoE paradigm to an order of magnitude more experts without additional forward passes or gradient updates. TTMM leverages model merging to approximate test-time training (TTT), which fine-tunes an expert for each prompt: its performance improves with the number of experts and approaches that of TTT, while on a 1B-parameter base model it is more than 100× faster than TTT at test time. The core contribution is a novel trade-off that amortizes the cost of TTT at train time, enabling "essentially free" test-time adaptation while preserving ultra-low inference latency.
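The merging step described above can be illustrated with a small sketch. This is not the paper's implementation; it assumes each expert was trained on a cluster of the corpus, that a centroid embedding is stored per expert, and that each expert's parameters are represented as a flat delta vector (e.g. a LoRA update). The function names and the softmax temperature are illustrative choices.

```python
import numpy as np

def merge_experts(prompt_emb, centroids, expert_deltas, top_k=3, temp=0.1):
    """Hypothetical sketch of test-time merging: weight the top-k nearest
    experts by prompt similarity and average their parameter deltas.

    prompt_emb:    (d,) embedding of the prompt
    centroids:     (n_experts, d) centroid of each expert's training cluster
    expert_deltas: (n_experts, p) flattened per-expert parameter deltas
    """
    # cosine similarity between the prompt and each expert's centroid
    sims = centroids @ prompt_emb / (
        np.linalg.norm(centroids, axis=1) * np.linalg.norm(prompt_emb) + 1e-9
    )
    top = np.argsort(sims)[-top_k:]   # indices of the k most similar experts
    w = np.exp(sims[top] / temp)
    w /= w.sum()                      # softmax weights over the selected experts
    # one merged delta: no extra forward passes, no gradient steps at test time
    return w @ expert_deltas[top]
```

Because the output is a single merged parameter vector applied once per prompt, inference afterwards runs at the base model's latency, which is the "essentially free" property the title refers to.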

📝 Abstract
Mixture of expert (MoE) models are a promising approach to increasing model capacity without increasing inference cost, and are core components of many state-of-the-art language models. However, current MoE models typically use only a few experts due to prohibitive training and inference cost. We propose Test-Time Model Merging (TTMM), which scales the MoE paradigm to an order of magnitude more experts and uses model merging to avoid almost any test-time overhead. We show that TTMM is an approximation of test-time training (TTT), which fine-tunes an expert model for each prediction task, i.e., prompt. TTT has recently been shown to significantly improve language models, but is computationally expensive. We find that performance of TTMM improves with more experts and approaches the performance of TTT. Moreover, we find that with a 1B parameter base model, TTMM is more than 100x faster than TTT at test-time by amortizing the cost of TTT at train-time. Thus, TTMM offers a promising cost-effective approach to scale test-time training.
Problem

Research questions and friction points this paper is trying to address.

Current MoE models use only a few experts due to prohibitive training and inference cost
Test-time training (TTT) improves language models but is computationally expensive per prompt
Scaling model capacity while keeping inference latency low remains an open tension
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scales the MoE paradigm to an order of magnitude more experts
Merges expert parameters at test time to avoid almost any overhead
Approximates test-time training at a fraction of its cost
Ryo Bertolissi
ETH Zürich, Switzerland
Jonas Hubotter
ETH Zürich, Switzerland
Ido Hakimi
Google Research
Andreas Krause
ETH Zürich, Switzerland