Variational Routing: A Scalable Bayesian Framework for Calibrated Mixture-of-Experts Transformers

📅 2026-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of achieving well-calibrated uncertainty quantification in large-scale Mixture-of-Experts (MoE) models. The authors propose VMoER, a structured variational approach that integrates Bayesian inference into the MoE routing mechanism. By confining Bayesian inference to the expert-selection stage, VMoER either performs amortized variational inference over the routing logits or infers a temperature parameter for stochastic expert selection. The method delivers substantial improvements at negligible computational cost (under 1% additional FLOPs): a 38% improvement in routing stability under noise, a 94% reduction in calibration error, and a 12% increase in out-of-distribution detection AUROC. VMoER thus offers a scalable path toward foundation models that are simultaneously robust, well calibrated, and efficient.
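The temperature-based strategy mentioned above can be illustrated with a short sketch. The paper does not publish its implementation, so the function and parameter names below (`stochastic_route`, `log_temperature`) are hypothetical; the sketch only shows the general idea of scaling routing logits by a learned temperature and sampling experts instead of taking a deterministic top-k argmax.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def stochastic_route(logits, log_temperature, top_k=2):
    """Temperature-based stochastic routing (illustrative sketch).

    A single learned temperature tau scales the routing logits; experts
    are then *sampled* from the tempered distribution rather than chosen
    by a deterministic argmax, so tau controls routing entropy.
    """
    tau = np.exp(log_temperature)        # parameterize in log-space to keep tau > 0
    probs = softmax(logits / tau)
    # Sample top_k distinct experts per token from the tempered distribution
    chosen = np.array([
        rng.choice(probs.shape[-1], size=top_k, replace=False, p=p)
        for p in probs
    ])
    # Gate weights: renormalized probabilities of the sampled experts
    gates = np.take_along_axis(probs, chosen, axis=-1)
    gates = gates / gates.sum(axis=-1, keepdims=True)
    return chosen, gates

# Example: 3 tokens routed over 4 experts
logits = rng.standard_normal((3, 4))
chosen, gates = stochastic_route(logits, log_temperature=0.0)
```

As tau grows, the tempered distribution flattens and routing becomes more exploratory; as tau shrinks, sampling concentrates on the highest-logit experts and approaches deterministic top-k routing.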

📝 Abstract
Foundation models are increasingly being deployed in contexts where understanding the uncertainty of their outputs is critical to ensuring responsible deployment. While Bayesian methods offer a principled approach to uncertainty quantification, their computational overhead renders their use impractical for training or inference at foundation model scale. State-of-the-art models achieve parameter counts in the trillions through carefully engineered sparsity, including Mixture-of-Experts (MoE) layers. In this work, we demonstrate calibrated uncertainty at scale by introducing Variational Mixture-of-Experts Routing (VMoER), a structured Bayesian approach for modelling uncertainty in MoE layers. VMoER confines Bayesian inference to the expert-selection stage, which is typically handled by a deterministic routing network. We instantiate VMoER using two inference strategies: amortised variational inference over routing logits, and inferring a temperature parameter for stochastic expert selection. Across tested foundation models, VMoER improves routing stability under noise by 38%, reduces calibration error by 94%, and increases out-of-distribution AUROC by 12%, while incurring less than 1% additional FLOPs. These results suggest VMoER offers a scalable path toward robust and uncertainty-aware foundation models.
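The first inference strategy, amortised variational inference over routing logits, can be sketched as follows. The paper's actual architecture is not given here, so the names (`variational_route`, `W_mu`, `W_logvar`) and the choice of a diagonal-Gaussian posterior with a standard-normal prior are illustrative assumptions, not the authors' implementation. The sketch shows the core mechanics: the router predicts a mean and log-variance per expert, draws a reparameterized sample of the logits, selects top-k experts from the resulting stochastic distribution, and exposes a closed-form KL term that a variational training objective could penalize.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def variational_route(h, W_mu, W_logvar, top_k=2):
    """Amortised variational routing (illustrative sketch).

    Instead of deterministic logits l = h @ W, the router predicts a
    Gaussian q(l | h) = N(mu, diag(sigma^2)) per token and draws a
    reparameterized sample, making expert selection stochastic.
    """
    mu = h @ W_mu                                # mean routing logits
    logvar = h @ W_logvar                        # log-variance of logits
    eps = rng.standard_normal(mu.shape)
    logits = mu + np.exp(0.5 * logvar) * eps     # reparameterization trick
    probs = softmax(logits)
    # Top-k expert selection with renormalized gate weights
    top = np.argsort(-probs, axis=-1)[..., :top_k]
    gates = np.take_along_axis(probs, top, axis=-1)
    gates = gates / gates.sum(axis=-1, keepdims=True)
    # Closed-form KL(q(l|h) || N(0, I)) per token, for the ELBO regularizer
    kl = 0.5 * (np.exp(logvar) + mu**2 - 1.0 - logvar).sum(axis=-1)
    return top, gates, kl

# Example: 3 tokens, hidden size 8, 4 experts
d_model, n_experts = 8, 4
h = rng.standard_normal((3, d_model))
W_mu = rng.standard_normal((d_model, n_experts)) * 0.1
W_logvar = rng.standard_normal((d_model, n_experts)) * 0.01
experts, gates, kl = variational_route(h, W_mu, W_logvar)
```

Because the sampled logits vary per forward pass, repeated passes over the same token yield a distribution over expert assignments, which is what makes routing-level uncertainty estimates (and the reported calibration and OOD gains) possible while leaving the expert FFNs untouched.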
Problem

Research questions and friction points this paper is trying to address.

Uncertainty Quantification
Mixture-of-Experts
Bayesian Inference
Foundation Models
Model Calibration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Variational Routing
Mixture-of-Experts
Bayesian Inference
Uncertainty Quantification
Foundation Models