Robustness of Mixtures of Experts to Feature Noise

📅 2026-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates why Mixture-of-Experts (MoE) models outperform dense networks under identical parameter budgets in the presence of input feature noise. Focusing on scenarios where inputs exhibit latent modular structure corrupted by noise, the study proposes that MoE implicitly filters noise through its sparse activation mechanism. Through theoretical analysis, synthetic data experiments, and real-world language tasks, the authors demonstrate that MoE achieves superior noise robustness, lower generalization error, and faster convergence by leveraging sparse, modular computation. The findings reveal that MoE not only maintains computational efficiency but also significantly enhances generalization performance, offering a novel perspective on the advantages of sparse architectures in noisy environments.
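The noise-filtering intuition can be sketched in a few lines: with hard top-1 gating, each input activates only the expert keyed to its latent block, so the gate never mixes in features from the other (noise-only) block. This is a minimal NumPy illustration under assumed toy settings (hand-set gate weights aligned with the latent blocks, identity experts), not the paper's actual construction or analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

def top1_moe(x, gate_w, expert_ws):
    """Route each input to the single expert with the highest gate score.

    x: (batch, d) inputs; gate_w: (d, n_experts); expert_ws: one (d, d)
    weight matrix per expert. Only the selected expert runs per input --
    the sparse activation mechanism the paper studies.
    """
    scores = x @ gate_w                 # (batch, n_experts) gate logits
    choice = scores.argmax(axis=1)      # hard top-1 routing decision
    out = np.empty_like(x)
    for e, w in enumerate(expert_ws):
        mask = choice == e
        out[mask] = x[mask] @ w         # each expert sees only "its" inputs
    return out, choice

# Toy modular inputs: each sample carries signal in one of two latent
# 4-dim blocks; the remaining features are pure noise after corruption.
d, n_experts, batch = 8, 2, 16
block = rng.integers(0, n_experts, size=batch)
x_clean = np.zeros((batch, d))
for i, b in enumerate(block):
    x_clean[i, b * 4:(b + 1) * 4] = rng.uniform(0.5, 1.5, size=4)
x_noisy = x_clean + 0.1 * rng.normal(size=x_clean.shape)  # feature noise

# Hand-set gate aligned with the latent block structure (an assumption
# made for illustration; in practice the gate is learned).
gate_w = np.zeros((d, n_experts))
gate_w[:4, 0] = 1.0
gate_w[4:, 1] = 1.0
expert_ws = [np.eye(d) for _ in range(n_experts)]

out, choice = top1_moe(x_noisy, gate_w, expert_ws)
```

Because routing follows the signal-carrying block, the per-input computation ignores the gate contribution of the noise-only block, which is the sparse "noise filter" effect described above; a dense network with the same parameter count would process all features for every input.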

📝 Abstract
Despite their practical success, it remains unclear why Mixture of Experts (MoE) models can outperform dense networks beyond sheer parameter scaling. We study an iso-parameter regime where inputs exhibit latent modular structure but are corrupted by feature noise, a proxy for noisy internal activations. We show that sparse expert activation acts as a noise filter: compared to a dense estimator, MoEs achieve lower generalization error under feature noise, improved robustness to perturbations, and faster convergence speed. Empirical results on synthetic data and real-world language tasks corroborate the theoretical insights, demonstrating consistent robustness and efficiency gains from sparse modular computation.
Problem

Research questions and friction points this paper is trying to address.

Mixture of Experts · feature noise · robustness · generalization error · sparse activation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dong Sun
CISPA Helmholtz Center for Information Security, Saarbrücken, Germany
Rahul Nittala
CISPA Helmholtz Center for Information Security, Saarbrücken, Germany
Rebekka Burkholz
CISPA Helmholtz Center for Information Security
Machine Learning · Deep Learning Efficiency · Complex Networks · Cascades · Gene Regulation