🤖 AI Summary
This work investigates why Mixture-of-Experts (MoE) models outperform dense networks under identical parameter budgets when input features are noisy. Focusing on inputs with latent modular structure corrupted by noise, the study proposes that MoE's sparse activation mechanism implicitly filters that noise. Through theoretical analysis, synthetic-data experiments, and real-world language tasks, the authors show that sparse, modular computation gives MoE lower generalization error, greater robustness to perturbations, and faster convergence than a dense counterpart. The findings indicate that MoE not only preserves computational efficiency but also improves generalization, offering a new perspective on the advantages of sparse architectures in noisy environments.
📝 Abstract
Despite their practical success, it remains unclear why Mixture of Experts (MoE) models can outperform dense networks beyond sheer parameter scaling. We study an iso-parameter regime in which inputs exhibit latent modular structure but are corrupted by feature noise, a proxy for noisy internal activations. We show that sparse expert activation acts as a noise filter: compared to a dense estimator, MoEs achieve lower generalization error under feature noise, improved robustness to perturbations, and faster convergence. Empirical results on synthetic data and real-world language tasks corroborate the theoretical analysis, demonstrating consistent robustness and efficiency gains from sparse modular computation.
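The sparse expert activation described above can be sketched as a top-k gated MoE layer. This is a minimal illustration of the routing mechanism, not the authors' implementation; all names, shapes, and values here are assumptions. The key point is that only the k selected experts run on a given input, so each input is processed by a specialized subnetwork rather than the full parameter set:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(x, gate_w, experts, k=1):
    """Route input x to the top-k experts by gate score.

    Sparse activation: only k of len(experts) experts execute,
    so computation is restricted to the sub-module the router
    deems relevant for this input.
    """
    scores = gate_w @ x                       # one gating score per expert
    top = np.argsort(scores)[-k:]             # indices of the top-k experts
    weights = softmax(scores[top])            # renormalize over selected experts
    return sum(w * (experts[i] @ x) for w, i in zip(weights, top))

# Toy setup (illustrative only): 4 linear experts on an 8-dim input
d, n_experts = 8, 4
gate_w = rng.normal(size=(n_experts, d))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]

x = rng.normal(size=d)
y = moe_forward(x, gate_w, experts, k=1)
assert y.shape == (d,)
```

With k=1 only a quarter of the expert parameters touch any given input, which is the sparsity that the paper argues acts as an implicit filter against feature noise.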