When Does Sparse MoE Help in Vision? The Role of Backbone Compute Leverage in Sparse Routing

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work examines when sparse top-k routing with hard capacity constraints improves image classification, given that Sparse Mixture-of-Experts (MoE) models in vision often suffer from expert collapse and deliver only marginal end-to-end efficiency gains. Using multi-seed evaluation on CIFAR-10/100, Tiny-ImageNet, and ImageNet-1K, the authors observe a compute-leverage pattern: sparse routing beats the dense baseline only when a substantial fraction ρ of total FLOPs is routed, and at ImageNet scale multiple active experts (k ≥ 2) are also required. Two controlled experiments isolate these factors: a hidden-size sweep on CIFAR-10 across standard and depthwise backbones, and an ImageNet-1K ablation that varies only top-k while holding architecture, initialization, and ρ fixed. The authors also introduce a per-sample variant of Soft MoE that applies the softmax over experts rather than over the batch; it lifts CIFAR-100 accuracy above the dense baseline and points to batch-axis dispatch as the dominant failure mode in per-sample CNN settings.
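As a rough illustration of the routed-compute fraction ρ mentioned above, the sketch below computes the share of per-sample FLOPs spent inside routed experts. The accounting is an assumption, not the paper's definition: routed FLOPs are taken as `top_k` times the cost of one expert, and everything else is lumped into an always-on backbone term. The function name and the numbers are hypothetical.

```python
def routed_fraction(backbone_flops, expert_flops, top_k):
    """Return rho = routed FLOPs / total FLOPs for one forward pass.

    Assumed accounting (not taken from the paper): each sample activates
    `top_k` experts of equal cost, and all remaining compute is dense.
    """
    routed = top_k * expert_flops
    total = backbone_flops + routed
    return routed / total


# Hypothetical budget: 0.9 GFLOPs of dense backbone, 0.05 GFLOPs per expert.
rho_k1 = routed_fraction(9.0e8, 0.5e8, top_k=1)
rho_k2 = routed_fraction(9.0e8, 0.5e8, top_k=2)
print(f"rho at k=1: {rho_k1:.3f}, rho at k=2: {rho_k2:.3f}")
```

Under this accounting, raising k also raises ρ unless the experts are shrunk to compensate, which is presumably why the ImageNet-1K ablation has to hold ρ fixed explicitly while varying k.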

📝 Abstract

Mixture-of-Experts (MoE) networks promise favorable accuracy-compute trade-offs, yet practical vision deployments are hindered by expert collapse and limited end-to-end efficiency gains. We study when sparse top-$k$ routing with hard capacity constraints helps in vision classification, evaluated under multi-seed protocols on four benchmarks (CIFAR-10/100, Tiny-ImageNet, ImageNet-1K). We observe a \emph{compute-leverage pattern}: positive accuracy gaps require a substantial fraction $\rho$ of total FLOPs to be routed; at ImageNet scale this is necessary but not sufficient, as multi-expert routing ($k \geq 2$) is additionally required. Two controlled experiments isolate these factors. A hidden-size sweep on CIFAR-10 yields both predicted sign reversals across standard and depthwise backbones, ruling out backbone family as the active variable. An ImageNet-1K ablation that varies only top-$k$ -- holding architecture, initialization, and $\rho$ fixed -- reverses the gap from positive to negative across all five seeds. A per-sample variant of Soft MoE that softmaxes over experts rather than the batch rescues CIFAR-100 above the dense baseline, identifying batch-axis dispatch as the dominant failure mode in per-sample CNN settings. Code and aggregate results: https://github.com/libophd/sparse-moe-vision-rho.
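The difference between softmaxing over the batch and over experts can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: each sample is treated as a single token with one routing logit per expert, Soft MoE slots and combine weights are omitted, and the function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def expert_axis_weights(logits):
    """logits[b][e] -> weights normalized over experts, one sample at a time."""
    return [softmax(row) for row in logits]


def batch_axis_weights(logits):
    """logits[b][e] -> weights normalized over the batch, one expert at a time."""
    columns = [softmax(list(col)) for col in zip(*logits)]
    return [list(row) for row in zip(*columns)]


# The same sample placed in two different batches (hypothetical logits).
sample = [2.0, 0.5, -1.0]
batch_a = [sample, [0.1, 0.2, 0.3]]
batch_b = [sample, [3.0, -2.0, 1.0]]

same_under_expert_axis = expert_axis_weights(batch_a)[0] == expert_axis_weights(batch_b)[0]
same_under_batch_axis = batch_axis_weights(batch_a)[0] == batch_axis_weights(batch_b)[0]
print(same_under_expert_axis, same_under_batch_axis)
```

With expert-axis normalization the sample's weights are identical in both batches, whereas with batch-axis normalization they change with whatever else is in the batch. That batch dependence is one plausible reading of why batch-axis dispatch hurts when every image is routed as a single unit.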

Problem

Research questions and friction points this paper is trying to address.

Sparse MoE

vision classification

expert collapse

compute efficiency

routing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse MoE

compute leverage

top-k routing