🤖 AI Summary
This work investigates when sparse top-k routing with hard capacity constraints helps in image classification, motivated by two obstacles to deploying Sparse Mixture-of-Experts (MoE) models in vision: expert collapse and limited end-to-end efficiency gains. Across multi-seed runs on CIFAR-10/100, Tiny-ImageNet, and ImageNet-1K, the study reports a compute-leverage pattern: positive accuracy gaps appear only when a substantial fraction ρ of total FLOPs is routed, and at ImageNet scale this is necessary but not sufficient, since multi-expert routing (k ≥ 2) is also required. Two controlled experiments isolate these factors. A hidden-size sweep on CIFAR-10 reproduces the predicted sign reversals on both standard and depthwise backbones, ruling out backbone family as the active variable, and an ImageNet-1K ablation that varies only top-k reverses the accuracy gap from positive to negative across all five seeds. The authors also propose a per-sample variant of Soft MoE that applies the softmax over experts rather than over the batch; it lifts CIFAR-100 accuracy above the dense baseline and identifies batch-axis dispatch as the dominant failure mode in per-sample CNN settings.
📝 Abstract
Mixture-of-Experts (MoE) networks promise favorable accuracy-compute trade-offs, yet practical vision deployments are hindered by expert collapse and limited end-to-end efficiency gains. We study when sparse top-$k$ routing with hard capacity constraints helps in vision classification, evaluated under multi-seed protocols on four benchmarks (CIFAR-10/100, Tiny-ImageNet, ImageNet-1K). We observe a \emph{compute-leverage pattern}: positive accuracy gaps require a substantial fraction $\rho$ of total FLOPs to be routed; at ImageNet scale this is necessary but not sufficient, as multi-expert routing ($k \geq 2$) is additionally required. Two controlled experiments isolate these factors. A hidden-size sweep on CIFAR-10 yields both predicted sign reversals across standard and depthwise backbones, ruling out backbone family as the active variable. An ImageNet-1K ablation that varies only top-$k$ -- holding architecture, initialization, and $\rho$ fixed -- reverses the gap from positive to negative across all five seeds. A per-sample variant of Soft MoE that softmaxes over experts rather than the batch rescues CIFAR-100 above the dense baseline, identifying batch-axis dispatch as the dominant failure mode in per-sample CNN settings. Code and aggregate results: https://github.com/libophd/sparse-moe-vision-rho.
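The two quantities the abstract turns on, the routed FLOP fraction $\rho$ and the axis over which the Soft MoE softmax is taken, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy logits, and the FLOP accounting (routed FLOPs divided by total FLOPs) are assumptions made for illustration, and the released code may define them differently.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dispatch_over_batch(logits):
    """Batch-axis softmax: for each expert, normalize over the samples.

    logits[i][e] is the router score of sample i for expert e.
    Each expert's column sums to 1, so a sample's weights depend on
    which other samples share its batch.
    """
    n_experts = len(logits[0])
    columns = [softmax([row[e] for row in logits]) for e in range(n_experts)]
    return [[columns[e][i] for e in range(n_experts)]
            for i in range(len(logits))]


def dispatch_per_sample(logits):
    """Expert-axis softmax: for each sample, normalize over the experts.

    Each row sums to 1 and is independent of the rest of the batch.
    """
    return [softmax(row) for row in logits]


def routed_fraction(routed_flops, total_flops):
    """rho: share of total FLOPs spent in routed (expert) layers.

    Assumed accounting; the source only describes rho as the fraction
    of total FLOPs that is routed.
    """
    if total_flops <= 0:
        raise ValueError("total_flops must be positive")
    return routed_flops / total_flops


# Hypothetical router scores: 4 samples, 3 experts.
logits = [
    [2.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [1.0, 3.0, 0.0],
    [0.0, 0.0, 2.0],
]

if __name__ == "__main__":
    print("batch-axis weights, sample 0:", dispatch_over_batch(logits)[0])
    print("expert-axis weights, sample 0:", dispatch_per_sample(logits)[0])
    print("rho:", routed_fraction(30.0, 120.0))
```

In this toy setting the expert-axis weights for a sample are unchanged when the batch is truncated or reshuffled, while the batch-axis weights shift with batch composition. That coupling across samples is one plausible reading of why batch-axis dispatch is named as the failure mode in per-sample CNN settings.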