🤖 AI Summary
This study investigates whether geometric regularization—specifically, orthogonality-inducing losses in weight space—effectively promotes diversity and specialization among experts in Mixture-of-Experts (MoE) models. Through systematic ablation experiments across multiple datasets (WikiText-103, TinyStories, PTB), the authors apply varying strengths of orthogonality loss and evaluate its impact using weight- and activation-overlap metrics such as Mean Squared Overlap (MSO). The results reveal that orthogonality regularization reduces neither weight- nor activation-space redundancy consistently, and it yields no reliable performance improvements; in some cases, it even degrades model performance. Crucially, the study finds no significant correlation between weight-space orthogonality and expert activation behavior (r = –0.293, p = 0.523), challenging the prevailing assumption that geometric regularization in weight space enhances expert specialization in MoE architectures.
📝 Abstract
Mixture-of-Experts (MoE) models achieve efficiency through sparse activation, but the role of geometric regularization in expert specialization remains unclear. We apply orthogonality loss to enforce expert diversity and find it fails on multiple fronts: it does not reduce weight-space overlap (MSO actually increases by up to 114%), activation-space overlap remains high (~0.6) regardless of regularization, and effects on performance are inconsistent -- marginal improvement on WikiText-103 (-0.9%), slight degradation on TinyStories (+0.9%), and highly variable results on PTB (std > 1.0). Our analysis across 7 regularization strengths reveals no significant correlation (r = -0.293, p = 0.523) between weight and activation orthogonality. These findings demonstrate that weight-space regularization neither achieves its geometric goal nor reliably improves performance, making it unsuitable for MoE diversity.
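The abstract does not give the exact formulas for the overlap metric or the penalty, so here is a minimal sketch of the two quantities it refers to, under assumed definitions: MSO as the mean squared pairwise cosine similarity between distinct (flattened) expert weight vectors, and the orthogonality loss as the squared Frobenius norm of the off-diagonal entries of the normalized Gram matrix. Both the function names and these definitions are illustrative, not taken from the paper.

```python
import numpy as np

def mean_squared_overlap(W):
    """Assumed MSO: mean of squared cosine similarities between
    distinct experts. W has shape (num_experts, dim), one flattened
    expert weight vector per row."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    G = Wn @ Wn.T                                # pairwise cosines
    off_diag = G[~np.eye(len(W), dtype=bool)]    # drop self-similarity
    return float(np.mean(off_diag ** 2))         # 0 = orthogonal experts

def orthogonality_loss(W):
    """One common orthogonality-inducing penalty (an assumption here):
    squared Frobenius norm of the normalized Gram matrix minus identity,
    which penalizes any non-zero cosine between distinct experts."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    G = Wn @ Wn.T
    return float(np.sum((G - np.eye(len(W))) ** 2))

# Toy example: 8 random experts with 64-dim flattened weights.
rng = np.random.default_rng(0)
experts = rng.standard_normal((8, 64))
print(mean_squared_overlap(experts))   # small but non-zero overlap
print(orthogonality_loss(experts))     # penalty the regularizer minimizes
```

Note that both quantities go to zero together for exactly orthogonal weight rows; the paper's finding is that minimizing the penalty during training nevertheless fails to drive down overlap, in either weight or activation space.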