🤖 AI Summary
Standard weight decay uniformly penalizes all parameters, so the optimizer implicitly minimizes a regularized objective rather than the original loss, which can prevent convergence to stationary points of the unmodified loss. To address this, the paper proposes Cautious Weight Decay (CWD): decay is applied only to parameter coordinates whose signs align with the optimizer update, thereby preserving the original loss. CWD admits a bilevel interpretation: upon reaching the stationary manifold it induces sliding-mode behavior, letting it search for locally Pareto-optimal stationary points of the unmodified objective, and it introduces no additional hyperparameters. The sign-comparison logic makes CWD a drop-in change for mainstream optimizers including AdamW, Lion, and Muon. Empirically, on language model pre-training at up to billion-parameter scale and on ImageNet classification, CWD consistently reduces final loss and improves accuracy.
📝 Abstract
We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.