Cautious Weight Decay

📅 2025-10-14
🤖 AI Summary
Standard weight decay uniformly penalizes all parameters, implicitly changing the objective being optimized and hindering convergence to local optima of the unmodified loss. To address this, the paper proposes Cautious Weight Decay (CWD): decay is applied only to parameter coordinates whose signs align with those of the optimizer update, preserving the original loss function. CWD admits a bilevel interpretation that searches for locally Pareto-optimal stationary points of the unmodified objective, without introducing additional hyperparameters. Drawing on sliding-mode control, CWD uses a simple sign-comparison mask for selective regularization and is compatible with mainstream optimizers including AdamW, Lion, and Muon. Empirically, on pretraining of million- to billion-parameter language models and on ImageNet classification, CWD consistently reduces final loss and improves accuracy.

📝 Abstract
We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.
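The one-line change described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's reference implementation: it assumes the cautious mask keeps decay only on coordinates where the parameter's sign matches the optimizer update's sign, and the exact sign convention in the paper may differ.

```python
import numpy as np

def cwd_step(param, update, lr, weight_decay):
    """One decoupled-decay optimizer step with Cautious Weight Decay.

    `update` is the optimizer's descent direction (e.g. Adam's
    m_hat / (sqrt(v_hat) + eps)). Decoupled decay is applied only on
    coordinates where the parameter's sign matches the update's sign;
    all other coordinates follow the unmodified loss alone.
    """
    # Sign-comparison mask: 1 where decay would push in the same
    # direction the optimizer already wants to move, 0 elsewhere.
    mask = (np.sign(param) == np.sign(update)).astype(param.dtype)
    return param - lr * update - lr * weight_decay * mask * param

# Example: the second coordinate's sign disagrees with the update,
# so it receives no decay on this step.
p = cwd_step(np.array([1.0, -1.0]), np.array([1.0, 1.0]),
             lr=0.1, weight_decay=0.5)
# → array([ 0.85, -1.1 ])
```

Compared with standard AdamW-style decoupled decay, the only change is the elementwise `mask`, which is why the method adds no hyperparameters.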
Problem

Research questions and friction points this paper is trying to address.

Standard weight decay penalizes all parameters uniformly, implicitly optimizing a regularized or constrained objective rather than the original loss
Decoupled decay can prevent convergence to stationary points of the unmodified objective
Existing regularization schemes often introduce extra hyperparameters that require additional tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Applies weight decay only where parameter signs align with the optimizer update
Preserves the original loss and admits a bilevel, sliding-mode interpretation
Drop-in change for AdamW, Lion, and Muon with no new hyperparameters