🤖 AI Summary
This work investigates the conditions under which signSGD outperforms SGD, focusing on linear regression settings where features and targets exhibit distinct power-law decay rates. Leveraging a random feature model, we derive an explicit expression for the population risk after a single pass of signSGD training, uncovering its unique drift-normalization and noise-reshaping effects. Theoretical analysis reveals that in noise-dominated regimes, signSGD achieves a steeper compute-optimal scaling law than SGD. Furthermore, by introducing a warmup-stable-decay (WSD) learning rate schedule, we substantially suppress noise in scenarios where features decay rapidly while the target decays slowly, enabling signSGD to surpass SGD in both compute efficiency and generalization.
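The setup above can be sketched concretely: draw Gaussian-sketched features with a power-law spectrum, generate targets with their own power-law coefficients, and run one-pass signSGD on the squared loss. This is a minimal illustrative sketch, not the paper's implementation; all sizes, exponents, and the learning rate below are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): ambient dim D, model size d, steps T.
D, d, T = 128, 32, 4000
alpha, beta = 1.2, 0.8      # assumed feature / target power-law decay exponents
lr = 0.02                   # constant learning rate for this sketch

# Power-law feature spectrum and target coefficients (illustrative PLRF-style choice).
spectrum = np.arange(1.0, D + 1) ** (-alpha)   # eigenvalues lambda_k ~ k^(-alpha)
target = np.arange(1.0, D + 1) ** (-beta)      # target coefficients ~ k^(-beta)

W = rng.normal(size=(d, D)) / np.sqrt(D)       # Gaussian sketch of the features

def risk(w):
    # Population risk of the linear predictor w on sketched features:
    # E[(w . Wx - y)^2] with x ~ N(0, diag(spectrum)) and y = target . x.
    r = W.T @ w - target
    return float(r @ (spectrum * r))

w = np.zeros(d)
risk_init = risk(w)
for _ in range(T):
    x = rng.normal(size=D) * np.sqrt(spectrum)   # fresh sample each step (one pass)
    v = W @ x
    grad = (w @ v - target @ x) * v              # squared-loss gradient
    w -= lr * np.sign(grad)                      # signSGD: step along the gradient's sign

print(risk_init, risk(w))
```

The per-coordinate step size `lr * sign(grad)` is what produces the drift-normalization and noise-reshaping effects relative to plain SGD, whose step would instead scale with the raw gradient magnitude.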
📝 Abstract
We study scaling laws of signSGD under a power-law random features (PLRF) model that accounts for both feature and target decay. We analyze the population risk of a linear model trained with one-pass signSGD on Gaussian-sketched features. We express the risk as a function of model size, training steps, learning rate, and the feature and target decay parameters. Comparing against the SGD risk analyzed by Paquette et al. (2024), we identify a drift-normalization effect and a noise-reshaping effect unique to signSGD. We then obtain compute-optimal scaling laws under the optimal choice of learning rate. Our analysis shows that the noise-reshaping effect can make the compute-optimal slope of signSGD steeper than that of SGD in regimes where noise is dominant. Finally, we observe that the widely used warmup-stable-decay (WSD) schedule further reduces the noise term and sharpens the compute-optimal slope, when feature decay is fast but target decay is slow.
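The WSD schedule referenced above can be written as a simple piecewise function of the step index: linear warmup to a peak rate, a constant plateau, then decay to zero. The warmup and decay fractions below are illustrative defaults, not the paper's choices.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.1, decay_frac=0.2):
    """Warmup-stable-decay (WSD) schedule: linear warmup to peak_lr,
    a constant plateau, then linear decay to zero.

    warmup_frac and decay_frac are hypothetical defaults for illustration."""
    warmup = int(warmup_frac * total_steps)
    decay_start = int((1.0 - decay_frac) * total_steps)
    if step < warmup:
        return peak_lr * (step + 1) / warmup          # warmup phase
    if step < decay_start:
        return peak_lr                                # stable phase
    # decay phase: linear ramp from peak_lr down to zero
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - decay_start))
```

The decay phase is what shrinks the noise floor at the end of training: the stable phase makes fast progress on the drift term, while the final small learning rates suppress the sign-noise contribution to the risk.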