🤖 AI Summary
Large-scale hyperparameter tuning is computationally expensive and lacks rigorous theoretical foundations. Method: This paper introduces the "trajectory invariance" principle: coupling learning rate and weight decay through a single combined quantity yields near-identical training loss curves, gradient noise profiles, and gradient norm dynamics across diverse hyperparameter configurations, effectively collapsing the two-dimensional tuning space into a one-dimensional manifold. Contribution/Results: The principle supplies a general guiding rule for hyperparameter optimization, substantially reducing search dimensionality and tuning cost. It refines existing scaling laws and challenges conventional assumptions, such as tuning learning rate and weight decay independently. Validated across multiple architectures and tasks via pre-training loss analysis, gradient noise modeling, and empirical gradient norm observations, the principle shows broad applicability and practical utility.
📝 Abstract
As hyperparameter tuning becomes increasingly costly at scale, efficient tuning methods are essential. Yet principles for guiding hyperparameter tuning remain limited. In this work, we seek to establish such principles by considering a broad range of hyperparameters, including batch size, learning rate, and weight decay. We identify a phenomenon we call trajectory invariance, where pre-training loss curves, gradient noise, and gradient norm are invariant (i.e., closely overlap) with respect to a quantity that combines learning rate and weight decay. This phenomenon effectively reduces the original two-dimensional hyperparameter space to one dimension, yielding an efficient tuning rule: follow the salient direction revealed by trajectory invariance. Furthermore, we refine previous scaling laws and challenge several existing viewpoints. Overall, our work proposes new principles for efficient tuning and inspires future research on scaling laws.
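To make the dimensionality reduction concrete, here is a minimal sketch of how a 2-D (learning rate, weight decay) grid collapses to a 1-D search when configurations sharing the same combined quantity are treated as equivalent. The abstract does not state what that quantity is; the product `lr * wd` used below is a placeholder assumption for illustration only, not the paper's definition.

```python
# Hypothetical sketch: collapsing a 2-D (lr, wd) grid into a 1-D search.
# ASSUMPTION: the invariant combining quantity is lr * wd (placeholder;
# the abstract does not specify the actual definition).
from itertools import product

lrs = [1e-4, 3e-4, 1e-3, 3e-3]
wds = [0.01, 0.03, 0.1, 0.3]

def combined(lr, wd):
    # assumed invariant quantity; stand-in for the paper's actual formula
    return lr * wd

# Full 2-D grid: every (lr, wd) pair
grid_2d = list(product(lrs, wds))

# Under trajectory invariance, configs with the same combined value are
# expected to trace near-identical loss curves, so one representative per
# invariant value suffices.
representatives = {}
for lr, wd in grid_2d:
    key = round(combined(lr, wd), 12)  # quantize away float noise
    representatives.setdefault(key, (lr, wd))

print(len(grid_2d), "->", len(representatives), "configs to evaluate")
```

In this toy grid the 16 original configurations reduce to 9 distinct invariant values, and in general the search cost scales with the number of distinct values of the combined quantity rather than with the full grid size.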