🤖 AI Summary
To efficiently optimize the high-dimensional configuration spaces that arise in large-scale machine learning training, this paper proposes a scalable meta-gradient computation algorithm and the Smooth Model Training (SMT) framework, enabling end-to-end differentiable joint optimization of training strategies for the first time. Methodologically, it combines reverse-mode automatic differentiation through the training loop, smooth modeling of training trajectories, and meta-gradient descent (MGD) to jointly optimize data selection, poisoning-resilient training, and learning rate scheduling. Key contributions: (1) a scalable meta-gradient computation algorithm for large-scale training; and (2) the SMT framework, which keeps MGD stable and convergent under realistic training dynamics. Experiments show that the proposed data selection method significantly outperforms existing approaches, that robustness to accuracy-degrading data poisoning attacks improves by an order of magnitude, and that a fully automated learning rate scheduler matches or exceeds hand-crafted schedules.
📝 Abstract
A major challenge in training large-scale machine learning models is configuring the training process to maximize model performance, i.e., finding the best training setup from a vast design space. In this work, we unlock a gradient-based approach to this problem. We first introduce an algorithm for efficiently calculating metagradients -- gradients through model training -- at scale. We then introduce a "smooth model training" framework that enables effective optimization using metagradients. With metagradient descent (MGD), we greatly improve on existing dataset selection methods, outperform accuracy-degrading data poisoning attacks by an order of magnitude, and automatically find competitive learning rate schedules.
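To make the core idea concrete, here is a minimal, hedged sketch of what a metagradient is: the derivative of the post-training validation loss with respect to a training-configuration parameter (here, the learning rate), obtained by differentiating through the training loop itself. The paper's contribution is doing this at scale with reverse-mode automatic differentiation; the toy below instead uses forward-mode dual numbers in pure Python purely for illustration, on a hypothetical 1-D quadratic problem that is not from the paper.

```python
# Illustrative only: a metagradient d(val loss)/d(lr) for a tiny 1-D model,
# computed by carrying derivatives through every step of the training loop.
# The paper uses scalable reverse-mode AD; this toy uses forward-mode duals.

class Dual:
    """A number paired with its derivative w.r.t. the meta-parameter."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def _lift(self, o):
        return o if isinstance(o, Dual) else Dual(o)
    def __add__(self, o):
        o = self._lift(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __sub__(self, o):
        o = self._lift(o)
        return Dual(self.val - o.val, self.dot - o.dot)
    def __mul__(self, o):
        o = self._lift(o)
        # Product rule propagates derivatives through each multiply.
        return Dual(self.val * o.val, self.val * o.dot + self.dot * o.val)
    __rmul__ = __mul__

def final_val_loss(lr):
    # Inner training loop: gradient descent on train loss (w - 2)^2,
    # whose gradient is 2*(w - 2). Every update depends on lr, so the
    # derivative w.r.t. lr flows through all steps.
    w = Dual(0.0)
    for _ in range(5):
        w = w - lr * (2.0 * (w - 2.0))
    # "Validation" loss with a slightly shifted target, (w - 1.9)^2.
    diff = w - 1.9
    return diff * diff

# Seeding lr's derivative with 1.0 yields the metagradient in .dot.
out = final_val_loss(Dual(0.1, 1.0))
print(out.val, out.dot)  # .dot < 0 here: raising lr would lower val loss
```

A metagradient-descent step would then nudge the learning rate (or, in the paper's setting, per-example data weights and other training choices) against this gradient and retrain.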