🤖 AI Summary
To efficiently optimize the high-dimensional configuration spaces that arise in large-scale machine learning training, this paper proposes a scalable meta-gradient computation algorithm and the Smooth Model Training (SMT) framework, enabling end-to-end differentiable joint optimization of training strategies for the first time. Methodologically, it combines reverse-mode automatic differentiation through the training loop, smooth modeling of training trajectories, and meta-gradient descent (MGD) to jointly optimize data selection, poisoning-resilient training, and learning rate scheduling. Key contributions: (1) a scalable meta-gradient computation algorithm for large-scale training; and (2) the SMT framework, which keeps MGD stable and convergent under realistic training dynamics. Experiments show that the proposed data selection method significantly outperforms existing approaches, that robustness to accuracy-degrading data poisoning attacks improves by an order of magnitude, and that a fully automated learning rate scheduler matches or exceeds hand-crafted schedules.
📝 Abstract
A major challenge in training large-scale machine learning models is configuring the training process to maximize model performance, i.e., finding the best training setup from a vast design space. In this work, we unlock a gradient-based approach to this problem. We first introduce an algorithm for efficiently calculating metagradients -- gradients through model training -- at scale. We then introduce a "smooth model training" framework that enables effective optimization using metagradients. With metagradient descent (MGD), we greatly improve on existing dataset selection methods, outperform accuracy-degrading data poisoning attacks by an order of magnitude, and automatically find competitive learning rate schedules.
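To make the core idea concrete, here is a minimal, hedged sketch of what a metagradient is: the derivative of the post-training validation loss with respect to a training-configuration parameter (here, the learning rate), obtained by differentiating through the training loop itself. The paper's contribution is doing this at scale with reverse-mode automatic differentiation; the toy below instead uses forward-mode dual numbers in pure Python purely for illustration, on a hypothetical 1-D quadratic problem that is not from the paper.

```python
# Illustrative only: a metagradient d(val loss)/d(lr) for a tiny 1-D model,
# computed by carrying derivatives through every step of the training loop.
# The paper uses scalable reverse-mode AD; this toy uses forward-mode duals.

class Dual:
    """A number paired with its derivative w.r.t. the meta-parameter."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def _lift(self, o):
        return o if isinstance(o, Dual) else Dual(o)
    def __add__(self, o):
        o = self._lift(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __sub__(self, o):
        o = self._lift(o)
        return Dual(self.val - o.val, self.dot - o.dot)
    def __mul__(self, o):
        o = self._lift(o)
        # Product rule propagates derivatives through each multiply.
        return Dual(self.val * o.val, self.val * o.dot + self.dot * o.val)
    __rmul__ = __mul__

def final_val_loss(lr):
    # Inner training loop: gradient descent on train loss (w - 2)^2,
    # whose gradient is 2*(w - 2). Every update depends on lr, so the
    # derivative w.r.t. lr flows through all steps.
    w = Dual(0.0)
    for _ in range(5):
        w = w - lr * (2.0 * (w - 2.0))
    # "Validation" loss with a slightly shifted target, (w - 1.9)^2.
    diff = w - 1.9
    return diff * diff

# Seeding lr's derivative with 1.0 yields the metagradient in .dot.
out = final_val_loss(Dual(0.1, 1.0))
print(out.val, out.dot)  # .dot < 0 here: raising lr would lower val loss
```

A metagradient-descent step would then nudge the learning rate (or, in the paper's setting, per-example data weights and other training choices) against this gradient and retrain.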