Less is More: Convergence Benefits of Fewer Data Weight Updates over Longer Horizon

📅 2026-02-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the impact of insufficient inner-loop convergence in data mixing training, which introduces hypergradient approximation errors and degrades model robustness. Under a fixed total update budget, the study addresses the problem of selecting the optimal number of inner-loop steps \( T \) to achieve effective convergence. Leveraging a bilevel optimization framework and strong convexity assumptions, the authors combine theoretical analysis with numerical validation to demonstrate, for the first time, that the commonly adopted greedy strategy (\( T = 1 \)) can fail. They further establish the asymptotically optimal relationship between \( T \) and the total budget \( N \): \( \Theta(\log N) \) in the full-gradient setting and \( \Theta(\sqrt{N \log N}) \) under stochastic gradients. This work provides a quantitative foundation for balancing computational cost and optimization accuracy in practical training scenarios.

📝 Abstract
Data mixing--the strategic reweighting of training domains--is a critical component in training robust machine learning models. This problem is naturally formulated as a bilevel optimization task, where the outer loop optimizes domain weights to minimize validation loss, and the inner loop optimizes model parameters to minimize the weighted training loss. Classical bilevel optimization relies on hypergradients, which theoretically require the inner optimization to reach convergence. However, due to computational constraints, state-of-the-art methods use a finite, often small, number of inner update steps before updating the weights. The theoretical implications of this approximation are not well understood. In this work, we rigorously analyze the convergence behavior of data mixing with a finite number of inner steps $T$. We prove that the "greedy" practical approach of using $T=1$ can fail even in a simple quadratic example. Under a fixed parameter update budget $N$ and assuming the per-domain losses are strongly convex, we show that the optimal $T$ scales as $\Theta(\log N)$ (resp., $\Theta({(N \log N)}^{1/2})$) for the data mixing problem with access to full (resp., stochastic) gradients. We complement our theoretical results with proof-of-concept experiments.
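The truncated bilevel scheme the abstract describes can be sketched on a 1-D toy instance: strongly convex quadratic per-domain losses, `T` inner gradient steps on the weighted training loss per outer update of the mixing weights, and a total budget of `N` inner updates. Everything below (the losses, centers, step sizes, and the finite-difference hypergradient) is an illustrative assumption, not the paper's construction or algorithm.

```python
import numpy as np

CENTERS = np.array([0.0, 2.0])  # per-domain optima; each L_i is strongly convex

def domain_grads(theta):
    # gradients of the quadratic domain losses L_i(theta) = 0.5 * (theta - c_i)^2
    return theta - CENTERS

def val_loss(theta):
    # validation loss with optimum at 1.5 (reachable only by upweighting domain 2)
    return 0.5 * (theta - 1.5) ** 2

def softmax(z):
    # map unconstrained logits to simplex weights
    e = np.exp(z - z.max())
    return e / e.sum()

def inner(theta, w, T, lr):
    # inner loop: T gradient steps on the weighted training loss sum_i w_i * L_i
    for _ in range(T):
        theta = theta - lr * (w * domain_grads(theta)).sum()
    return theta

def run(T, N, lr_in=0.2, lr_out=1.0, eps=1e-4):
    """Truncated bilevel data mixing under a total budget of N inner updates:
    T inner steps per outer weight update, hence N // T outer updates."""
    theta, logits = 0.0, np.zeros(2)
    for _ in range(N // T):
        w = softmax(logits)
        theta_next = inner(theta, w, T, lr_in)
        # finite-difference approximation of the T-step hypergradient w.r.t. logits
        g = np.zeros(2)
        for i in range(2):
            e_i = np.zeros(2)
            e_i[i] = eps
            g[i] = (val_loss(inner(theta, softmax(logits + e_i), T, lr_in))
                    - val_loss(theta_next)) / eps
        logits -= lr_out * g
        theta = theta_next
    return theta, val_loss(theta)
```

With the budget `N` held fixed, varying `T` trades inner-loop convergence (larger `T`, more accurate hypergradients) against the number of outer weight updates (`N // T`), which is precisely the tradeoff the paper's $\Theta(\log N)$ and $\Theta((N \log N)^{1/2})$ results quantify.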
Problem

Research questions and friction points this paper is trying to address.

bilevel optimization
data mixing
convergence analysis
hypergradients
inner-loop updates
Innovation

Methods, ideas, or system contributions that make the work stand out.

bilevel optimization
data mixing
convergence analysis
hypergradient approximation
inner-loop steps