🤖 AI Summary
This work investigates the robustness of first-order optimization algorithms to relative gradient errors, such as those induced by gradient quantization or compression on GPUs. Three canonical families are studied: constant-stepsize gradient descent, long-step (i.e., large-stepsize) methods, and accelerated methods. Using the performance estimation problem (PEP) methodology, the latter two families are first shown to be theoretically non-robust to relative gradient perturbations. To address this, a semi-heuristic stepsize-shortening factor is introduced that restores convergence guarantees. Numerical experiments on a concrete inexact problem, under two relative-error models, show that the shortening factor significantly stabilizes the long-step methods, and that accelerated methods are substantially more robust in practice than current theory predicts, offering practical guidance for distributed and low-precision training.
📝 Abstract
This work assesses, both empirically and theoretically using the performance estimation methodology, how robust different first-order optimization methods are when subject to relative inexactness in their gradient computations. Relative inexactness arises, for example, when the gradient is compressed using fewer bits of information, as happens in large-scale problems on GPUs. Three major families of methods are analyzed: constant-stepsize gradient descent, long-step methods, and accelerated methods. The latter two are first shown to be theoretically non-robust to inexactness. A semi-heuristic shortening factor is then introduced to improve their theoretical guarantees. All methods are subsequently tested on a concrete inexact problem with two different types of relative inexactness; it is observed that both accelerated methods are much more robust than expected, and that the shortening factor significantly helps the long-step methods. In the end, all shortened methods appear promising, even in this inexact setting.
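To make the relative-inexactness model concrete, the sketch below runs gradient descent with an oracle that returns the true gradient perturbed by a worst-case relative error of magnitude `delta * ||grad||`, and compares the nominal stepsize `1/L` against a shortened stepsize `(1 - delta)/L`. This is a minimal illustration under assumed choices (the quadratic objective, the Gaussian error direction, and the `1 - delta` shortening factor are illustrative, not the paper's exact setup):

```python
import numpy as np

def inexact_gd(grad, x0, step, delta, iters, seed=0):
    """Gradient descent with a relatively inexact oracle:
    each returned gradient g satisfies ||g - grad(x)|| <= delta * ||grad(x)||."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(iters):
        g = grad(x)
        e = rng.standard_normal(x.shape)
        norm_e = np.linalg.norm(e)
        if norm_e > 0:
            # Scale the random direction to the worst-case relative error level.
            e *= delta * np.linalg.norm(g) / norm_e
        x = x - step * (g + e)
    return x

# Smooth convex quadratic f(x) = 0.5 * ||A x||^2 with gradient A^T A x;
# its gradient is L-Lipschitz with L = largest eigenvalue of A^T A.
A = np.diag([1.0, 3.0, 10.0])
grad = lambda x: A.T @ (A @ x)
L = 100.0  # = 10^2 for this diagonal A
x0 = np.ones(3)

delta = 0.3  # 30% relative gradient error
x_nominal = inexact_gd(grad, x0, 1.0 / L, delta, 500)
# Shortening the stepsize by (1 - delta) keeps each perturbed step a descent step.
x_shortened = inexact_gd(grad, x0, (1.0 - delta) / L, delta, 500)
print(np.linalg.norm(x_nominal), np.linalg.norm(x_shortened))
```

Because the error is relative, it shrinks together with the gradient near the minimizer, which is why convergence to the exact solution remains possible for `delta < 1`; an absolute error model would instead stall at a noise floor.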