🤖 AI Summary
Existing distributed error feedback (EF) algorithms under communication compression require manual tuning of step sizes using problem-specific prior parameters—such as the smoothness constant—hindering their practicality in large-scale neural network training. This work proposes a class of parameter-free adaptive EF algorithms that, for the first time, unify normalized error feedback with diverse momentum mechanisms—including Polyak, implicit gradient transport (IGT), STORM, and Hessian-corrected variants—and introduce a time-varying adaptive step-size strategy. Theoretically, under non-convex smooth optimization, the proposed methods achieve near-optimal convergence rates of $O(1/T^{1/4})$–$O(1/T^{1/3})$. Empirically, they significantly improve training stability and convergence speed under communication compression while eliminating manual, problem-dependent stepsize tuning. This substantially enhances the robustness and usability of distributed deep learning systems.
📝 Abstract
Communication compression is essential for scalable distributed training of modern machine learning models, but it often degrades convergence due to the noise it introduces. Error Feedback (EF) mechanisms are widely adopted to mitigate this issue in distributed compression algorithms. Despite their popularity and training efficiency, existing distributed EF algorithms often require prior knowledge of problem parameters (e.g., smoothness constants) to fine-tune stepsizes. This limits their practical applicability, especially in large-scale neural network training. In this paper, we study normalized error feedback algorithms that combine EF with normalized updates, various momentum variants, and parameter-agnostic, time-varying stepsizes, thus eliminating the need for problem-dependent tuning. We analyze the convergence of these algorithms for minimizing smooth functions, and establish parameter-agnostic complexity bounds that are close to the best-known bounds with carefully-tuned problem-dependent stepsizes. Specifically, we show that normalized EF21 achieves a convergence rate of nearly ${O}(1/T^{1/4})$ for Polyak's heavy-ball momentum, ${O}(1/T^{2/7})$ for Implicit Gradient Transport (IGT), and ${O}(1/T^{1/3})$ for STORM and Hessian-corrected momentum. Our results hold with decreasing stepsizes and small mini-batches. Finally, our empirical experiments confirm our theoretical insights.
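To make the recipe concrete, here is a minimal single-worker sketch of the ingredients the abstract names: an EF21-style estimator that compresses only the gradient innovation, Polyak heavy-ball momentum, a normalized update, and a decreasing, parameter-agnostic stepsize. The function names, the top-k compressor, and all constants (`beta`, `eta0`, the stepsize exponent `0.75`) are illustrative assumptions, not the paper's exact algorithm or tuning.

```python
import numpy as np

def topk(v, k):
    # Top-k sparsifier: keep the k largest-magnitude entries
    # (a standard example of a contractive compressor).
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def normalized_ef21_hb(grad, x0, T=2000, k=1, beta=0.9, eta0=0.5, eps=1e-12):
    """Illustrative sketch (not the paper's exact method): EF21 with
    normalized updates, heavy-ball momentum, and a time-varying
    stepsize eta_t = eta0 / (t + 1)**0.75 that needs no knowledge
    of the smoothness constant."""
    x = x0.astype(float).copy()
    g = grad(x)                       # EF21 gradient estimator, warm-started at the full gradient
    m = np.zeros_like(x)              # heavy-ball momentum buffer
    for t in range(T):
        m = beta * m + (1 - beta) * g
        eta = eta0 / (t + 1) ** 0.75  # decreasing, problem-independent stepsize
        x = x - eta * m / (np.linalg.norm(m) + eps)  # normalized update
        g = g + topk(grad(x) - g, k)  # EF21: transmit only the compressed innovation
    return x
```

On a simple quadratic (`grad = lambda v: v`), the iterates drift toward the minimizer even though each round communicates only one compressed coordinate; the normalization caps every step at length `eta`, which is what lets the stepsize schedule be chosen without knowing the smoothness constant.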