Communication-Efficient Distributed Training for Collaborative Flat Optima Recovery in Deep Learning

📅 2025-07-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the trade-off between communication overhead and generalization in local gradient methods for distributed data-parallel training, this paper proposes a framework that pursues communication efficiency and flat-minima generalization jointly. Methodologically, it introduces (1) the *Inverse Mean Valley*, a differentiable, scale-invariant sharpness measure that correlates strongly with the generalization gap of DNNs; (2) the *Distributed Pull-Push Force (DPPF)* algorithm, which incorporates an efficient relaxation of this measure as a lightweight regularizer whose pushing force counteracts the consensus pull, letting workers collaboratively seek wide minima under relaxed synchronization; and (3) theoretical results showing that workers span flat valleys whose width is governed by the interplay of push and pull strengths, together with generalization guarantees tied to valley width and convergence guarantees in the non-convex setting. Experiments demonstrate that DPPF converges to flatter minima and generalizes better than local gradient methods and synchronous gradient averaging, while incurring significantly lower communication costs.
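The summary describes the Inverse Mean Valley only at a high level. As a rough illustration of the valley-width intuition behind such a measure, the sketch below numerically probes how far one can move from a minimum before the loss rises by a tolerance, and reports the inverse of the mean such radius. The function `inverse_mean_valley_proxy`, the toy loss, and all constants are hypothetical stand-ins, not the paper's differentiable, scale-invariant definition.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w):
    # Toy loss with a sharp direction (axis 0) and a flat direction (axis 1).
    return 0.5 * (10.0 * w[0] ** 2 + 0.1 * w[1] ** 2)

def inverse_mean_valley_proxy(w_star, eps=0.1, n_dirs=64, r_max=10.0, r_steps=200):
    """Inverse of the mean radius one can move from w_star before the loss rises by eps.

    A numerical probe of the valley-width intuition only; it is not the paper's
    Inverse Mean Valley measure.
    """
    base = loss(w_star)
    radii = []
    for _ in range(n_dirs):
        u = rng.normal(size=w_star.shape)
        u /= np.linalg.norm(u)                      # random unit direction
        r = 0.0
        for rr in np.linspace(0.0, r_max, r_steps):
            if loss(w_star + rr * u) - base > eps:  # loss has left the eps-valley
                break
            r = rr
        radii.append(r)
    return 1.0 / np.mean(radii)                     # sharper minima give larger values

print(inverse_mean_valley_proxy(np.zeros(2)))
```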

📝 Abstract
We study centralized distributed data parallel training of deep neural networks (DNNs), aiming to improve the trade-off between communication efficiency and model performance of the local gradient methods. To this end, we revisit the flat-minima hypothesis, which suggests that models with better generalization tend to lie in flatter regions of the loss landscape. We introduce a simple, yet effective, sharpness measure, Inverse Mean Valley, and demonstrate its strong correlation with the generalization gap of DNNs. We incorporate an efficient relaxation of this measure into the distributed training objective as a lightweight regularizer that encourages workers to collaboratively seek wide minima. The regularizer exerts a pushing force that counteracts the consensus step pulling the workers together, giving rise to the Distributed Pull-Push Force (DPPF) algorithm. Empirically, we show that DPPF outperforms other communication-efficient approaches and achieves better generalization performance than local gradient methods and synchronous gradient averaging, while significantly reducing communication overhead. In addition, our loss landscape visualizations confirm the ability of DPPF to locate flatter minima. On the theoretical side, we show that DPPF guides workers to span flat valleys, with the final valley width governed by the interplay between push and pull strengths, and that its pull-push dynamics is self-stabilizing. We further provide generalization guarantees linked to the valley width and prove convergence in the non-convex setting.
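As a toy illustration of the pull-push dynamics sketched in the abstract (a flatness-seeking push that counteracts a periodic consensus pull), the snippet below runs a few workers on an anisotropic quadratic valley. It is not the paper's DPPF algorithm: the constant-magnitude repulsion, the pull coefficient, and the synchronization period are illustrative assumptions, but they reproduce the qualitative behavior that workers spread along flat directions while staying tight along sharp ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Anisotropic quadratic "valley": sharp along axis 0, flat along axis 1.
A = np.diag([10.0, 0.1])

def grad(w):
    return A @ w

n_workers, steps, sync_every = 8, 300, 10
lr, pull, push = 0.05, 0.5, 0.5   # illustrative constants, not DPPF hyperparameters

workers = rng.normal(size=(n_workers, 2))
for t in range(steps):
    mean = workers.mean(axis=0)
    for i in range(n_workers):
        d = workers[i] - mean
        repel = d / (np.linalg.norm(d) + 1e-12)      # push: repel from the consensus point
        workers[i] += -lr * grad(workers[i]) + lr * push * repel
    if (t + 1) % sync_every == 0:                    # relaxed synchronization
        mean = workers.mean(axis=0)
        workers += pull * (mean - workers)           # pull: partial step toward the average

print("spread along sharp axis:", np.ptp(workers[:, 0]))
print("spread along flat axis: ", np.ptp(workers[:, 1]))
```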
Problem

Research questions and friction points this paper is trying to address.

Improve communication efficiency in distributed DNN training
Enhance model generalization via flat-minima optimization
Balance pull-push dynamics for collaborative wide minima search
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the Inverse Mean Valley sharpness measure
Proposes the Distributed Pull-Push Force (DPPF) algorithm
Encourages workers to collaboratively seek wide minima
Tolga Dimlioglu
New York University, Tandon School of Engineering, Brooklyn, NY 11201
Anna Choromanska
New York University
machine learning