🤖 AI Summary
This work challenges the conventional view that consensus error in decentralized training is inherently detrimental to convergence and generalization. The authors propose Decentralized SGD with Adaptive Consensus (DSGD-AC), which deliberately preserves a non-vanishing consensus error through a time-varying scaling mechanism, turning it into a structured perturbation that steers optimization toward flatter minima. The analysis shows that consensus error aligns with the dominant Hessian subspace and can therefore act as a beneficial implicit regularizer rather than mere noise. On image classification and machine translation benchmarks, DSGD-AC consistently outperforms both standard decentralized SGD and centralized SGD in test accuracy and solution flatness.
📝 Abstract
Decentralized training is often regarded as inferior to centralized training because the consensus errors between workers are thought to undermine convergence and generalization, even with homogeneous data distributions. This work challenges that view by introducing Decentralized SGD with Adaptive Consensus (DSGD-AC), which intentionally preserves non-vanishing consensus errors through a time-dependent scaling mechanism. We prove that these errors are not random noise but systematically align with the dominant Hessian subspace, acting as structured perturbations that guide optimization toward flatter minima. Across image classification and machine translation benchmarks, DSGD-AC consistently surpasses both standard decentralized SGD (DSGD) and centralized SGD in test accuracy and solution flatness. Together, these results establish consensus errors as a useful implicit regularizer and open a new perspective on the design of decentralized learning algorithms.
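To make the notion of a preserved consensus error concrete, the toy sketch below runs decentralized SGD on a ring of workers with a tunable consensus strength. It is not the DSGD-AC update itself: the abstract does not state the scaling rule, so the interpolation between local iterates and gossip averages, the `gamma_schedule` argument, the ring topology, and the quadratic loss are all assumptions made for illustration.

```python
import numpy as np

def ring_mixing_matrix(n):
    """Doubly stochastic gossip matrix: each worker averages with its two ring neighbours."""
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i, j % n] = 1.0 / 3.0
    return W

def consensus_error(X):
    """Mean squared distance of the worker parameters from their average."""
    return float(np.mean(np.sum((X - X.mean(axis=0)) ** 2, axis=1)))

def run(gamma_schedule, steps=200, n=8, d=5, lr=0.1, noise=0.1, seed=0):
    """Toy decentralized SGD on the loss 0.5 * ||x||^2 with noisy gradients.

    gamma_schedule(t) in [0, 1] is a hypothetical consensus strength:
    1.0 applies the full gossip average, smaller values keep workers apart.
    Returns the final iterates and the consensus error averaged over the
    second half of the run.
    """
    rng = np.random.default_rng(seed)
    W = ring_mixing_matrix(n)
    X = rng.normal(size=(n, d))  # one row of parameters per worker
    errors = []
    for t in range(steps):
        grads = X + noise * rng.normal(size=(n, d))  # stochastic gradient
        X_half = X - lr * grads                      # local SGD step
        gamma = gamma_schedule(t)
        # Assumed form: partial gossip averaging controlled by gamma.
        X = (1.0 - gamma) * X_half + gamma * (W @ X_half)
        if t >= steps // 2:
            errors.append(consensus_error(X))
    return X, float(np.mean(errors))

X_full, err_full = run(lambda t: 1.0)  # full gossip averaging every step
X_weak, err_weak = run(lambda t: 0.1)  # weakened consensus
print(f"consensus error, gamma=1.0: {err_full:.4f}")
print(f"consensus error, gamma=0.1: {err_weak:.4f}")
```

In this toy setting, weakening the consensus step leaves a larger steady-state disagreement between workers while the averaged iterate follows the same trajectory, because the mixing matrix is doubly stochastic. Whether that disagreement aligns with the dominant Hessian subspace, as the abstract claims for DSGD-AC, is not something this sketch tests.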