Regularized Top-$k$: A Bayesian Framework for Gradient Sparsification

πŸ“… 2025-01-10
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address the error accumulation and learning-rate imbalance caused by gradient sparsification (e.g., Top-$k$) in distributed training, this paper proposes RegTop-$k$, a method that models gradient selection as a Bayesian inverse probability (inference) problem. It derives adaptive sparsity masks via maximum-a-posteriori estimation and explicitly regularizes the learning-rate scaling induced by accumulated errors, addressing a shortcoming of conventional Top-$k$, whose fixed selection rule can prevent convergence to the optimum. In distributed linear regression, RegTop-$k$ is shown to converge to the global optimum at high compression ratios, whereas Top-$k$ stagnates at a fixed distance from it. Experiments on distributed training of ResNet-18 on CIFAR-10 demonstrate that RegTop-$k$ consistently attains higher test accuracy than Top-$k$ at comparable communication budgets.
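For context, the baseline the paper builds on is standard Top-$k$ sparsification with error feedback: entries not selected in a round are accumulated into a residual and retried later. A minimal NumPy sketch of this widely used mechanism (not the paper's RegTop-$k$ itself):

```python
import numpy as np

def topk_with_error_feedback(grad, residual, k):
    """Top-k sparsification with error feedback: unselected entries
    accumulate in a residual and are re-added at the next step,
    which effectively rescales the learning rate per entry."""
    corrected = grad + residual                # add the accumulated error
    idx = np.argsort(np.abs(corrected))[-k:]   # indices of the k largest magnitudes
    sparse = np.zeros_like(corrected)
    sparse[idx] = corrected[idx]               # only these entries are transmitted
    new_residual = corrected - sparse          # everything else is carried over
    return sparse, new_residual

g = np.array([0.1, -2.0, 0.5, 3.0, -0.2])
r = np.zeros_like(g)
s, r = topk_with_error_feedback(g, r, k=2)     # keeps 3.0 and -2.0; rest goes to r
```

Over repeated steps the residual of an initially small entry grows until it crosses the selection threshold, which is exactly the implicit learning-rate scaling the paper sets out to control.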

πŸ“ Abstract
Error accumulation is effective for gradient sparsification in distributed settings: initially-unselected gradient entries are eventually selected as their accumulated error exceeds a certain level. The accumulation essentially behaves as a scaling of the learning rate for the selected entries. Although this property prevents the slow-down of lateral movements in distributed gradient descent, it can deteriorate convergence in some settings. This work proposes a novel sparsification scheme that controls the learning rate scaling of error accumulation. The development of this scheme follows two major steps: first, gradient sparsification is formulated as an inverse probability (inference) problem, and the Bayesian optimal sparsification mask is derived as a maximum-a-posteriori estimator. Using the prior distribution inherited from Top-$k$, we derive a new sparsification algorithm which can be interpreted as a regularized form of Top-$k$. We call this algorithm regularized Top-$k$ (RegTop-$k$). It utilizes past aggregated gradients to evaluate posterior statistics of the next aggregation. It then prioritizes the local accumulated gradient entries based on these posterior statistics. We validate our derivation through numerical experiments. In distributed linear regression, it is observed that while Top-$k$ remains at a fixed distance from the global optimum, RegTop-$k$ converges to the global optimum at significantly higher compression ratios. We further demonstrate the generalization of this observation by employing RegTop-$k$ in distributed training of ResNet-18 on CIFAR-10, where it noticeably outperforms Top-$k$.
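The abstract describes prioritizing local accumulated gradient entries using posterior statistics computed from past aggregated gradients. The exact posterior form is derived in the paper; the sketch below is only a schematic illustration of a "regularized" Top-$k$ scoring rule, where the score function, the `lam` parameter, and all names are assumptions, not the paper's algorithm:

```python
import numpy as np

def regularized_topk_mask(accumulated, past_aggregate, k, lam=0.5):
    """Schematic sketch (NOT the paper's exact MAP derivation): score
    each accumulated entry by its magnitude, penalized by its deviation
    from the corresponding past aggregated gradient entry. Entries that
    agree with the past aggregate are prioritized even if slightly smaller."""
    score = np.abs(accumulated) - lam * np.abs(accumulated - past_aggregate)
    idx = np.argsort(score)[-k:]               # k highest-scoring entries
    mask = np.zeros(accumulated.shape, dtype=bool)
    mask[idx] = True
    return mask

acc  = np.array([3.0, 2.5, 0.1])               # local accumulated gradients
past = np.array([0.0, 2.4, 0.1])               # past aggregated gradient
mask = regularized_topk_mask(acc, past, k=1)   # selects entry 1, not the largest entry 0
```

With `lam = 0` the rule reduces to plain Top-$k$ on the accumulated gradient; larger `lam` increasingly favors entries consistent with the past aggregation, which is the qualitative behavior the abstract attributes to the posterior-based prioritization.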
Problem

Research questions and friction points this paper is trying to address.

Distributed Computing
Gradient Computation
Error Accumulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

RegTop-$k$
gradient reduction
prediction accuracy