🤖 AI Summary
This work addresses the high communication overhead of gradient synchronization in bandwidth-constrained distributed training, where existing low-rank optimizers remain inefficient. The authors propose TSR-Adam, the first Adam-style optimizer to incorporate two-sided low-rank communication. By synchronizing only the compact core matrix \( U^\top G V \in \mathbb{R}^{r \times r} \), TSR-Adam reduces the per-step communication payload from \( O(mn) \) to \( O(r^2) \). The method further integrates a randomized SVD-based refresh mechanism that avoids full-gradient synchronization, together with embedding-layer-specific rank selection and refresh schedules. This approach substantially lowers both communication and memory costs: it achieves an average 13× reduction in communicated bytes during pretraining on models ranging from 60M to 1B parameters and a 25× reduction on GLUE fine-tuning, while maintaining comparable performance and providing theoretical stability guarantees.
📝 Abstract
As foundation models continue to scale, pretraining increasingly relies on data-parallel distributed optimization, making bandwidth-limited gradient synchronization a key bottleneck. Separately, projection-based low-rank optimizers have mainly been designed for memory efficiency and remain suboptimal for communication-limited training: one-sided synchronization still transmits an $O(rn)$ object for an $m\times n$ matrix gradient, and refresh steps can dominate peak communicated bytes. We propose TSR, which brings two-sided low-rank communication to Adam-family updates (TSR-Adam) by synchronizing a compact core $U^\top G V\in\mathbb{R}^{r\times r}$, reducing the dominant per-step payload from $O(mn)$ to $O(r^2)$ while keeping moment states in low-dimensional cores. To further reduce the peak communication from subspace refresh, TSR-Adam adopts a randomized SVD-based refresh that avoids full-gradient synchronization. We additionally extend low-rank communication to embedding gradients with embedding-specific ranks and refresh schedules, yielding additional communication and memory savings over keeping embeddings dense. Across pretraining from 60M to 1B model scales, TSR-Adam reduces average communicated bytes per step by $13\times$, and on GLUE fine-tuning it reduces communication by $25\times$, while achieving comparable performance; we further provide a theoretical stationarity analysis for the proposed update. Code is available at https://github.com/DKmiyan/TSR-Adam.
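The two-sided compression described above can be illustrated with a minimal numpy sketch (this is not the authors' implementation; all variable names are illustrative). Each worker projects its $m\times n$ gradient $G$ onto orthonormal bases $U$ and $V$, communicates only the $r\times r$ core $U^\top G V$, and reconstructs a low-rank estimate after synchronization:

```python
import numpy as np

# Hedged sketch of two-sided low-rank gradient compression, as described
# for TSR-Adam. Only the r x r core U^T G V would be all-reduced across
# workers, instead of the full m x n gradient.

rng = np.random.default_rng(0)
m, n, r = 256, 128, 8

# Orthonormal projection bases. In TSR-Adam these would come from a
# (periodically refreshed) randomized SVD; here we simply orthonormalize
# random Gaussian matrices for illustration.
U, _ = np.linalg.qr(rng.standard_normal((m, r)))  # (m, r), U^T U = I
V, _ = np.linalg.qr(rng.standard_normal((n, r)))  # (n, r), V^T V = I

G = rng.standard_normal((m, n))   # local gradient of one worker
core = U.T @ G @ V                # (r, r) payload to synchronize
G_hat = U @ core @ V.T            # low-rank reconstruction after sync

# Per-step communication reduction for this layer: O(mn) -> O(r^2).
ratio = G.size / core.size        # e.g. 256*128 / 64 = 512x here
```

Note that moment states can then be maintained in the $r\times r$ core space rather than at full $m\times n$ size, which is where the memory savings in the abstract come from; the sketch above only shows the communication side.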