LoRDO: Distributed Low-Rank Optimization with Infrequent Communication

📅 2026-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work tackles two coupled challenges in distributed training: bandwidth-limited communication and the degraded performance of low-rank optimization under local updates, where workers lack access to the full-batch gradients needed to compute projections. The authors propose LoRDO, which integrates low-rank optimization with infrequent synchronization by constructing global projections from pseudo-gradients and adding a full-rank quasi-hyperbolic momentum update that restores subspace exploration. The method is presented as the first to effectively combine low-rank representation with low-frequency communication, compensating for the shortage of local gradient information while preserving the expressiveness of the optimization trajectory. Across model sizes from 125M to 720M parameters, experiments show the approach matches low-rank DDP while cutting communication volume by roughly 10×, with even larger gains in severely memory-constrained settings.
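The core mechanism described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the pseudo-gradient is taken to be the parameter displacement accumulated over a round of local steps (as in local-SGD / DiLoCo-style outer updates), and the "global projection" is assumed to come from the top singular vectors of that pseudo-gradient, shared by all workers. Function names, shapes, and the SVD-based projection are my assumptions.

```python
import numpy as np

def pseudo_gradient(theta_before, theta_after_local):
    # Pseudo-gradient: displacement of the parameters over one round of
    # local steps, averaged across workers. theta_after_local has shape
    # (num_workers, *param_shape). This stands in for the full-batch
    # gradient that individual workers cannot observe locally.
    return theta_before - np.mean(theta_after_local, axis=0)

def global_projection(pg, rank):
    # Hypothetical global projection: top-`rank` left singular vectors
    # of the matrix-shaped pseudo-gradient. Because it is computed from
    # the synchronized pseudo-gradient, every worker shares the same
    # subspace until the next synchronization.
    U, _, _ = np.linalg.svd(pg, full_matrices=False)
    return U[:, :rank]

# Example round: 4 workers drift from a shared 8x6 parameter matrix.
rng = np.random.default_rng(0)
theta0 = rng.normal(size=(8, 6))
theta_local = np.stack(
    [theta0 - 0.1 * rng.normal(size=(8, 6)) for _ in range(4)]
)
pg = pseudo_gradient(theta0, theta_local)
P = global_projection(pg, rank=2)          # 8x2 orthonormal basis
low_rank_pg = P @ (P.T @ pg)               # rank-2 view of the update
```

Projecting the optimizer state through `P` is what shrinks memory and communication; the summary's point is that without a corrective full-rank term, the trajectory stays trapped in this subspace between synchronizations.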

📝 Abstract
Distributed training of foundation models via $\texttt{DDP}$ is limited by interconnect bandwidth. While infrequent communication strategies reduce synchronization frequency, they remain bottlenecked by the memory and communication requirements of optimizer states. Low-rank optimizers can alleviate these constraints; however, in the local-update regime, workers lack access to the full-batch gradients required to compute low-rank projections, which degrades performance. We propose $\texttt{LoRDO}$, a principled framework unifying low-rank optimization with infrequent synchronization. We first demonstrate that, while global projections based on pseudo-gradients are theoretically superior, they permanently restrict the optimization trajectory to a low-rank subspace. To restore subspace exploration, we introduce a full-rank quasi-hyperbolic update. $\texttt{LoRDO}$ achieves near-parity with low-rank $\texttt{DDP}$ in language modeling and downstream tasks at model scales of $125$M--$720$M, while reducing communication by $\approx 10 \times$. Finally, we show that $\texttt{LoRDO}$ improves performance even more in very low-memory settings with small rank/batch size.
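The abstract's "full-rank quasi-hyperbolic update" refers to quasi-hyperbolic momentum (QHM), which interpolates between a plain gradient step and a momentum step. A minimal sketch follows; the hyperparameter values and the exact way LoRDO applies this at full rank are my assumptions, but the update rule itself is the standard QHM formula.

```python
import numpy as np

def qhm_step(theta, grad, m, lr=0.1, beta=0.9, nu=0.7):
    # Quasi-hyperbolic momentum: blend the raw gradient with the
    # momentum buffer via nu. nu = 1 recovers ordinary momentum SGD,
    # nu = 0 recovers plain SGD. Applied at full rank, the
    # (1 - nu) * grad term keeps a direct gradient component that can
    # point outside any fixed low-rank subspace, which is the
    # exploration-restoring role the abstract describes.
    m = beta * m + (1.0 - beta) * grad
    theta = theta - lr * ((1.0 - nu) * grad + nu * m)
    return theta, m

# Toy usage: minimize ||theta||^2.
theta, m = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    theta, m = qhm_step(theta, 2.0 * theta, m)
```

The design intuition from the abstract: the low-rank projection saves memory and bandwidth, while this cheap full-rank correction prevents the trajectory from being permanently confined to the projected subspace.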
Problem

Research questions and friction points this paper is trying to address.

distributed training
low-rank optimization
infrequent communication
optimizer states
gradient projection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-rank optimization
Infrequent communication
Distributed training
Subspace exploration
Memory-efficient optimization