🤖 AI Summary
To address the communication bottleneck caused by gradient synchronization in distributed training of large-scale models, this paper proposes Dion, a communication-efficient optimizer that preserves synchronous training semantics. The method rests on three key ideas: (1) orthonormalized parameter updates that decouple inter-device gradient dependencies; (2) device-local momentum buffers, which remove the need to exchange full gradient matrices across devices; and (3) a reconstruction-free scheduling strategy for large-matrix sharding and sparse communication. Crucially, the approach retains the synchronous semantics of standard DDP and FSDP, without compromising convergence accuracy or stability, while substantially reducing communication overhead. Empirical evaluation demonstrates significant throughput improvements in multi-GPU training and scalable performance up to 1,000 GPUs.
📝 Abstract
Training large AI models efficiently requires distributing computation across multiple accelerators, but this often incurs significant communication overhead -- especially during gradient synchronization. We introduce Dion, a communication-efficient optimizer that retains the synchronous semantics of standard distributed training (e.g., DDP, FSDP) while substantially reducing I/O costs. Unlike conventional optimizers that synchronize full gradient matrices, Dion leverages orthonormalized updates with device-local momentum buffers, eliminating the need for full gradient exchange. It further supports an efficient sharding strategy that avoids reconstructing large matrices during training.
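To make the core idea concrete, here is a minimal single-device sketch of an orthonormalized update applied to a device-local momentum buffer. This is an illustration only, not Dion's actual algorithm: the function and parameter names (`orthonormalize`, `dion_like_step`, `lr`, `mu`) are assumptions, and the orthonormalization here uses a full SVD-based polar factor, whereas an efficient implementation would amortize this (e.g., via low-rank approximation) and shard the matrices across devices.

```python
import numpy as np

def orthonormalize(M):
    # Polar factor via SVD: the closest matrix to M with
    # orthonormal columns. (Illustrative only; a practical
    # optimizer would avoid a full SVD per step.)
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def dion_like_step(param, grad, momentum, lr=0.01, mu=0.95):
    """Hedged sketch: momentum stays local to the device, and the
    parameter update uses an orthonormalized direction instead of
    the raw (synchronized) gradient matrix."""
    momentum = mu * momentum + grad       # device-local buffer
    update = orthonormalize(momentum)     # orthonormalized direction
    return param - lr * update, momentum

# Toy usage on a single 4x3 weight matrix
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
G = rng.standard_normal((4, 3))
M = np.zeros_like(W)
W_new, M = dion_like_step(W, G, M)
U = orthonormalize(M)
print(np.allclose(U.T @ U, np.eye(3)))  # columns are orthonormal
```

Because every column of the update direction has unit norm regardless of the gradient's scale, devices can apply such updates from locally buffered momentum without first exchanging the full gradient matrices; the actual mechanism by which Dion keeps replicas consistent is described in the paper itself.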