Dion: A Communication-Efficient Optimizer for Large Models

📅 2025-04-07
🤖 AI Summary
To address the communication bottleneck caused by gradient synchronization in distributed training of large-scale models, this paper proposes a communication-efficient optimizer that preserves synchronous training semantics. The method combines three ideas: (1) orthonormalized parameter updates, which decouple the update rule from exchanging full gradients across devices; (2) device-local momentum buffers, which eliminate the need for full gradient exchange; and (3) a sharding strategy that avoids reconstructing large matrices during training. Crucially, the approach retains the synchronous semantics of standard DDP and FSDP, without compromising convergence accuracy or stability, while substantially reducing I/O overhead. Empirical evaluation reports significant throughput improvements in multi-GPU training and scalable performance up to 1,000 GPUs.

📝 Abstract
Training large AI models efficiently requires distributing computation across multiple accelerators, but this often incurs significant communication overhead -- especially during gradient synchronization. We introduce Dion, a communication-efficient optimizer that retains the synchronous semantics of standard distributed training (e.g., DDP, FSDP) while substantially reducing I/O costs. Unlike conventional optimizers that synchronize full gradient matrices, Dion leverages orthonormalized updates with device-local momentum buffers, eliminating the need for full gradient exchange. It further supports an efficient sharding strategy that avoids reconstructing large matrices during training.
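The core update named in the abstract, an orthonormalized update applied to a device-local momentum buffer, can be sketched for a single device as follows. This is an illustrative sketch only, not the paper's exact algorithm: the function names and the hyperparameters (`lr`, `beta`) are my own choices, and the orthonormalization here uses a plain SVD polar factor rather than whatever scheme Dion employs.

```python
import numpy as np

def orthonormal_update(momentum: np.ndarray) -> np.ndarray:
    # Replace the momentum matrix by its nearest orthonormal factor:
    # if M = U S V^T, return U V^T (the polar factor of M).
    u, _, vt = np.linalg.svd(momentum, full_matrices=False)
    return u @ vt

def step(param, grad, momentum, lr=0.02, beta=0.95):
    # Device-local momentum buffer: updated from the local gradient,
    # with no all-reduce of the full gradient matrix.
    momentum = beta * momentum + grad
    param = param - lr * orthonormal_update(momentum)
    return param, momentum
```

Because the update direction is orthonormalized rather than taken verbatim from the gradient, each device only needs enough shared information to agree on the orthonormal factor, which is what opens the door to cheaper communication.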
Problem

Research questions and friction points this paper is trying to address.

High communication overhead from gradient synchronization in distributed large-model training
Conventional optimizers require exchanging full gradient matrices across devices
Naive sharding forces costly reconstruction of large matrices during training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Orthonormalized updates with local momentum
Efficient sharding strategy for large matrices
Reduces I/O costs in distributed training
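One way to see how an orthonormalized update can coexist with sharding is rank-r subspace (power) iteration, which approximates orthonormal left/right factors without forming a full SVD of the matrix. The sketch below is a generic illustration under my own naming (`low_rank_factors`), not Dion's actual sharded procedure.

```python
import numpy as np

def low_rank_factors(m: np.ndarray, r: int, iters: int = 2, seed: int = 0):
    # Approximate the top-r left/right orthonormal factors of m by
    # subspace (power) iteration; requires iters >= 1.
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(m.shape[1], r)))
    p = None
    for _ in range(iters):
        p, _ = np.linalg.qr(m @ q)    # (n, r), orthonormal columns
        q, _ = np.linalg.qr(m.T @ p)  # (d, r), orthonormal columns
    return p, q
```

A rank-r orthonormalized update is then `p @ q.T`; each shard can apply its own row-slice of that product, so the full n-by-d matrix never needs to be reconstructed on any single device.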