🤖 AI Summary
To address the communication bottleneck caused by gradient synchronization in distributed training of large-scale models, this paper proposes Dion, a communication-efficient optimizer that preserves synchronous training semantics. The method rests on three key ideas: (1) orthonormalized parameter updates that decouple inter-device gradient dependencies; (2) device-local momentum buffers, which remove the need to exchange full gradient matrices across devices; and (3) a reconstruction-free scheduling strategy for large-matrix sharding and sparse communication. Crucially, the approach retains the synchronous semantics of standard DDP and FSDP, without compromising convergence accuracy or stability, while substantially reducing communication overhead. Empirical evaluation demonstrates significant throughput improvements in multi-GPU training and scalable performance up to 1,000 GPUs.
📝 Abstract
Training large AI models efficiently requires distributing computation across multiple accelerators, but this often incurs significant communication overhead -- especially during gradient synchronization. We introduce Dion, a communication-efficient optimizer that retains the synchronous semantics of standard distributed training (e.g., DDP, FSDP) while substantially reducing I/O costs. Unlike conventional optimizers that synchronize full gradient matrices, Dion leverages orthonormalized updates with device-local momentum buffers, eliminating the need for full gradient exchange. It further supports an efficient sharding strategy that avoids reconstructing large matrices during training.
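To make the core idea concrete, here is a minimal single-device sketch of an orthonormalized update applied to a device-local momentum buffer. This is an illustration only, not Dion's actual algorithm: the function and parameter names (`orthonormalize`, `dion_like_step`, `lr`, `mu`) are assumptions, and the orthonormalization here uses a full SVD-based polar factor, whereas an efficient implementation would amortize this (e.g., via low-rank approximation) and shard the matrices across devices.

```python
import numpy as np

def orthonormalize(M):
    # Polar factor via SVD: the closest matrix to M with
    # orthonormal columns. (Illustrative only; a practical
    # optimizer would avoid a full SVD per step.)
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def dion_like_step(param, grad, momentum, lr=0.01, mu=0.95):
    """Hedged sketch: momentum stays local to the device, and the
    parameter update uses an orthonormalized direction instead of
    the raw (synchronized) gradient matrix."""
    momentum = mu * momentum + grad       # device-local buffer
    update = orthonormalize(momentum)     # orthonormalized direction
    return param - lr * update, momentum

# Toy usage on a single 4x3 weight matrix
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
G = rng.standard_normal((4, 3))
M = np.zeros_like(W)
W_new, M = dion_like_step(W, G, M)
U = orthonormalize(M)
print(np.allclose(U.T @ U, np.eye(3)))  # columns are orthonormal
```

Because every column of the update direction has unit norm regardless of the gradient's scale, devices can apply such updates from locally buffered momentum without first exchanging the full gradient matrices; the actual mechanism by which Dion keeps replicas consistent is described in the paper itself.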