AdLoCo: adaptive batching significantly improves communications efficiency and convergence for Large Language Models

📅 2025-08-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing methods struggle to efficiently utilize heterogeneous computing resources under dynamic workloads, resulting in high communication overhead, slow convergence, and low throughput in distributed LLM training. This paper proposes AdLoCo, a three-stage co-optimization framework: (1) multi-instance parallel training with knowledge fusion; (2) hardware-aware dynamic adjustment of local batch sizes; and (3) automatic fallback to gradient accumulation when adaptive batch sizes exceed hardware capacity, ensuring training stability. By theoretically modeling the relationship between communication rounds and convergence, AdLoCo rebalances computation and communication, significantly reducing synchronization latency and idle time. Experiments demonstrate that AdLoCo maintains model accuracy while improving training throughput by 2.1×, accelerating convergence by 37%, and reducing communication overhead by 42%.

📝 Abstract
Scaling distributed training of Large Language Models (LLMs) requires not only algorithmic advances but also efficient utilization of heterogeneous hardware resources. While existing methods such as DiLoCo have demonstrated promising results, they often fail to fully exploit computational clusters under dynamic workloads. To address this limitation, we propose a three-stage method that combines Multi-Instance Training (MIT), Adaptive Batched DiLoCo, and a switch-mode mechanism. MIT allows individual nodes to run multiple lightweight training streams with different model instances in parallel and merge them to combine knowledge, increasing throughput and reducing idle time. Adaptive Batched DiLoCo dynamically adjusts local batch sizes to balance computation and communication, substantially lowering synchronization delays. Switch mode further stabilizes training by seamlessly introducing gradient accumulation once adaptive batch sizes grow beyond hardware-friendly limits. Together, these innovations improve both convergence speed and system efficiency. We also provide a theoretical estimate of the number of communications required for the full convergence of a model trained using our method.
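The switch-mode idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the function name `plan_batch`, the hardware limit parameter, and the ceil-division split are all assumptions introduced here for clarity.

```python
# Hedged sketch of "switch mode": when the adaptively chosen local batch
# exceeds what the hardware can hold in one step, fall back to gradient
# accumulation over several micro-batches. All names are illustrative.

def plan_batch(target_batch: int, hw_max_batch: int) -> tuple[int, int]:
    """Split a requested local batch into (micro_batch, accum_steps).

    micro_batch * accum_steps >= target_batch always holds, so the
    effective batch size never drops below what the controller asked for.
    """
    if target_batch <= hw_max_batch:
        return target_batch, 1          # fits in memory: no accumulation
    # ceil-divide: number of micro-steps needed, then per-step size
    accum_steps = -(-target_batch // hw_max_batch)
    micro_batch = -(-target_batch // accum_steps)
    return micro_batch, accum_steps
```

For example, a requested batch of 96 on hardware capped at 64 would run two micro-batches of 48 and average their gradients before the local update.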
Problem

Research questions and friction points this paper is trying to address.

Improving communications efficiency in distributed LLM training
Optimizing heterogeneous hardware utilization under dynamic workloads
Reducing synchronization delays while maintaining convergence stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Instance Training with parallel lightweight streams
Adaptive Batched DiLoCo dynamically adjusts batch sizes
Switch mode stabilizes training via gradient accumulation
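The Multi-Instance Training contribution above, merging parallel lightweight instances to combine their knowledge, can be sketched as uniform parameter averaging. This is only an assumption for illustration; the paper may use a different fusion rule, and plain Python lists stand in for real tensors.

```python
# Hedged sketch: fuse several model instances trained in parallel by
# averaging each parameter across instances. The actual MIT merge rule
# in the paper may differ; this shows the general idea only.

def merge_instances(instances: list[dict[str, list[float]]]) -> dict[str, list[float]]:
    """Average per-parameter values across model instances.

    Each instance maps parameter name -> flat list of floats.
    All instances are assumed to share the same parameter names/shapes.
    """
    n = len(instances)
    merged = {}
    for name in instances[0]:
        # zip aligns the i-th element of this parameter across instances
        cols = zip(*(inst[name] for inst in instances))
        merged[name] = [sum(vals) / n for vals in cols]
    return merged
```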
Nikolay Kutuzov
Moscow Institute of Physics and Technology
Machine learning, convex optimization
Makar Baderko
The School of the Center for Teacher Excellence, Moscow, Russia
Stepan Kulibaba
Research Center of the Artificial Intelligence Institute, Innopolis University
LLM, multi-agent systems, convex optimization
Artem Dzhalilov
Research Center of the Artificial Intelligence Institute, Innopolis University, Innopolis, Russia
Daniel Bobrov
Sirius University of Science and Technology, Sirius, Russia
Maxim Mashtaler
Department Mathematical Foundations of Control, Moscow Institute of Physics and Technology, Moscow, Russia
Alexander Gasnikov
Innopolis University
convex optimization, AI