🤖 AI Summary
This work addresses an inefficiency in existing distributed training: matrix optimizers, which require holistic whole-matrix updates, are incompatible with tensor sharding, leading to redundant synchronization and poor communication efficiency. To resolve this, the authors propose Canzona, a framework that decouples logical optimizer assignment from physical parameter distribution, enabling unified, asynchronous, and load-balanced matrix optimization. Key innovations include an α-Balanced static partitioning strategy, the first to ensure both load balance and update atomicity under data parallelism, and an asynchronous computation pipeline based on Micro-Group scheduling that improves communication efficiency under tensor parallelism. Evaluated on 256 GPUs training Qwen3 models (up to 32B parameters), Canzona achieves a 1.57× speedup in end-to-end iteration time and reduces optimizer step latency by 5.8×.
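The summary's notion of load-balanced, atomicity-preserving partitioning can be illustrated with a minimal sketch. The paper does not specify the α-Balanced algorithm here, so the greedy longest-processing-time heuristic below, along with all function names and the cost model, is purely our illustrative assumption: each matrix is assigned to exactly one data-parallel rank (atomicity), heaviest first onto the lightest rank, and a balance factor measures how close the result is to the ideal.

```python
# Hypothetical sketch (NOT the paper's algorithm): atomic assignment of
# whole parameter matrices to data-parallel ranks with a greedy
# longest-processing-time heuristic to balance per-rank optimizer cost.

def partition_matrices(costs, num_ranks):
    """Assign each matrix (identified by index, weighted by its update
    cost) to one rank. Matrices are never split across ranks."""
    order = sorted(range(len(costs)), key=lambda i: -costs[i])
    assignment = [[] for _ in range(num_ranks)]
    load = [0.0] * num_ranks
    for i in order:
        r = min(range(num_ranks), key=lambda k: load[k])  # lightest rank
        assignment[r].append(i)
        load[r] += costs[i]
    return assignment, load

def balance_factor(load):
    """max rank load / mean rank load; 1.0 means perfect balance."""
    mean = sum(load) / len(load)
    return max(load) / mean

# Illustrative per-matrix update costs (e.g. proportional to matrix size).
costs = [9.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0]
assignment, load = partition_matrices(costs, num_ranks=3)
```

In this toy run every matrix lands on exactly one rank and the balance factor stays close to 1, which is the property the summary attributes to the α-Balanced strategy.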
📝 Abstract
The scaling of Large Language Models (LLMs) drives interest in matrix-based optimizers (e.g., Shampoo, Muon, SOAP) for their convergence efficiency; yet their requirement for holistic updates conflicts with the tensor fragmentation in distributed frameworks like Megatron. Existing solutions are suboptimal: synchronous approaches suffer from computational redundancy, while layer-wise partitioning fails to reconcile this conflict without violating the geometric constraints of efficient communication primitives. To bridge this gap, we propose Canzona, a Unified, Asynchronous, and Load-Balanced framework that decouples logical optimizer assignment from physical parameter distribution. For Data Parallelism, we introduce an α-Balanced Static Partitioning strategy that respects atomicity while neutralizing load imbalance. For Tensor Parallelism, we design an asynchronous compute pipeline utilizing Micro-Group Scheduling to batch fragmented updates and hide reconstruction overhead. Extensive evaluations on the Qwen3 model family (up to 32B parameters) on 256 GPUs demonstrate that our approach preserves the efficiency of established parallel architectures, achieving a 1.57× speedup in end-to-end iteration time and reducing optimizer step latency by 5.8× compared to the baseline.