🤖 AI Summary
Existing Fully Sharded Data Parallel (FSDP) systems are constrained by fixed element- or row-wise sharding formats, which hinder efficient support for structure-aware training techniques, such as block-wise quantization, and for non-element-wise optimizers like Shampoo and Muon, limiting scalability to clusters of tens of thousands of GPUs. This work proposes veScale-FSDP, a redesigned FSDP system that couples a flexible sharding format, RaggedShard, with a structure-aware planning algorithm, natively supporting these advanced methods while preserving FSDP's minimal intrusion on model code. Experimental results show that the proposed approach achieves 5%–66% higher throughput and 16%–30% lower memory usage than existing FSDP systems, enabling efficient training at the scale of tens of thousands of GPUs.
📝 Abstract
Fully Sharded Data Parallel (FSDP), also known as ZeRO, is widely used for training large-scale models, valued for its flexibility and minimal intrusion on model code. However, current FSDP systems struggle with structure-aware training methods (e.g., block-wise quantized training) and with non-element-wise optimizers (e.g., Shampoo and Muon) used in cutting-edge models (e.g., Gemini, Kimi K2). FSDP's fixed element- or row-wise sharding formats conflict with the block-structured computations these methods require. In addition, today's implementations fall short in communication and memory efficiency, limiting scaling to tens of thousands of GPUs. We introduce veScale-FSDP, a redesigned FSDP system that couples a flexible sharding format, RaggedShard, with a structure-aware planning algorithm to deliver both flexibility and performance at scale. veScale-FSDP natively supports the efficient data placement these methods require, empowering block-wise quantization and non-element-wise optimizers. As a result, veScale-FSDP achieves 5–66% higher throughput and 16–30% lower memory usage than existing FSDP systems, while scaling efficiently to tens of thousands of GPUs.
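To build intuition for the core conflict the abstract describes, here is a minimal sketch (not the paper's implementation; the block size, matrix shape, and world size are hypothetical) of why a fixed even row-wise split can break block-wise quantization, and how a "ragged" block-aligned split avoids it:

```python
# Hypothetical setup: a 1024x1024 parameter matrix, quantized in 128x128
# tiles, sharded row-wise across 3 data-parallel ranks.
BLOCK = 128              # quantization tile edge (assumed)
ROWS = 1024
WORLD_SIZE = 3

# Fixed row-wise sharding: split rows as evenly as possible across ranks.
base, rem = divmod(ROWS, WORLD_SIZE)
even_shards = [base + (1 if r < rem else 0) for r in range(WORLD_SIZE)]
# A shard is block-aligned only if its row count is a multiple of BLOCK;
# otherwise some 128x128 quantization tiles straddle rank boundaries.
even_aligned = [n % BLOCK == 0 for n in even_shards]
print(even_shards, even_aligned)   # [342, 341, 341] [False, False, False]

# Ragged, block-aligned alternative: distribute whole tiles per rank,
# accepting uneven shard sizes (the kind of flexibility a RaggedShard-style
# format provides).
tiles = ROWS // BLOCK                           # 8 tiles of 128 rows
tile_base, tile_rem = divmod(tiles, WORLD_SIZE)
ragged_shards = [(tile_base + (1 if r < tile_rem else 0)) * BLOCK
                 for r in range(WORLD_SIZE)]
print(ragged_shards)   # [384, 384, 256] -- uneven, but every tile is intact
```

The even split keeps per-rank memory balanced but forces cross-rank communication to reassemble tiles before quantization; the ragged split trades a small load imbalance for shards that quantizers and block-structured optimizers can process locally.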