🤖 AI Summary
To address the programming complexity and debugging challenges arising from 3D parallelism in large language model (LLM) distributed training, this paper proposes an eager-mode SPMD tensor programming framework. The method introduces two key innovations: (1) a distributed random number generation algorithm compatible with arbitrary tensor sharding schemes, ensuring strict numerical equivalence between multi-device and single-device execution; and (2) reduced PyTorch primitive overhead and optimized communication scheduling, enabling fine-grained tensor parallelism and flexible hybrid parallel strategies. Experimental evaluation demonstrates that, compared to state-of-the-art systems such as TorchTitan, the framework achieves up to a 2.2× speedup in training throughput, reduces code complexity by 78.4%, and maintains end-to-end numerical equivalence, thereby significantly improving debuggability and engineering practicality.
📝 Abstract
Large Language Models (LLMs) have scaled rapidly in size and complexity, requiring increasingly intricate parallelism for distributed training, such as 3D parallelism. This sophistication motivates a shift toward simpler, more debuggable programming paradigms such as Single Program Multiple Data (SPMD). However, SPMD in eager execution introduces two key challenges: ensuring consistency with single-device execution and achieving high performance at scale. In this paper, we introduce veScale, an eager-mode training system that fully embraces the SPMD paradigm to democratize distributed tensor programming. veScale addresses the prevalent issue of inconsistent results in systems like PyTorch by introducing a novel distributed Random Number Generation (RNG) algorithm compatible with arbitrarily sharded operators. veScale also significantly boosts training performance by reducing the overhead of PyTorch primitives and improving communication efficiency. Evaluations show that veScale delivers up to 2.2x speedup over state-of-the-art training systems such as TorchTitan, and cuts code complexity by 78.4%, while preserving single-device-equivalent results.
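The single-device-equivalence property of the distributed RNG can be illustrated with a minimal sketch. Assuming a counter-based generator whose output depends only on the seed and an element's *global* index, each rank can fill its shard independently yet bitwise-match what one device would produce for the full tensor. The names `counter_rng` and `shard_random` below, and the splitmix64-style mixing function, are illustrative assumptions, not veScale's actual algorithm or API.

```python
def counter_rng(seed: int, idx: int) -> float:
    # Hypothetical counter-based PRNG: a splitmix64-style mix of the seed
    # and the *global* element index, mapped to a float in [0, 1). Stands
    # in for a Philox-style generator; not veScale's actual implementation.
    x = (seed * 0x9E3779B97F4A7C15 + idx) & 0xFFFFFFFFFFFFFFFF
    x ^= x >> 30
    x = (x * 0xBF58476D1CE4E5B9) & 0xFFFFFFFFFFFFFFFF
    x ^= x >> 27
    x = (x * 0x94D049BB133111EB) & 0xFFFFFFFFFFFFFFFF
    x ^= x >> 31
    return x / 2**64

def shard_random(seed: int, global_numel: int, rank: int, world_size: int):
    # Each rank fills only its own shard, indexing the stream by global
    # position: no communication and no full-tensor materialization.
    per_rank = global_numel // world_size
    start = rank * per_rank
    return [counter_rng(seed, start + j) for j in range(per_rank)]

SEED, NUMEL, WORLD = 1234, 16, 4
# Single-device reference: one generator sweeping the whole tensor.
reference = [counter_rng(SEED, i) for i in range(NUMEL)]
# Multi-device result: concatenation of independently computed shards.
sharded = [v for r in range(WORLD) for v in shard_random(SEED, NUMEL, r, WORLD)]
assert sharded == reference  # bitwise-identical to single-device execution
```

Because every element's value is a pure function of `(seed, global index)`, any sharding scheme, row, column, or block, reproduces the same tensor, which is the property the paper's RNG design guarantees.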