🤖 AI Summary
To address the programming complexity and debugging challenges arising from 3D parallelism in large language model (LLM) distributed training, this paper proposes an eager-mode SPMD tensor programming framework. The method introduces two key innovations: (1) a distributed random number generation algorithm compatible with arbitrary tensor sharding schemes, ensuring strict numerical equivalence between multi-device and single-device execution; and (2) reduced PyTorch primitive overhead and optimized communication scheduling, enabling fine-grained tensor parallelism and flexible hybrid parallel strategies. Experimental evaluation demonstrates that, compared to state-of-the-art systems such as TorchTitan, the framework achieves up to a 2.2× speedup in training throughput, reduces code complexity by 78.4%, and maintains end-to-end numerical equivalence, thereby significantly improving debuggability and engineering practicality.
📝 Abstract
Large Language Models (LLMs) have scaled rapidly in size and complexity, requiring increasingly intricate parallelism for distributed training, such as 3D parallelism. This sophistication motivates a shift toward simpler, more debuggable programming paradigms such as Single Program Multiple Data (SPMD). However, SPMD in eager execution introduces two key challenges: ensuring consistency with single-device execution and achieving high performance at scale. In this paper, we introduce veScale, an eager-mode training system that fully embraces the SPMD paradigm to democratize distributed tensor programming. veScale addresses the prevalent issue of inconsistent results in systems like PyTorch by introducing a novel distributed Random Number Generation (RNG) algorithm compatible with arbitrarily sharded operators. veScale also significantly boosts training performance by reducing the overhead of PyTorch primitives and improving communication efficiency. Evaluations show that veScale delivers up to 2.2x speedup over state-of-the-art training systems such as TorchTitan, and cuts code complexity by 78.4%, while preserving single-device-equivalent results.
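The single-device-equivalence property of the distributed RNG can be illustrated with a minimal sketch. Assuming a counter-based generator whose output depends only on the seed and an element's *global* index, each rank can fill its shard independently yet bitwise-match what one device would produce for the full tensor. The names `counter_rng` and `shard_random` below, and the splitmix64-style mixing function, are illustrative assumptions, not veScale's actual algorithm or API.

```python
def counter_rng(seed: int, idx: int) -> float:
    # Hypothetical counter-based PRNG: a splitmix64-style mix of the seed
    # and the *global* element index, mapped to a float in [0, 1). Stands
    # in for a Philox-style generator; not veScale's actual implementation.
    x = (seed * 0x9E3779B97F4A7C15 + idx) & 0xFFFFFFFFFFFFFFFF
    x ^= x >> 30
    x = (x * 0xBF58476D1CE4E5B9) & 0xFFFFFFFFFFFFFFFF
    x ^= x >> 27
    x = (x * 0x94D049BB133111EB) & 0xFFFFFFFFFFFFFFFF
    x ^= x >> 31
    return x / 2**64

def shard_random(seed: int, global_numel: int, rank: int, world_size: int):
    # Each rank fills only its own shard, indexing the stream by global
    # position: no communication and no full-tensor materialization.
    per_rank = global_numel // world_size
    start = rank * per_rank
    return [counter_rng(seed, start + j) for j in range(per_rank)]

SEED, NUMEL, WORLD = 1234, 16, 4
# Single-device reference: one generator sweeping the whole tensor.
reference = [counter_rng(SEED, i) for i in range(NUMEL)]
# Multi-device result: concatenation of independently computed shards.
sharded = [v for r in range(WORLD) for v in shard_random(SEED, NUMEL, r, WORLD)]
assert sharded == reference  # bitwise-identical to single-device execution
```

Because every element's value is a pure function of `(seed, global index)`, any sharding scheme, row, column, or block, reproduces the same tensor, which is the property the paper's RNG design guarantees.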