TOAST: Fast and scalable auto-partitioning based on principled static analysis

📅 2025-08-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Automatic partitioning of large models across distributed accelerators faces three key challenges: an exponentially growing search space, high risk of out-of-memory (OOM) failures, and suboptimal or infeasible solutions due to heuristic pruning in existing tools. This paper proposes a novel hybrid approach integrating principled static compilation analysis with Monte Carlo Tree Search (MCTS). First, it models tensor dimension dependencies to precisely identify homogeneous sharding requirements and conflict constraints, thereby constructing a compact and feasible decision space. Second, it employs MCTS to efficiently explore this space while enforcing memory safety and execution efficiency. Evaluated across diverse hardware platforms and model architectures, our fully automated method discovers partitioning schemes that outperform industrial-grade baselines—including TensorFlow/XLA and DeepSpeed—achieving higher scalability, throughput, and zero GPU memory overflow.
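The search strategy described above (a compact, feasibility-checked decision space explored by MCTS under a memory budget) can be illustrated with a toy sketch. Everything here is invented for illustration: the per-operator sharding choices, their cost/memory numbers, the memory budget, and the reward shaping are hypothetical and not taken from the paper.

```python
import math
import random

# Hypothetical toy decision space: each operator picks one sharding choice,
# given as (step_cost, memory) pairs. Numbers are illustrative only.
CHOICES = [
    [(4.0, 2.0), (1.0, 6.0)],   # op0: replicate vs. shard
    [(3.0, 1.0), (1.5, 5.0)],   # op1
    [(2.0, 2.0), (0.5, 4.0)],   # op2
]
MEM_BUDGET = 10.0  # per-device memory limit; over-budget plans are rejected

def evaluate(assignment):
    """Reward for a complete assignment; 0 if it would exceed memory (OOM)."""
    cost = sum(CHOICES[i][c][0] for i, c in enumerate(assignment))
    mem = sum(CHOICES[i][c][1] for i, c in enumerate(assignment))
    if mem > MEM_BUDGET:
        return 0.0                 # infeasible: memory-safety constraint
    return 1.0 / (1.0 + cost)      # cheaper feasible plans score higher

class Node:
    def __init__(self, prefix):
        self.prefix = prefix       # sharding decisions made so far
        self.children = {}         # choice index -> Node
        self.visits = 0
        self.value = 0.0

def mcts(iterations=2000, c=1.4, seed=0):
    rng = random.Random(seed)
    root = Node([])
    for _ in range(iterations):
        node, path = root, [root]
        # Selection: descend with UCB1 while the node is fully expanded.
        while (len(node.prefix) < len(CHOICES)
               and len(node.children) == len(CHOICES[len(node.prefix)])):
            node = max(node.children.values(),
                       key=lambda ch: ch.value / ch.visits
                       + c * math.sqrt(math.log(node.visits) / ch.visits))
            path.append(node)
        # Expansion: try one untried sharding choice at this depth.
        if len(node.prefix) < len(CHOICES):
            untried = [i for i in range(len(CHOICES[len(node.prefix)]))
                       if i not in node.children]
            choice = rng.choice(untried)
            child = Node(node.prefix + [choice])
            node.children[choice] = child
            node = child
            path.append(child)
        # Rollout: complete the assignment randomly, then score it.
        assignment = list(node.prefix)
        while len(assignment) < len(CHOICES):
            assignment.append(rng.randrange(len(CHOICES[len(assignment)])))
        reward = evaluate(assignment)
        # Backpropagation.
        for n in path:
            n.visits += 1
            n.value += reward
    # Extract the most-visited plan.
    plan, node = [], root
    while node.children:
        choice, node = max(node.children.items(), key=lambda kv: kv[1].visits)
        plan.append(choice)
    return plan
```

With enough iterations the search concentrates visits on feasible, low-cost assignments, mirroring how the paper's search avoids OOM plans by construction; the real system additionally shrinks the decision space first via static analysis of tensor-dimension dependencies, which this sketch does not model.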

📝 Abstract
Partitioning large machine learning models across distributed accelerator systems is a complex process, requiring a series of interdependent decisions that are further complicated by internal sharding ambiguities. Consequently, existing auto-partitioners often suffer from out-of-memory errors or are prohibitively slow when exploring the exponentially large space of possible partitionings. To mitigate this, they artificially restrict the search space, but this approach frequently yields infeasible solutions that violate device memory constraints or lead to sub-optimal performance. We propose a system that combines a novel static compiler analysis with a Monte Carlo Tree Search. Our analysis constructs an efficient decision space by identifying (i) tensor dimensions requiring identical sharding, and (ii) partitioning "conflicts" that require resolution. Our system significantly outperforms state-of-the-art industrial methods across diverse hardware platforms and model architectures, discovering previously unknown, superior solutions; the process is fully automated even for large, complex models.
Problem

Research questions and friction points this paper is trying to address.

Optimizing partitioning of large ML models across accelerators
Resolving sharding ambiguities and memory constraint violations
Automating efficient partitioning search without performance compromises
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel static compiler analysis for decision space
Monte Carlo Tree Search for partitioning optimization
Automated conflict resolution and sharding identification
Sami Alabed
Research Scientist, DeepMind
Bayesian Optimization, Distributed Systems, Machine Learning, Reinforcement Learning
Dominik Grewe
Isomorphic Labs, London, UK
Norman A. Rink
Google DeepMind, London, UK
Timur Sitdikov
Google DeepMind, London, UK
Agnieszka Swietlik
Google DeepMind, London, UK
Dimitrios Vytiniotis
DeepMind
programming languages, type systems, functional programming, compilers
Daniel Belov
Google DeepMind, London, UK