TOAST: Fast and scalable auto-partitioning based on principled static analysis

📅 2025-08-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Automatic partitioning of large models across distributed accelerators faces three key challenges: an exponentially growing search space, high risk of out-of-memory (OOM) failures, and suboptimal or infeasible solutions due to heuristic pruning in existing tools. This paper proposes a novel hybrid approach integrating principled static compilation analysis with Monte Carlo Tree Search (MCTS). First, it models tensor dimension dependencies to precisely identify homogeneous sharding requirements and conflict constraints, thereby constructing a compact and feasible decision space. Second, it employs MCTS to efficiently explore this space while enforcing memory safety and execution efficiency. Evaluated across diverse hardware platforms and model architectures, our fully automated method discovers partitioning schemes that outperform industrial-grade baselines—including TensorFlow/XLA and DeepSpeed—achieving higher scalability, throughput, and zero GPU memory overflow.
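The search strategy described above (a compact, feasibility-checked decision space explored by MCTS under a memory budget) can be illustrated with a toy sketch. Everything here is invented for illustration: the per-operator sharding choices, their cost/memory numbers, the memory budget, and the reward shaping are hypothetical and not taken from the paper.

```python
import math
import random

# Hypothetical toy decision space: each operator picks one sharding choice,
# given as (step_cost, memory) pairs. Numbers are illustrative only.
CHOICES = [
    [(4.0, 2.0), (1.0, 6.0)],   # op0: replicate vs. shard
    [(3.0, 1.0), (1.5, 5.0)],   # op1
    [(2.0, 2.0), (0.5, 4.0)],   # op2
]
MEM_BUDGET = 10.0  # per-device memory limit; over-budget plans are rejected

def evaluate(assignment):
    """Reward for a complete assignment; 0 if it would exceed memory (OOM)."""
    cost = sum(CHOICES[i][c][0] for i, c in enumerate(assignment))
    mem = sum(CHOICES[i][c][1] for i, c in enumerate(assignment))
    if mem > MEM_BUDGET:
        return 0.0                 # infeasible: memory-safety constraint
    return 1.0 / (1.0 + cost)      # cheaper feasible plans score higher

class Node:
    def __init__(self, prefix):
        self.prefix = prefix       # sharding decisions made so far
        self.children = {}         # choice index -> Node
        self.visits = 0
        self.value = 0.0

def mcts(iterations=2000, c=1.4, seed=0):
    rng = random.Random(seed)
    root = Node([])
    for _ in range(iterations):
        node, path = root, [root]
        # Selection: descend with UCB1 while the node is fully expanded.
        while (len(node.prefix) < len(CHOICES)
               and len(node.children) == len(CHOICES[len(node.prefix)])):
            node = max(node.children.values(),
                       key=lambda ch: ch.value / ch.visits
                       + c * math.sqrt(math.log(node.visits) / ch.visits))
            path.append(node)
        # Expansion: try one untried sharding choice at this depth.
        if len(node.prefix) < len(CHOICES):
            untried = [i for i in range(len(CHOICES[len(node.prefix)]))
                       if i not in node.children]
            choice = rng.choice(untried)
            child = Node(node.prefix + [choice])
            node.children[choice] = child
            node = child
            path.append(child)
        # Rollout: complete the assignment randomly, then score it.
        assignment = list(node.prefix)
        while len(assignment) < len(CHOICES):
            assignment.append(rng.randrange(len(CHOICES[len(assignment)])))
        reward = evaluate(assignment)
        # Backpropagation.
        for n in path:
            n.visits += 1
            n.value += reward
    # Extract the most-visited plan.
    plan, node = [], root
    while node.children:
        choice, node = max(node.children.items(), key=lambda kv: kv[1].visits)
        plan.append(choice)
    return plan
```

With enough iterations the search concentrates visits on feasible, low-cost assignments, mirroring how the paper's search avoids OOM plans by construction; the real system additionally shrinks the decision space first via static analysis of tensor-dimension dependencies, which this sketch does not model.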

📝 Abstract
Partitioning large machine learning models across distributed accelerator systems is a complex process, requiring a series of interdependent decisions that are further complicated by internal sharding ambiguities. Consequently, existing auto-partitioners often suffer from out-of-memory errors or are prohibitively slow when exploring the exponentially large space of possible partitionings. To mitigate this, they artificially restrict the search space, but this approach frequently yields infeasible solutions that violate device memory constraints or lead to sub-optimal performance. We propose a system that combines a novel static compiler analysis with a Monte Carlo Tree Search. Our analysis constructs an efficient decision space by identifying (i) tensor dimensions requiring identical sharding, and (ii) partitioning "conflicts" that require resolution. Our system significantly outperforms state-of-the-art industrial methods across diverse hardware platforms and model architectures, discovering previously unknown, superior solutions; the process is fully automated even for large, complex models.
Problem

Research questions and friction points this paper is trying to address.

Optimizing partitioning of large ML models across accelerators
Resolving sharding ambiguities and memory constraint violations
Automating efficient partitioning search without performance compromises
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel static compiler analysis for decision space
Monte Carlo Tree Search for partitioning optimization
Automated conflict resolution and sharding identification
Sami Alabed
Research Scientist, DeepMind
Bayesian Optimization, Distributed Systems, Machine Learning, Reinforcement Learning
Dominik Grewe
Isomorphic Labs, London, UK
Norman A. Rink
Google DeepMind, London, UK
Timur Sitdikov
Google DeepMind, London, UK
Agnieszka Swietlik
Google DeepMind, London, UK
Dimitrios Vytiniotis
DeepMind
programming languages, type systems, functional programming, compilers
Daniel Belov
Google DeepMind, London, UK