STAGE: A Symbolic Tensor grAph GEnerator for distributed AI system co-design

📅 2025-11-13
🤖 AI Summary
Large-scale AI systems lack lightweight, scalable distributed workload modeling mechanisms, hindering pre-deployment optimization and design-space exploration for LLMs. Method: This paper introduces the first scalable symbolic tensor graph generation framework, integrating tensor-level execution modeling with symbolic graph representation and a configurable parallelism policy generator. It supports simulation of distributed training across up to 32K GPUs, accurately capturing compute, memory, and communication behavior while remaining compatible with diverse parallelism paradigms (e.g., tensor, pipeline, data parallelism). Contribution/Results: The framework generates high-fidelity LLM execution traces with <5% error relative to real-system measurements—significantly outperforming existing black-box or coarse-grained modeling approaches. The open-sourced implementation establishes a new paradigm for co-design and performance prediction of LLM systems.
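The core idea the summary describes, keeping tensor shapes symbolic so one graph can be re-bound to any GPU count or model size, can be sketched as follows. This is an illustrative sketch only, not the actual STAGE implementation; the `SymTensor` class and the dimension-expression convention are assumptions made for this example.

```python
# Illustrative sketch only -- NOT the STAGE API. Dimensions are kept as
# symbolic expressions (strings) and evaluated only once a concrete
# configuration (hidden size H, tensor-parallel degree TP, ...) is bound.
from dataclasses import dataclass

@dataclass(frozen=True)
class SymTensor:
    """A tensor-graph node whose dims are symbolic expressions or ints."""
    name: str
    dims: tuple

    def numel(self, env):
        # Bind symbols (e.g. H=4096, TP=8) and evaluate each dimension.
        n = 1
        for d in self.dims:
            n *= eval(d, {}, env) if isinstance(d, str) else d
        return n

    def nbytes(self, env, elem_size=2):  # fp16 elements by default
        return self.numel(env) * elem_size

# A column-sharded MLP weight under tensor parallelism of degree TP:
w_shard = SymTensor("mlp.w1.shard", ("H", "4*H//TP"))

# The same symbolic node answers memory questions at any scale:
print(w_shard.nbytes({"H": 4096, "TP": 8}))    # per-GPU bytes, small config
print(w_shard.nbytes({"H": 16384, "TP": 64}))  # per-GPU bytes, large config
```

Because nothing in the graph is tied to a measured trace, the same structure can be instantiated for system sizes that no physical cluster yet provides, which is the property the summary contrasts with trace collection on real hardware.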

📝 Abstract
Optimizing the performance of large language models (LLMs) on large-scale AI training and inference systems requires a scalable and expressive mechanism to model distributed workload execution. Such modeling is essential for pre-deployment system-level optimizations (e.g., parallelization strategies) and design-space explorations. While recent efforts have proposed collecting execution traces from real systems, access to large-scale infrastructure remains limited to major cloud providers. Moreover, traces obtained from existing platforms cannot be easily adapted to study future larger-scale system configurations. We introduce Symbolic Tensor grAph GEnerator (STAGE), a framework that synthesizes high-fidelity execution traces to accurately model LLM workloads. STAGE supports a comprehensive set of parallelization strategies, allowing users to systematically explore a wide spectrum of LLM architectures and system configurations. STAGE demonstrates its scalability by synthesizing high-fidelity LLM traces spanning over 32K GPUs, while preserving tensor-level accuracy in compute, memory, and communication. STAGE is publicly available to facilitate further research in distributed machine learning systems: https://github.com/astra-sim/symbolic_tensor_graph
Problem

Research questions and friction points this paper is trying to address.

Generating scalable execution traces for distributed LLM workload modeling
Enabling system-level optimizations without requiring physical infrastructure access
Supporting exploration of diverse parallelization strategies across system configurations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Symbolic Tensor grAph GEnerator (STAGE) synthesizes execution traces
Supports comprehensive parallelization strategies for LLMs
Scalable framework modeling 32K GPU configurations accurately
Changhai Man, Georgia Institute of Technology
Joongun Park, Georgia Institute of Technology
Hanjiang Wu, Georgia Institute of Technology
Huan Xu, Georgia Institute of Technology
Srinivas Sridharan, Nvidia Inc.
Tushar Krishna, Associate Professor, Georgia Tech (Computer Architecture, Interconnection Networks, Network-on-Chip, Deep Learning Accelerators)