🤖 AI Summary
Large-scale AI systems lack lightweight, scalable mechanisms for modeling distributed workload execution, hindering pre-deployment optimization and design-space exploration for LLMs. Method: This paper introduces the first scalable symbolic tensor graph generation framework, integrating tensor-level execution modeling with a symbolic graph representation and a configurable parallelism policy generator. It supports simulation of distributed training across more than 32K GPUs, accurately capturing compute, memory, and communication behavior while remaining compatible with diverse parallelism paradigms (e.g., tensor, pipeline, and data parallelism). Contribution/Results: The framework generates high-fidelity LLM execution traces with <5% error relative to real-system measurements—significantly outperforming existing black-box or coarse-grained modeling approaches. The open-sourced implementation establishes a new paradigm for co-design and performance prediction of LLM systems.
📝 Abstract
Optimizing the performance of large language models (LLMs) on large-scale AI training and inference systems requires a scalable and expressive mechanism to model distributed workload execution. Such modeling is essential for pre-deployment system-level optimizations (e.g., parallelization strategies) and design-space exploration. While recent efforts have proposed collecting execution traces from real systems, access to large-scale infrastructure remains limited to major cloud providers. Moreover, traces obtained from existing platforms cannot be easily adapted to study future, larger-scale system configurations. We introduce the Symbolic Tensor grAph GEnerator (STAGE), a framework that synthesizes high-fidelity execution traces to accurately model LLM workloads. STAGE supports a comprehensive set of parallelization strategies, allowing users to systematically explore a wide spectrum of LLM architectures and system configurations. STAGE demonstrates its scalability by synthesizing high-fidelity LLM traces spanning over 32K GPUs, while preserving tensor-level accuracy in compute, memory, and communication. STAGE is publicly available to facilitate further research in distributed machine learning systems: https://github.com/astra-sim/symbolic_tensor_graph
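To make the symbolic-graph idea concrete, here is a minimal sketch of how tensor shapes can be kept as symbolic expressions over workload and parallelism parameters (e.g., hidden size `H`, tensor-parallel degree `TP`), so one graph can be re-evaluated at any scale without tracing a real system. All class and parameter names here are illustrative assumptions, not STAGE's actual API.

```python
from dataclasses import dataclass

@dataclass
class SymbolicTensorOp:
    """Hypothetical graph node: shape dims are symbolic strings like 'H' or '4*H/TP'."""
    name: str
    shape: tuple  # symbolic dimension expressions

    def concrete_shape(self, params):
        # Bind symbolic dims to concrete values for one system configuration.
        return tuple(int(eval(dim, {}, params)) for dim in self.shape)

    def nbytes(self, params, dtype_bytes=2):
        # Per-GPU memory footprint under the given binding (fp16 by default).
        n = 1
        for d in self.concrete_shape(params):
            n *= d
        return n * dtype_bytes

# One transformer MLP weight, sharded along its output dim by tensor parallelism:
w = SymbolicTensorOp("mlp_w1", ("H", "4*H/TP"))
cfg = {"H": 4096, "TP": 8}
print(w.concrete_shape(cfg))  # (4096, 2048)
print(w.nbytes(cfg))          # 16777216 bytes per GPU in fp16
```

Re-running the same graph with a different binding (say `TP=16` for a larger cluster) immediately yields the new per-GPU shapes and footprints, which is the kind of scale extrapolation that fixed traces from real systems cannot provide.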