🤖 AI Summary
Large-scale AI systems lack lightweight, scalable mechanisms for modeling distributed workload execution, hindering pre-deployment optimization and design-space exploration for LLMs. Method: This paper introduces the first scalable symbolic tensor graph generation framework, integrating tensor-level execution modeling with a symbolic graph representation and a configurable parallelism policy generator. It supports simulation of distributed training across more than 32K GPUs, accurately capturing compute, memory, and communication behavior while remaining compatible with diverse parallelism paradigms (e.g., tensor, pipeline, and data parallelism). Contribution/Results: The framework generates high-fidelity LLM execution traces with <5% error relative to real-system measurements—significantly outperforming existing black-box or coarse-grained modeling approaches. The open-sourced implementation establishes a new paradigm for co-design and performance prediction of LLM systems.
📝 Abstract
Optimizing the performance of large language models (LLMs) on large-scale AI training and inference systems requires a scalable and expressive mechanism to model distributed workload execution. Such modeling is essential for pre-deployment system-level optimizations (e.g., parallelization strategies) and design-space exploration. While recent efforts have proposed collecting execution traces from real systems, access to large-scale infrastructure remains limited to major cloud providers. Moreover, traces obtained from existing platforms cannot be easily adapted to study future, larger-scale system configurations. We introduce the Symbolic Tensor grAph GEnerator (STAGE), a framework that synthesizes high-fidelity execution traces to accurately model LLM workloads. STAGE supports a comprehensive set of parallelization strategies, allowing users to systematically explore a wide spectrum of LLM architectures and system configurations. STAGE demonstrates its scalability by synthesizing high-fidelity LLM traces spanning over 32K GPUs, while preserving tensor-level accuracy in compute, memory, and communication. STAGE is publicly available to facilitate further research in distributed machine learning systems: https://github.com/astra-sim/symbolic_tensor_graph
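To make the symbolic-graph idea concrete, here is a minimal sketch of how tensor shapes can be kept as symbolic expressions over workload and parallelism parameters (e.g., hidden size `H`, tensor-parallel degree `TP`), so one graph can be re-evaluated at any scale without tracing a real system. All class and parameter names here are illustrative assumptions, not STAGE's actual API.

```python
from dataclasses import dataclass

@dataclass
class SymbolicTensorOp:
    """Hypothetical graph node: shape dims are symbolic strings like 'H' or '4*H/TP'."""
    name: str
    shape: tuple  # symbolic dimension expressions

    def concrete_shape(self, params):
        # Bind symbolic dims to concrete values for one system configuration.
        return tuple(int(eval(dim, {}, params)) for dim in self.shape)

    def nbytes(self, params, dtype_bytes=2):
        # Per-GPU memory footprint under the given binding (fp16 by default).
        n = 1
        for d in self.concrete_shape(params):
            n *= d
        return n * dtype_bytes

# One transformer MLP weight, sharded along its output dim by tensor parallelism:
w = SymbolicTensorOp("mlp_w1", ("H", "4*H/TP"))
cfg = {"H": 4096, "TP": 8}
print(w.concrete_shape(cfg))  # (4096, 2048)
print(w.nbytes(cfg))          # 16777216 bytes per GPU in fp16
```

Re-running the same graph with a different binding (say `TP=16` for a larger cluster) immediately yields the new per-GPU shapes and footprints, which is the kind of scale extrapolation that fixed traces from real systems cannot provide.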