A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM

📅 2026-05-15
🤖 AI Summary
Training large language models relies on clusters of thousands of GPUs, which makes developing and debugging training frameworks costly and makes production-scale behavior hard to reproduce. This work proposes PrismLLM, a framework that uses a slicing-based approach to build a high-fidelity execution graph capturing the computation, communication, and dependencies of the target scale. It then performs hybrid emulation: selected ranks execute the original program while the remaining ranks are replayed as virtual participants. PrismLLM can emulate clusters of up to 8,192 GPUs using fewer than 1% of the physical GPUs required by the original deployment, with 0.58% average error in iteration time and less than 0.01% error in peak GPU memory usage. This lets engineers diagnose failures and evaluate optimizations without frequent or exclusive access to a production-scale cluster.
📝 Abstract
Large language model (LLM) training today runs on clusters spanning thousands of GPUs. While this scale enables rapid model advances, developing, debugging, and performance-tuning the training framework inevitably becomes complex and costly. This is because engineers often need to reproduce production behaviors to diagnose failures or evaluate optimizations, thereby demanding frequent and even exclusive access to production-scale clusters, which becomes increasingly hard given that the majority of GPUs are already committed to production workloads. Simulation relies on complex performance models that are difficult to maintain, and downscaled experiments often fail to capture scale-dependent behaviors. We present PrismLLM to decouple large-scale execution from the need to access large clusters, enabling engineers to run and observe ranks of interest under faithful large-scale behavior using only a few GPUs. PrismLLM constructs a high-fidelity execution graph via a slicing-based approach that captures computation, communication, and dependencies of the target scale. Then, PrismLLM performs hybrid emulation where selected ranks execute the original program while the remaining ranks are replayed as virtual participants. Experiments on large-scale LLM training workloads show that PrismLLM accurately reproduces performance and memory behavior, achieving only 0.58% average error in iteration time and less than 0.01% error in peak GPU memory usage. PrismLLM can emulate clusters of up to 8192 GPUs using fewer than 1% of the physical GPUs required by the original deployment.
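The hybrid emulation idea in the abstract can be illustrated with a toy sketch. This is not PrismLLM's implementation or API: the `Op` record, the `emulate` function, the `run_real` callback, and all durations are hypothetical names and made-up numbers chosen only for illustration. The sketch walks a small execution graph in dependency order; operations on "real" ranks get their duration from a callback standing in for actual execution, while operations on the remaining ranks reuse a previously recorded duration, as a virtual participant would.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class Op:
    """One node of a toy execution graph (hypothetical structure)."""
    name: str
    rank: int
    recorded_s: float                      # duration taken from an assumed trace
    deps: list = field(default_factory=list)


def emulate(ops, real_ranks, run_real):
    """Return the finish time of every op.

    Ops on ranks in `real_ranks` are timed by `run_real(op)`;
    all other ops are replayed using their recorded duration.
    """
    by_name = {op.name: op for op in ops}
    order = TopologicalSorter({op.name: op.deps for op in ops}).static_order()
    finish = {}      # op name -> finish time
    rank_free = {}   # rank -> time at which the rank becomes idle
    for name in order:
        op = by_name[name]
        # An op starts once its rank is idle and all dependencies are done.
        start = max([rank_free.get(op.rank, 0.0)] + [finish[d] for d in op.deps])
        duration = run_real(op) if op.rank in real_ranks else op.recorded_s
        finish[name] = start + duration
        rank_free[op.rank] = finish[name]
    return finish


# Two ranks: a forward pass each, then a collective that waits on both.
ops = [
    Op("fwd0", rank=0, recorded_s=1.0),
    Op("fwd1", rank=1, recorded_s=1.5),
    Op("allreduce0", rank=0, recorded_s=0.5, deps=["fwd0", "fwd1"]),
    Op("allreduce1", rank=1, recorded_s=0.5, deps=["fwd0", "fwd1"]),
]

# Stand-in for real execution on rank 0 (made-up measurements).
measured = {"fwd0": 1.25, "allreduce0": 0.25}
finish = emulate(ops, real_ranks={0}, run_real=lambda op: measured[op.name])
iteration_time = max(finish.values())
```

In this toy setup only rank 0 would need a physical GPU. Rank 1 exists purely as replayed timings, yet it still delays the collective on rank 0, which is the kind of scale-dependent interaction that a downscaled run without virtual participants would miss.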
Problem

Research questions and friction points this paper is trying to address.

large language model
training emulation
GPU cluster
scale-dependent behavior
performance debugging
Innovation

Methods, ideas, or system contributions that make the work stand out.

PrismLLM
LLM training emulation
execution graph slicing
hybrid emulation
scale-decoupled debugging