🤖 AI Summary
Large-scale distributed LLM training suffers from unpredictable performance: existing analytical models generalize poorly across models and configurations, and simulation-based approaches incur high overhead.
Method: This paper proposes Lumos, a trace-driven, end-to-end performance modeling framework. It leverages real execution traces to enable operator-level time decomposition, dynamic characterization of communication-computation overlap, and configuration-aware interpolation—enabling prediction of training latency and runtime behavior for unseen deployment configurations, without retraining or expensive simulation.
Contribution/Results: Evaluated on a 512-GPU H100 cluster across multiple GPT-3 variants, the framework replays execution time with a mean error of only 3.3% and estimates the performance of a new configuration in under one second, enabling efficient exploration of model and deployment configurations at low computational cost.
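The core idea of trace replay can be illustrated with a minimal sketch: treat each traced operator as an event with a duration, a stream (compute or communication), and dependencies, then replay the streams concurrently to recover end-to-end latency including overlap. This is an illustrative toy model, not Lumos's actual API; the `Op` structure, the two-stream simplification, and all names here are assumptions for exposition.

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str
    duration_ms: float
    stream: str                      # "compute" or "comm" — a simplified two-stream model
    depends_on: list = field(default_factory=list)  # names of producer ops

def replay(trace):
    """Replay a toy operator trace, overlapping compute and comm streams.

    Each stream executes its ops in trace order; an op additionally
    stalls until every op it depends on has finished. Returns the
    makespan (end-to-end latency) in milliseconds.
    """
    finish = {}        # op name -> finish time
    stream_free = {}   # stream  -> time the stream next becomes idle
    for op in trace:
        ready = max((finish[d] for d in op.depends_on), default=0.0)
        start = max(stream_free.get(op.stream, 0.0), ready)
        finish[op.name] = start + op.duration_ms
        stream_free[op.stream] = finish[op.name]
    return max(finish.values())

trace = [
    Op("fwd0", 2.0, "compute"),
    Op("allreduce0", 3.0, "comm", ["fwd0"]),
    Op("fwd1", 4.0, "compute"),               # overlaps with allreduce0
    Op("opt_step", 1.0, "compute", ["allreduce0"]),
]
print(replay(trace))  # 7.0 — shorter than the 10.0 ms serial sum, due to overlap
```

Because the communication op runs concurrently with `fwd1`, the replayed latency (7.0 ms) is below the 10.0 ms sum of durations; estimating a new configuration then amounts to rescaling operator durations and replaying, rather than re-running or re-simulating the job.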
📝 Abstract
Training LLMs in distributed environments presents significant challenges due to the complexity of model execution, deployment systems, and the vast space of configurable strategies. Although various optimization techniques exist, achieving high efficiency in practice remains difficult. Accurate performance models that effectively characterize and predict a model's behavior are essential for guiding optimization efforts and system-level studies. We propose Lumos, a trace-driven performance modeling and estimation toolkit for large-scale LLM training, designed to accurately capture and predict the execution behaviors of modern LLMs. We evaluate Lumos on a production ML cluster with up to 512 NVIDIA H100 GPUs using various GPT-3 variants, demonstrating that it can replay execution time with an average error of just 3.3%, along with other runtime details, across different models and configurations. Additionally, we validate its ability to estimate performance for new setups from existing traces, facilitating efficient exploration of model and deployment configurations.