🤖 AI Summary
Performance estimation for ML systems on large-scale GPU clusters faces three challenges: modeling complexity, high overhead, and poor generalizability. This paper proposes a lightweight, high-fidelity live performance estimation framework that requires only minimal modifications to models and frameworks. It uses dynamic binary instrumentation to intercept GPU operations and combines GPU semantic modeling, event-driven scheduling, and asynchronous I/O simulation, achieving, for the first time, tight integration between live-system execution and network-level simulation. The approach eliminates reliance on manually labeled data or static workloads. Running on a single GPU, it matches the accuracy of state-of-the-art workload simulation (error < 8%), reduces human effort by 90%, and enables sub-second performance prediction for thousand-GPU cluster configurations. The framework improves generality, usability, and scalability over prior estimation methods.
📝 Abstract
To accommodate ever-increasing model complexity, modern machine learning (ML) systems have to scale to large GPU clusters. Changes in ML model architecture, ML system implementation, and cluster configuration can significantly affect overall ML system performance. However, quantifying the performance impact before deployment is challenging. Existing performance estimation methods use performance modeling or static workload simulation. These techniques are not general: they require significant human effort and computation capacity to generate training data or a workload. It is also difficult to adapt ML systems to use these techniques. This paper introduces Phantora, a live GPU cluster simulator for performance estimation. Phantora runs minimally modified ML models and frameworks, intercepting and simulating GPU-related operations to enable high-fidelity performance estimation. Phantora overcomes several research challenges in integrating an event-driven network simulator with live system execution, and introduces a set of techniques to improve simulation speed, scalability, and accuracy. Our evaluation results show that Phantora can deliver similar estimation accuracy to the state-of-the-art workload simulation approach with only one GPU, while reducing human effort and increasing generalizability.
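To make the idea of intercepting GPU operations and charging their cost to a simulated clock more concrete, here is a minimal Python sketch. It is not Phantora's implementation or API. Every name in it (`SimulatedRank`, `launch_kernel`, `all_reduce`, `simulate_step`) is hypothetical, the per-kernel times are assumed to come from some offline profile, and the flat latency-plus-bandwidth cost for communication stands in for a real event-driven network simulator.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedRank:
    """One simulated GPU worker with its own virtual clock, in seconds."""
    rank: int
    clock: float = 0.0
    trace: list = field(default_factory=list)

    def launch_kernel(self, name: str, est_seconds: float) -> None:
        # Stand-in for an intercepted GPU call: nothing runs on a device,
        # the estimated duration is simply added to this rank's virtual time.
        self.clock += est_seconds
        self.trace.append((name, self.clock))


def all_reduce(ranks, nbytes, bandwidth_bytes_per_s, latency_s):
    """Toy collective: begins once the slowest rank arrives, then costs
    a fixed latency plus nbytes / bandwidth (an assumed, simplified model)."""
    start = max(r.clock for r in ranks)
    finish = start + latency_s + nbytes / bandwidth_bytes_per_s
    for r in ranks:
        r.clock = finish  # every rank leaves the collective together
        r.trace.append(("all_reduce", finish))
    return finish


def simulate_step(compute_seconds, grad_bytes, bandwidth_bytes_per_s, latency_s):
    """Estimate one data-parallel training step; returns virtual finish time."""
    ranks = [SimulatedRank(i) for i in range(len(compute_seconds))]
    for r in ranks:
        r.launch_kernel("forward_backward", compute_seconds[r.rank])
    return all_reduce(ranks, grad_bytes, bandwidth_bytes_per_s, latency_s)


if __name__ == "__main__":
    # Two ranks with slightly different compute times, 1 MB of gradients,
    # 1 GB/s links, 0.5 ms latency (all values are illustrative).
    step = simulate_step([0.010, 0.012], 1_000_000, 1e9, 0.0005)
    print(f"estimated step time: {step * 1e3:.2f} ms")
```

In this toy model the step time is set by the slowest rank plus the communication cost, which is the kind of interaction between compute and network timing that a live cluster simulator has to capture at much finer granularity.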