🤖 AI Summary
Existing methods for predicting distributed training time commonly overlook the impact of floating-point precision, particularly mixed precision, leading to prediction errors as high as 147.85% in mean absolute percentage error (MAPE). This work proposes the first precision-aware training time predictor, which overcomes the limitations of conventional static computation graph approaches by explicitly modeling the dynamic computational and communication overheads under varying precision configurations. By integrating precision-aware modeling, distributed system analysis, and machine learning techniques, the proposed method achieves a MAPE of 9.8% across diverse precision settings, substantially improving prediction accuracy and robustness in cross-precision scenarios.
📝 Abstract
Accurate prediction of training time in distributed deep learning is crucial for resource allocation, cost estimation, and job scheduling. We observe that the floating-point precision setting is a key determinant of training time: across precision settings, training time varies by up to ~2.4x relative to its minimum. However, existing studies on distributed training time prediction rely on static model computation graphs that do not capture precision variations, including mixed precision. In our experiments, predicting training time without considering precision results in significant errors, reaching up to 147.85% in mean absolute percentage error (MAPE). To address this issue, we propose a precision-aware distributed training time predictor that achieves robust accuracy across diverse precision settings, including mixed precision, with 9.8% MAPE.
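For readers unfamiliar with the metric, the following is a minimal sketch of how MAPE is conventionally computed. The function name `mape` and all timing values are hypothetical and chosen only for illustration; they are not taken from the paper's experiments, and the paper's exact evaluation protocol may differ.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned in percent.

    Standard definition: 100 * mean(|predicted - actual| / |actual|).
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - a) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical measured vs. predicted training times (seconds).
measured = [1.0, 2.0, 4.0]
predicted = [1.1, 1.8, 4.4]
close_error = mape(measured, predicted)  # each prediction is off by 10%

# Hypothetical precision-oblivious case: the predictor returns the time of
# the slowest precision setting (2.4 s) while the job actually runs in the
# fastest one (1.0 s), i.e. a 2.4x gap between the two.
oblivious_error = mape([1.0], [2.4])

print(f"close predictions:   {close_error:.1f}% MAPE")
print(f"precision-oblivious: {oblivious_error:.1f}% MAPE")
```

Because the error is normalized by the measured time, a predictor that ignores precision and overestimates a fast configuration by a 2.4x factor already incurs an error above 100% on that sample. This is illustrative arithmetic only, not a reproduction of the reported 147.85%.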