Training Time Prediction for Mixed Precision-based Distributed Training

📅 2026-04-17

🤖 AI Summary
Existing methods for predicting distributed training time commonly overlook the impact of floating-point precision, particularly mixed precision, which leads to prediction errors of up to 147.85% in mean absolute percentage error (MAPE). This work proposes the first precision-aware training time predictor, which overcomes the limitations of conventional static computation graph approaches by explicitly modeling how computation and communication overheads change under different precision configurations. By integrating precision-aware modeling, distributed system analysis, and machine learning techniques, the proposed method achieves a MAPE of 9.8% across diverse precision settings, substantially improving prediction accuracy and robustness in cross-precision scenarios.

📝 Abstract
Accurate prediction of training time in distributed deep learning is crucial for resource allocation, cost estimation, and job scheduling. We observe that the floating-point precision setting is a key determinant of training time, with training time varying by ~2.4x over its minimum across settings. However, existing studies on distributed training time prediction rely on static model computation graphs that do not capture precision variations, including mixed precision. In our experiments, predicting training time without considering precision results in significant errors, reaching up to 147.85% in mean absolute percentage error (MAPE). To address this issue, we propose a precision-aware distributed training time predictor that achieves robust accuracy across diverse precision settings, including mixed precision, with 9.8% MAPE.
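
Both headline figures above are reported as MAPE, so a small sketch of how that metric is computed may help when reading them. This is a generic illustration of the standard definition, not the authors' evaluation code; the function name `mape` and all timing values are hypothetical and were chosen only to mirror the ~2.4x spread mentioned in the abstract.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    actual and predicted are equal-length sequences of training times
    (for example, seconds per iteration). Actual values must be nonzero.
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("actual values must be nonzero")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical measured times for one model under two precision settings.
# The slower run takes 2.4x as long as the faster one (illustrative only).
actual_times = [100.0, 240.0]

# A precision-agnostic predictor returns the same estimate for both runs.
agnostic_prediction = [240.0, 240.0]

# A hypothetical precision-aware predictor that is within 10% on each run.
aware_prediction = [110.0, 216.0]

print(f"precision-agnostic MAPE: {mape(actual_times, agnostic_prediction):.1f}%")
print(f"precision-aware MAPE:    {mape(actual_times, aware_prediction):.1f}%")
```

Because each error is normalized by the actual time, overestimating a fast low-precision run is penalized heavily, which is one way a single precision-agnostic estimate can produce a MAPE far above 100%.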

Problem

Research questions and friction points this paper is trying to address.

training time prediction
mixed precision
distributed training
floating-point precision
prediction error

Innovation

Methods, ideas, or system contributions that make the work stand out.

mixed precision
training time prediction
distributed deep learning
precision-aware modeling
resource allocation