🤖 AI Summary
To address high GPU costs, volatile spot-instance prices and availability, and stringent deadline constraints in large language model fine-tuning, this paper proposes a deadline-aware online scheduling framework. Methodologically, it first reveals the short-term predictability of spot-market prices and resource availability; designs a prediction-driven resource allocation algorithm based on commitment levels; incorporates a prediction-free fallback mechanism for robustness; and develops an adaptive policy selection module with an $O(\sqrt{T})$ regret bound. The framework integrates integer programming modeling, error-sensitivity analysis, and heterogeneous instance co-scheduling to jointly optimize cost, timeliness, and reliability under dynamic market conditions. Experiments demonstrate up to 54.8% utility improvement over baseline methods, with a performance bound that tightens as prediction error decreases.
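To make the integer programming modeling concrete, here is a hypothetical toy sketch (not the paper's actual formulation): choose integer counts of spot and on-demand instances per time slot to meet a total work requirement before a deadline, subject to per-slot spot availability caps, minimizing cost. All prices, availabilities, and caps below are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Toy horizon of T=4 slots; all numbers are illustrative, not from the paper.
T = 4
spot_price = np.array([0.30, 0.25, 0.90, 0.40])  # $/instance-hour per slot
spot_avail = np.array([2, 3, 0, 2])              # max spot instances per slot
ondemand_price = 1.00                            # flat on-demand price
W = 6                                            # instance-hours needed by deadline

# Decision vector z = [x_1..x_T, y_1..y_T]: spot and on-demand counts per slot.
c = np.concatenate([spot_price, np.full(T, ondemand_price)])

# Deadline constraint: total instance-hours across the horizon must reach W.
work = LinearConstraint(np.ones(2 * T), lb=W, ub=np.inf)

# Spot usage is capped by availability; on-demand capped at 3 per slot here.
bounds = Bounds(lb=np.zeros(2 * T),
                ub=np.concatenate([spot_avail, np.full(T, 3)]))

res = milp(c, constraints=work, integrality=np.ones(2 * T), bounds=bounds)
# In this toy instance the optimum uses only spot capacity (cost 1.75).
print(res.x, res.fun)
```

The real formulation in the paper additionally models price and availability *dynamics*; this sketch only shows the static skeleton of a mixed-instance cost minimization.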
📝 Abstract
As foundation models grow in size, fine-tuning them becomes increasingly expensive. While GPU spot instances offer a low-cost alternative to on-demand resources, their volatile prices and availability make deadline-aware scheduling particularly challenging. We tackle this difficulty by using a mix of spot and on-demand instances. Distinctively, we show the predictability of prices and availability in a spot instance market, the power of prediction in enabling cost-efficient scheduling, and its sensitivity to estimation errors. We formulate an integer programming problem that captures the use of mixed instances under both price and availability dynamics. We propose a prediction-based online allocation algorithm, built on the committed horizon control approach, that leverages a *commitment level* to enforce partial commitment to a sequence of decisions. For cases where predictions become inaccurate, we further present a complementary online algorithm that requires no predictions. An online policy selection algorithm then learns the best policy from a pool constructed by varying the parameters of both algorithms. We prove that the prediction-based algorithm achieves tighter performance bounds as prediction error decreases, while the policy selection algorithm possesses a regret bound of $\mathcal{O}(\sqrt{T})$. Experimental results demonstrate that our online framework can adaptively select the best policy under varying spot market dynamics and prediction quality, consistently outperforming baselines and improving utility by up to 54.8%.
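One standard way to select among a pool of policies with $\mathcal{O}(\sqrt{T})$ regret is the multiplicative-weights (Hedge) algorithm; the sketch below is a generic illustration of that technique, not necessarily the paper's construction. The function name, loss matrix, and learning rate are assumptions for the example.

```python
import numpy as np

def hedge_select(losses, eta, seed=0):
    """Multiplicative-weights (Hedge) selection over a pool of K policies.

    losses: (T, K) array of per-round losses in [0, 1], one column per policy.
    eta: learning rate; eta ~ sqrt(ln K / T) gives O(sqrt(T ln K)) regret.
    Returns the sequence of chosen policy indices and the final weights.
    """
    T, K = losses.shape
    log_w = np.zeros(K)              # log-weights for numerical stability
    rng = np.random.default_rng(seed)
    choices = []
    for t in range(T):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()                 # sampling distribution over policies
        choices.append(int(rng.choice(K, p=p)))
        log_w -= eta * losses[t]     # downweight policies that did poorly
    return choices, np.exp(log_w - log_w.max())

# Toy run: policy 0 consistently incurs lower loss than policy 1.
T = 200
losses = np.column_stack([np.full(T, 0.1), np.full(T, 0.9)])
eta = np.sqrt(np.log(2) / T)
choices, weights = hedge_select(losses, eta)
```

After enough rounds the weight mass concentrates on the better policy, so later choices are almost always policy 0, mirroring how an adaptive selector can track the best member of a policy pool.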