🤖 AI Summary
To address the challenge of cost-efficient and reliable AI model serving using spot instances across regions and clouds, this paper proposes SkyServe, a system built on the SpotHedge fault-tolerant scheduling policy. SpotHedge makes spot instances reliable enough for production AI services through three key mechanisms: (1) dispersing spot replicas across failure domains, (2) proactive over-provisioning of spot replicas, and (3) dynamic fallback to on-demand instances. The system unifies spot lifecycle-aware scheduling, SLA-driven elastic scaling, distributed multi-cloud orchestration, and low-overhead workload hot migration. Experimental evaluation shows that SkyServe reduces average deployment cost by 43% compared to an all-on-demand baseline while maintaining high resource availability. Compared to other research and production systems, it improves median (P50), 90th-percentile (P90), and 99th-percentile (P99) latency by 2.3×, 2.1×, and 2.1× on average, respectively. Together, these results make spot infrastructure a practical basis for dependable, large-scale AI serving.
📝 Abstract
Recent years have witnessed an explosive growth of AI models. The high cost of hosting AI services on GPUs, together with their demanding service requirements, makes it both timely and challenging to lower service costs while guaranteeing service quality. While spot instances have long been offered at a large discount, spot preemptions have discouraged users from using them to host model replicas when serving AI models. To address this, we propose a simple yet efficient policy, SpotHedge, that leverages spot replicas across different failure domains (e.g., regions and clouds) to ensure availability, lower costs, and high service quality. SpotHedge intelligently spreads spot replicas across different regions and clouds to improve availability and reduce correlated preemptions, overprovisions more cheap spot replicas than required as a safeguard against possible preemptions, and dynamically falls back to on-demand replicas when spot replicas become unavailable. We built SkyServe, a system leveraging SpotHedge to efficiently serve AI models over a mixture of spot and on-demand replicas across regions and clouds. We compared SkyServe with both research and production systems on real AI workloads: SkyServe reduces cost by 43% on average while achieving high resource availability compared to using on-demand replicas. Additionally, SkyServe improves P50, P90, and P99 latency by 2.3$\times$, 2.1$\times$, and 2.1$\times$ on average compared to other research and production systems.
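The three behaviours the abstract attributes to SpotHedge (spreading, overprovisioning, and on-demand fallback) can be illustrated with a minimal Python sketch. It is not SkyServe's implementation or API: the names `Plan` and `plan_replicas`, the even round-robin spread, and the rule "one on-demand replica per missing ready spot replica" are all assumptions made here for illustration, and the actual policy in the paper may differ.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Plan:
    """Hypothetical result type: desired replica counts for one planning step."""
    spot_by_zone: dict[str, int]  # spot replicas to request per failure domain
    on_demand: int                # on-demand replicas to keep as fallback


def plan_replicas(target: int, extra_spot: int, zones: list[str],
                  ready_spot: int) -> Plan:
    """Illustrative planning step (assumed logic, not the paper's algorithm).

    target      -- replicas needed to serve the current load
    extra_spot  -- spot replicas requested beyond `target` as a safety margin
    zones       -- failure domains (e.g. regions or clouds) to spread over
    ready_spot  -- spot replicas that are currently up and serving
    """
    if not zones:
        raise ValueError("at least one failure domain is required")

    # Overprovision: ask for more spot replicas than strictly required.
    spot_goal = target + extra_spot

    # Spread: distribute the spot goal as evenly as possible across domains,
    # so one correlated preemption event cannot take out every replica.
    base, remainder = divmod(spot_goal, len(zones))
    spot_by_zone = {
        zone: base + (1 if i < remainder else 0)
        for i, zone in enumerate(zones)
    }

    # Fallback: cover any shortfall of ready spot replicas with on-demand
    # replicas; the shortfall shrinks to zero as spot capacity recovers.
    on_demand = max(0, target - ready_spot)

    return Plan(spot_by_zone=spot_by_zone, on_demand=on_demand)


if __name__ == "__main__":
    # Zone names below are placeholders, not real provider identifiers.
    print(plan_replicas(target=4, extra_spot=2,
                        zones=["zone-a", "zone-b", "zone-c"], ready_spot=1))
```

In this sketch the on-demand count depends only on how many spot replicas are actually ready, so on-demand capacity is released automatically once enough spot replicas come back.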