Spot-and-Scoot: Peeking Into Spot Instance Availability

📅 2026-04-08

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

High observation costs hinder effective monitoring of the dynamic availability of cloud spot instances. This work proposes Spot-and-Scoot (SnS), a method that cancels spot instance requests as soon as they are accepted, obtaining binary availability signals at near-zero instance cost, and that estimates available capacity by submitting concurrent requests. It actively probes availability using early-stage signals from the cloud provider's provisioning lifecycle. An analysis of 2,635 real-world interruption events shows that co-interruptions within the same instance type and availability zone occur within three minutes in over 92% of cases. Experiments across 68 instance types and 15 regions on AWS and Azure show an F1-macro score of up to 0.90 for current availability modeling and 0.85 at a 60-minute prediction horizon. A trace-driven simulation with TPC-DS workloads further indicates that SnS-based prediction can reduce lost computation compared to an unguided baseline.
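The probing idea summarized above (submit a request, read the accept/reject outcome, cancel at once, and repeat concurrently to gauge capacity) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: `FakeSpotClient`, its three methods, `probe_once`, `probe`, and `ProbeResult` are hypothetical names, not a real cloud SDK and not the authors' implementation. The fake client simply accepts the first `capacity` requests it sees.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class ProbeResult:
    submitted: int  # concurrent requests sent at this measurement point
    accepted: int   # how many the provider accepted for provisioning

    @property
    def available(self) -> bool:
        # Binary availability signal: at least one request was accepted.
        return self.accepted > 0


class FakeSpotClient:
    """Hypothetical in-memory stand-in for a provider SDK (not a real API).

    Accepts the first `capacity` requests it sees and rejects the rest.
    """

    def __init__(self, capacity):
        self._capacity = capacity
        self._lock = threading.Lock()
        self._decisions = {}
        self.cancelled = []

    def submit_request(self, instance_type, zone):
        with self._lock:
            index = len(self._decisions)
            request_id = f"req-{index}"
            self._decisions[request_id] = index < self._capacity
            return request_id

    def wait_for_decision(self, request_id):
        with self._lock:
            return self._decisions[request_id]

    def cancel_request(self, request_id):
        with self._lock:
            self.cancelled.append(request_id)


def probe_once(client, instance_type, zone):
    """Submit one request, read its outcome, and always cancel it."""
    request_id = client.submit_request(instance_type, zone)
    try:
        return client.wait_for_decision(request_id)
    finally:
        # Cancel in every case so no request lingers and no instance keeps running.
        client.cancel_request(request_id)


def probe(client, instance_type, zone, concurrency=4):
    """Send `concurrency` requests in parallel and count acceptances."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        outcomes = list(
            pool.map(lambda _: probe_once(client, instance_type, zone),
                     range(concurrency))
        )
    return ProbeResult(submitted=concurrency, accepted=sum(outcomes))


result = probe(FakeSpotClient(capacity=2), "m5.large", "us-east-1a", concurrency=4)
print(result.available, result.accepted, result.submitted)
```

In this sketch the accepted count serves as a rough capacity estimate, bounded above by the chosen concurrency. A real client would need provider-specific request, status, and cancellation calls, which are deliberately left out here.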

📝 Abstract

Spot instances offer significant cost savings of up to 90% over on-demand prices, making them an attractive resource for large-scale computing workloads. However, understanding their availability dynamics is essential for building systems that tolerate interruptions, and observing this availability directly requires keeping instances running, which incurs costs that scale with the number of monitored instance types and their per-instance price. We propose Spot-and-Scoot (SnS), a cost-efficient method that collects spot instance availability signals by leveraging the cloud provider's provisioning lifecycle. Since the outcome of a spot request is determined before the instance enters the running state, SnS submits requests and cancels them upon provisioning acceptance, collecting binary availability signals at near-zero instance cost. Submitting multiple concurrent requests per measurement point further yields a quantitative estimate of available capacity. We validate SnS through simultaneous collection of probing signals and actual running instance traces across 68 instance types and 15 regions on both AWS and Azure, totaling 336,033 spot requests. Analysis of 2,635 real-world interruption events reveals that co-interruptions within the same instance type and availability zone occur within three minutes in over 92% of cases, motivating a binary availability formulation. Based on this formulation, we derive three complementary features from SnS signals and demonstrate that their combination achieves an F1-macro score of up to 0.90 for current availability modeling and maintains 0.85 at a 60-minute prediction horizon. A trace-driven simulation using TPC-DS workloads further demonstrates the potential of SnS-based prediction to reduce lost computation compared to an unguided baseline.
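For readers unfamiliar with the F1-macro metric reported above, the following Python sketch shows how it is computed for a binary availability label (1 = available, 0 = unavailable). The label sequences are made up for illustration and are not data from the paper. The helper name `f1_macro` and the choice to score a class with no true or predicted examples as 0.0 are assumptions of this sketch.

```python
def f1_macro(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # Assumed convention: a class with no support and no predictions scores 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up example: six measurement points, 1 = available, 0 = unavailable.
observed = [1, 1, 1, 1, 0, 0]
predicted = [1, 1, 1, 0, 0, 1]
print(f1_macro(observed, predicted))  # 0.625
```

Because each class contributes equally to the average, macro-averaging keeps the rarer "unavailable" class from being drowned out by the more common "available" class, which matters when interruptions are infrequent.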

Problem

Research questions and friction points this paper is trying to address.

spot instances

availability monitoring

cost-efficient probing

interruption tolerance

cloud computing

Innovation

Methods, ideas, or system contributions that make the work stand out.

spot instances

availability probing

cost-efficient monitoring