🤖 AI Summary
This work addresses the rigidity of the on-demand and spot instance models in public clouds, which struggle to satisfy time-varying tenant objectives and operator constraints in heterogeneous, oversubscribed clusters without exposing internal application or infrastructure state. The paper presents LaissezCloud, a cloud resource management platform that continuously renegotiates running allocations: tenants and operators update bids online, and a running tenant keeps a resource only while its bid exceeds competing demand. Pricing also aligns incentives between mutually untrusted participants, letting operators price in power, cooling, or carbon constraints without disclosing internal telemetry. Across a diverse set of accelerator workloads, LaissezCloud reduces performance degradation under contention by 8–23% versus on-demand and spot baselines, and it scales to clusters of at least 10,000 nodes.
📝 Abstract
Public clouds increasingly expose heterogeneous hardware, but their allocation interface remains built around rigid on-demand and spot service classes. This makes it hard to satisfy time-varying tenant objectives and operator constraints in oversubscribed, heterogeneous clusters without exposing internal application or infrastructure state. We present LaissezCloud, a cloud resource management platform for continuous re-negotiation of running allocations. Unlike spot instances, which use launch-time bids and unilateral preemption, LaissezCloud keeps allocations continuously contestable during execution: tenants and operators update bids online, and a running tenant keeps a resource only as long as its bid exceeds competing demand. Pricing serves both as a narrow waist and as an incentive-alignment mechanism between mutually untrusted participants: tenants express utility through bids, while operators price in power, cooling, or carbon constraints without exposing internal telemetry. Across a diverse set of accelerator workloads, LaissezCloud reduces performance degradation under contention by 8-23% versus on-demand and spot baselines, and scales to clusters of at least 10,000 nodes.
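The "continuously contestable" rule described in the abstract can be illustrated with a small sketch. This is not LaissezCloud's implementation or API. The class `ContestableResource`, its methods, the single operator `reserve_price` standing in for power, cooling, or carbon costs, and the rule that the incumbent keeps the resource on a tie are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ContestableResource:
    """Hypothetical model of one resource whose holder is re-decided on every bid change."""

    # Assumption: the operator folds its constraints (power, cooling, carbon)
    # into a single floor price instead of exposing telemetry.
    reserve_price: float = 0.0
    bids: Dict[str, float] = field(default_factory=dict)
    holder: Optional[str] = None

    def update_bid(self, tenant: str, price: float) -> Optional[str]:
        """A tenant revises its bid at any time; allocation is re-evaluated."""
        self.bids[tenant] = price
        return self._reallocate()

    def update_reserve(self, price: float) -> Optional[str]:
        """The operator revises its floor price; allocation is re-evaluated."""
        self.reserve_price = price
        return self._reallocate()

    def _reallocate(self) -> Optional[str]:
        # Only bids at or above the operator's floor can hold the resource.
        eligible = {t: p for t, p in self.bids.items() if p >= self.reserve_price}
        if not eligible:
            self.holder = None
            return None
        top = max(eligible.values())
        # Assumption: the incumbent keeps the resource on a tie, so it is
        # displaced only by a strictly higher competing bid.
        if self.holder in eligible and eligible[self.holder] == top:
            return self.holder
        self.holder = max(eligible, key=eligible.get)
        return self.holder
```

In this sketch, calling `update_bid("b", 3.0)` while tenant `a` holds the resource at 2.0 hands it to `b`, and raising the reserve above every bid leaves the resource unallocated. A real system would also need charging rules and migration costs, which the abstract does not specify and which are omitted here.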