🤖 AI Summary
This work addresses the challenge of coupling slow resource provisioning with fast scheduling decisions in network resource allocation, where the former incurs switching costs and the latter must satisfy budget constraints that change with each provisioning decision. To this end, we propose the first bi-level online learning framework that integrates Online Convex Optimization (OCO) with Constrained Markov Decision Processes (CMDPs). The upper level performs budget allocation via OCO with switching costs, while the lower level executes state-dependent safe scheduling based on CMDPs. A novel dual feedback mechanism returns the lower level's budget multiplier to the upper level as sensitivity information, which couples the two levels through the cross-level constraints. Additionally, we introduce a budget-adaptive safe exploration strategy to handle the time-varying constraint threshold. Theoretical analysis shows that the proposed method achieves near-optimal cumulative regret while satisfying the cross-level constraints with high probability, offering guarantees on both performance and feasibility.
📝 Abstract
We study a bi-level online provisioning and scheduling problem motivated by network resource allocation, where provisioning decisions are made at a slow time scale while queue-/state-dependent scheduling is performed at a fast time scale. We model this two-time-scale interaction using an upper-level online convex optimization (OCO) problem and a lower-level constrained Markov decision process (CMDP). Existing OCO methods typically assume stateless decisions and thus cannot capture MDP-driven network dynamics such as queue evolution. Meanwhile, CMDP algorithms typically assume a fixed constraint threshold, whereas in provisioning-and-scheduling systems the threshold varies with online budget decisions. To address these gaps, we study bi-level OCO-CMDP learning under switching costs (budget reprovisioning/system reconfiguration) and cross-level constraints that couple budgets to scheduling decisions. Our new algorithm solves this learning problem via several non-trivial developments, including a carefully designed dual feedback that returns the budget multiplier as sensitivity information for the upper-level update, and a lower level that solves a budget-adaptive safe exploration problem via an extended occupancy-measure linear program. We establish near-optimal regret and high-probability satisfaction of the cross-level constraints.
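To make the dual-feedback idea concrete, the following is a minimal illustrative sketch, not the paper's algorithm. It assumes a one-state toy lower level (a degenerate CMDP whose budget-constrained linear program can be solved in closed form) and a plain projected online gradient step at the upper level. All names and parameters (`lower_level`, `upper_level`, `price`, `eta`, the cost values) are hypothetical and chosen only for illustration. The lower level returns the multiplier of its budget constraint, and the upper level uses it as the sensitivity term in its gradient.

```python
def lower_level(budget, idle_cost=1.0, serve_cost=0.2):
    """Toy one-state stand-in for the lower-level CMDP (assumption).

    Chooses the probability p of the "serve" action to minimize expected
    cost subject to expected resource usage p <= budget.
    Assumes idle_cost > serve_cost, so serving is always preferred.
    Returns (p, expected_cost, multiplier).
    """
    p = max(0.0, min(1.0, budget))
    cost = (1.0 - p) * idle_cost + p * serve_cost
    # Multiplier of the budget constraint: the marginal cost reduction per
    # unit of extra budget while the constraint binds, zero once it is slack.
    multiplier = (idle_cost - serve_cost) if budget < 1.0 else 0.0
    return p, cost, multiplier


def upper_level(T=200, price=0.5, eta=0.05, b_min=0.0, b_max=2.0, b0=0.2):
    """Projected online gradient descent on the budget (assumed update rule).

    The per-round objective is price * b plus the lower-level optimal cost,
    so a subgradient with respect to b is price - multiplier.
    Returns the budget trajectory and the accumulated switching cost.
    """
    b = b0
    budgets = [b]
    switching = 0.0
    for _ in range(T):
        _, _, lam = lower_level(b)       # dual feedback from the lower level
        grad = price - lam               # sensitivity-based subgradient
        b_next = min(b_max, max(b_min, b - eta * grad))  # project onto [b_min, b_max]
        switching += abs(b_next - b)     # movement cost of reprovisioning
        b = b_next
        budgets.append(b)
    return budgets, switching


budgets, switching = upper_level()
print(f"final budget: {budgets[-1]:.3f}, total switching: {switching:.3f}")
```

In this toy setting the budget climbs while extra budget is worth more than its price and then hovers near the point where the constraint stops binding. The small step size `eta` bounds how far the budget moves per round, which is one simple way to keep switching costs under control. The paper's lower level instead uses an extended occupancy-measure linear program with safe exploration, which this sketch does not attempt to reproduce.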