🤖 AI Summary
We address sequential decision-making problems—such as ride-hailing dispatch, wireless communication scheduling, and content recommendation—where both resource availability and rewards are unknown a priori and probing incurs high cost. To this end, we propose the Probe-Enhanced User-Centric Selection (PUCS) framework, enabling a two-phase "probe-then-assign" policy. First, we formulate PUCS as a unified optimization model and design an offline greedy probing algorithm achieving a constant approximation ratio ζ = (e−1)/(2e−1). Second, we develop the online learning algorithm OLPA, attaining a regret bound of O(√T + ln²T), together with an Ω(√T) lower bound showing this is tight up to logarithmic factors. Leveraging combinatorial stochastic bandit learning, adaptive information probing strategies, and rigorous probabilistic analysis, OLPA significantly outperforms state-of-the-art baselines on real-world datasets, empirically validating both the efficacy of proactive probing and the user-centric assignment paradigm.
📝 Abstract
We formalize sequential decision-making with information acquisition as the probing-augmented user-centric selection (PUCS) framework, where a learner first probes a subset of arms to obtain side information on resources and rewards, and then assigns $K$ plays to $M$ arms. PUCS covers applications such as ridesharing, wireless scheduling, and content recommendation, in which both resources and payoffs are initially unknown and probing is costly. For the offline setting with known distributions, we present a greedy probing algorithm with a constant-factor approximation guarantee $\zeta = (e-1)/(2e-1)$. For the online setting with unknown distributions, we introduce OLPA, a stochastic combinatorial bandit algorithm that achieves a regret bound $\mathcal{O}(\sqrt{T} + \ln^{2} T)$. We also prove a lower bound $\Omega(\sqrt{T})$, showing that the upper bound is tight up to logarithmic factors. Experiments on real-world data demonstrate the effectiveness of our solutions.
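To make the offline "probe-then-assign" idea concrete, here is a minimal sketch of the greedy probing step: repeatedly add the arm with the largest marginal gain to the probe set until the probing budget is exhausted. The function names (`greedy_probe_set`, `marginal_value`) are illustrative placeholders, not the paper's actual implementation, and the value oracle here is an assumption; the paper's guarantee $\zeta = (e-1)/(2e-1)$ applies to its full formulation, not to this toy.

```python
def greedy_probe_set(arms, budget, marginal_value):
    """Greedily build a probe set under a cardinality budget (illustrative).

    arms           -- iterable of arm identifiers
    budget         -- maximum number of arms to probe
    marginal_value -- oracle: marginal_value(chosen, a) estimates the gain
                      in expected assignment value from also probing arm `a`
                      given the set `chosen` (hypothetical interface)
    """
    chosen = []
    remaining = set(arms)
    for _ in range(budget):
        best_arm, best_gain = None, 0.0
        for a in remaining:
            gain = marginal_value(chosen, a)
            if gain > best_gain:
                best_arm, best_gain = a, gain
        if best_arm is None:  # no arm adds positive value; stop early
            break
        chosen.append(best_arm)
        remaining.remove(best_arm)
    return chosen
```

With a simple additive value oracle (each arm contributes a fixed estimated gain), the greedy loop just picks the top-`budget` arms; the interesting behavior arises when gains are diminishing, which is the regime the paper's approximation analysis addresses.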