Learning to Allocate Resources with Censored Feedback

📅 2026-02-06

🤖 AI Summary

This work addresses online resource allocation under censored feedback, where an arm yields a reward only when it is activated and its allocated budget exceeds a random threshold, which makes learning an optimal policy harder. The authors study how to allocate a budget across K arms at each round, jointly estimating the unknown parameters while balancing exploration and exploitation to maximize cumulative reward. They prove an information-theoretic regret lower bound of Ω(T^{1/3}). Two optimistic upper confidence bound (UCB)-based algorithms are proposed: RA-UCB, which achieves Õ(√T) regret when the budget is known at the start of each round (and O(poly-log T) under stronger assumptions), and MG-UCB, which handles an unknown, round-dependent budget through within-round switching and infinitesimal allocations and matches the regret guarantees of RA-UCB. The theoretical results are validated by experiments on real-world datasets.

📝 Abstract

We study the online resource allocation problem in which, at each round, a budget $B$ must be allocated across $K$ arms under censored feedback. An arm yields a reward if and only if two conditions are satisfied: (i) the arm is activated according to an arm-specific Bernoulli random variable with unknown parameter, and (ii) the allocated budget exceeds a random threshold drawn from a parametric distribution with unknown parameter. Over $T$ rounds, the learner must jointly estimate the unknown parameters and allocate the budget so as to maximize cumulative reward while facing the exploration-exploitation trade-off. We prove an information-theoretic regret lower bound of $\Omega(T^{1/3})$, demonstrating the intrinsic difficulty of the problem. We then propose RA-UCB, an optimistic algorithm that leverages non-trivial parameter estimation and confidence bounds. When the budget $B$ is known at the beginning of each round, RA-UCB achieves a regret of order $\widetilde{\mathcal{O}}(\sqrt{T})$, and even $\mathcal{O}(\mathrm{poly}\text{-}\log T)$ under stronger assumptions. For an unknown, round-dependent budget, we introduce MG-UCB, which allows within-round switching and infinitesimal allocations and matches the regret guarantees of RA-UCB. We then validate our theoretical results through experiments on real-world datasets.
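To make the two reward conditions concrete, here is a minimal simulation sketch of one round of feedback. It is not the paper's code: the abstract only says the threshold follows a parametric distribution, so the exponential threshold, the unit reward, and the names `simulate_round`, `expected_reward`, `activation_probs`, and `threshold_rates` are assumptions made for illustration.

```python
import numpy as np


def simulate_round(allocation, activation_probs, threshold_rates, rng):
    """Draw one round of 0/1 rewards for every arm.

    Assumption (not from the paper): thresholds are exponential with
    rate `threshold_rates[k]`, and a successful arm pays a reward of 1.
    """
    allocation = np.asarray(allocation, dtype=float)
    # Condition (i): the arm is activated with its own Bernoulli parameter.
    activated = rng.random(allocation.shape) < activation_probs
    # Condition (ii): the allocated budget exceeds a random threshold.
    thresholds = rng.exponential(scale=1.0 / threshold_rates)
    # In this sketch the learner sees only the outcome, not which
    # condition failed when the reward is 0.
    return (activated & (allocation > thresholds)).astype(int)


def expected_reward(allocation, activation_probs, threshold_rates):
    """Expected total reward under the exponential-threshold assumption:
    sum_k p_k * (1 - exp(-rate_k * x_k))."""
    allocation = np.asarray(allocation, dtype=float)
    success = 1.0 - np.exp(-threshold_rates * allocation)
    return float(np.sum(activation_probs * success))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = np.array([0.9, 0.5, 0.2])      # hypothetical activation parameters
    rates = np.array([1.0, 2.0, 0.5])  # hypothetical threshold rates
    x = np.array([1.0, 0.5, 0.5])      # one way to split a budget B = 2
    print(simulate_round(x, p, rates, rng))
    print(expected_reward(x, p, rates))
```

Under these assumptions, a learner that knew `p` and `rates` could compare allocations with `expected_reward`; the learning problem arises because both parameter vectors are unknown and must be estimated from the 0/1 outcomes alone.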

Problem

Research questions and friction points this paper is trying to address.

online resource allocation

censored feedback

exploration-exploitation trade-off

budget allocation

multi-armed bandits

Innovation

Methods, ideas, or system contributions that make the work stand out.

censored feedback

online resource allocation

regret lower bound