π€ AI Summary
This paper studies the bandits with knapsacks (BwK) problem under non-stationary contexts: resource initializations are heterogeneous across rounds, context distributions evolve over time, yet all rounds share a common latent conversion model. Motivated by dynamic pricing and repeated auctions, we propose the first BwK learning framework that incorporates unlabeled feature dataβmarking the first approach to efficiently handle infinite-state-space reinforcement learning within contextual BwK. Methodologically, we integrate upper confidence bound (UCB) principles with conversion model estimation, leveraging a confidence-interval oracle with $o(T)$ regret guarantee to jointly adapt to non-stationarity and budget constraints. Theoretically, our algorithm achieves a sublinear $O(sqrt{T})$ regret bound, substantially improving upon existing baselines. This work establishes a novel paradigm for sequential decision-making under resource constraints and dynamically evolving contextual information.
π Abstract
We study an online setting, where a decision maker (DM) interacts with contextual bandit-with-knapsack (BwK) instances in repeated episodes. These episodes start with different resource amounts, and the contexts' probability distributions are non-stationary in an episode. All episodes share the same latent conversion model, which governs the random outcome contingent upon a request's context and an allocation decision. Our model captures applications such as dynamic pricing on perishable resources with episodic replenishment, and first price auctions in repeated episodes with different starting budgets. We design an online algorithm that achieves a regret sub-linear in $T$, the number of episodes, assuming access to a emph{confidence bound oracle} that achieves an $o(T)$-regret. Such an oracle is readily available from existing contextual bandit literature. We overcome the technical challenge with arbitrarily many possible contexts, which leads to a reinforcement learning problem with an unbounded state space. Our framework provides improved regret bounds in certain settings when the DM is provided with unlabeled feature data, which is novel to the contextual BwK literature.