Episodic Contextual Bandits with Knapsacks under Conversion Models

📅 2025-07-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper studies the contextual bandits with knapsacks (BwK) problem in an episodic, non-stationary setting: initial resource amounts differ across episodes and context distributions evolve over time, yet all episodes share a common latent conversion model. Motivated by dynamic pricing and repeated auctions, the paper proposes a BwK learning framework that incorporates unlabeled feature data, which it presents as the first approach to efficiently handle infinite-state-space reinforcement learning within contextual BwK. Methodologically, it integrates upper confidence bound (UCB) principles with conversion model estimation, leveraging a confidence-interval oracle with an $o(T)$ regret guarantee to adapt jointly to non-stationarity and budget constraints. Theoretically, the algorithm achieves a sublinear $O(\sqrt{T})$ regret bound, substantially improving upon existing baselines. This work establishes a novel paradigm for sequential decision-making under resource constraints and dynamically evolving contextual information.

📝 Abstract

We study an online setting in which a decision maker (DM) interacts with contextual bandit-with-knapsack (BwK) instances in repeated episodes. These episodes start with different resource amounts, and the contexts' probability distributions are non-stationary within an episode. All episodes share the same latent conversion model, which governs the random outcome contingent upon a request's context and an allocation decision. Our model captures applications such as dynamic pricing of perishable resources with episodic replenishment, and first-price auctions in repeated episodes with different starting budgets. We design an online algorithm that achieves a regret sub-linear in $T$, the number of episodes, assuming access to a confidence bound oracle that achieves an $o(T)$-regret. Such an oracle is readily available from existing contextual bandit literature. We overcome the technical challenge posed by arbitrarily many possible contexts, which leads to a reinforcement learning problem with an unbounded state space. Our framework provides improved regret bounds in certain settings when the DM is provided with unlabeled feature data, which is novel to the contextual BwK literature.
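To make the episodic setting concrete, here is a minimal Python sketch of the interaction loop described in the abstract. It is not the paper's algorithm. It assumes a small finite set of contexts and prices (the paper allows arbitrarily many contexts), uses a plain count-based UCB in place of the confidence bound oracle, and picks prices greedily without accounting for the opportunity cost of the remaining budget. The names `run_episodes`, `ucb`, `conversion`, and `context_probs` are hypothetical and chosen only for illustration.

```python
import math
import random


def ucb(successes, pulls, total, bonus=1.0):
    """Optimistic estimate of a conversion probability, clipped to [0, 1]."""
    if pulls == 0:
        return 1.0  # unexplored (context, price) pairs are treated optimistically
    mean = successes / pulls
    radius = bonus * math.sqrt(math.log(max(total, 2)) / pulls)
    return min(1.0, mean + radius)


def run_episodes(conversion, prices, budgets, horizon, context_probs, seed=0):
    """Simulate repeated pricing episodes that share one conversion model.

    conversion[c][a]  : true sale probability for context c at price index a
                        (hidden from the decision maker, used only to simulate)
    budgets[t]        : starting inventory of episode t (may differ per episode)
    context_probs(s)  : context distribution at step s (non-stationary in s)
    """
    rng = random.Random(seed)
    n_ctx, n_arms = len(conversion), len(prices)
    # Statistics persist across episodes because the conversion model is shared.
    pulls = [[0] * n_arms for _ in range(n_ctx)]
    wins = [[0] * n_arms for _ in range(n_ctx)]
    total = 0
    history = []

    for budget in budgets:
        remaining, revenue = budget, 0.0
        for step in range(horizon):
            if remaining == 0:
                break  # knapsack constraint: no inventory left in this episode
            ctx = rng.choices(range(n_ctx), weights=context_probs(step))[0]
            total += 1
            # Greedy optimistic choice: maximize price times UCB on conversion.
            scores = [
                prices[a] * ucb(wins[ctx][a], pulls[ctx][a], total)
                for a in range(n_arms)
            ]
            arm = max(range(n_arms), key=scores.__getitem__)
            sold = rng.random() < conversion[ctx][arm]
            pulls[ctx][arm] += 1
            if sold:
                wins[ctx][arm] += 1
                remaining -= 1
                revenue += prices[arm]
        history.append((budget, budget - remaining, revenue))

    return history, pulls
```

Calling `run_episodes` with, say, two contexts, two prices, budgets `[3, 10, 5]`, and a `context_probs` function that shifts its weights halfway through the horizon returns one `(budget, units_sold, revenue)` tuple per episode. Because the counts in `pulls` and `wins` carry over between episodes, later episodes benefit from what earlier ones learned about the shared conversion model, which is the structural feature the abstract emphasizes.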

Problem

Research questions and friction points this paper is trying to address.

Online decision-making with varying resource amounts and non-stationary contexts

Dynamic pricing and auctions under episodic replenishment and budgets

Regret minimization in contextual bandits with knapsacks and conversion models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Episodic contextual bandits with knapsacks

Sub-linear regret via confidence bound oracle

Handles unbounded state space in RL

🔎 Similar Papers

Towards Domain Adaptive Neural Contextual Bandits