Episodic Contextual Bandits with Knapsacks under Conversion Models

πŸ“… 2025-07-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This paper studies the bandits with knapsacks (BwK) problem under non-stationary contexts: resource initializations are heterogeneous across episodes, context distributions evolve over time, yet all episodes share a common latent conversion model. Motivated by dynamic pricing and repeated auctions, we propose the first contextual BwK learning framework that incorporates unlabeled feature data, and the first to efficiently handle the infinite-state-space reinforcement learning problem that arises within contextual BwK. Methodologically, we integrate upper confidence bound (UCB) principles with conversion model estimation, leveraging a confidence-bound oracle with an $o(T)$ regret guarantee to jointly adapt to non-stationarity and budget constraints. Theoretically, our algorithm achieves a sublinear $O(\sqrt{T})$ regret bound, substantially improving upon existing baselines. This work establishes a novel paradigm for sequential decision-making under resource constraints and dynamically evolving contextual information.
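The UCB-plus-budget interplay described above can be illustrated with a toy simulation. Everything below is a hypothetical sketch, not the paper's algorithm: the prices, the logistic-style conversion model, and the greedy "spend while budget remains" rule are illustrative assumptions, standing in for the confidence-bound oracle and episodic structure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: K candidate prices and a latent conversion
# model shared across all episodes (sale probability falls with price).
K = 5
prices = np.linspace(1.0, 5.0, K)
true_conv = 1.0 / (1.0 + prices)      # latent conversion probabilities

counts = np.zeros(K)                  # times each price was offered
conv_sum = np.zeros(K)                # observed conversions per price

T = 200                               # episodes
H = 50                                # requests per episode
total_revenue = 0.0

for t in range(T):
    # Heterogeneous resource initialization per episode.
    budget = int(rng.integers(10, 30))
    for _ in range(H):
        if budget <= 0:
            break                     # knapsack exhausted for this episode
        # Optimistic (UCB) estimate of each price's conversion rate;
        # unseen prices default to the optimistic value 1.
        mean = np.divide(conv_sum, counts, out=np.ones(K), where=counts > 0)
        bonus = np.sqrt(2 * np.log(T * H) / np.maximum(counts, 1))
        ucb = np.clip(mean + bonus, 0.0, 1.0)
        a = int(np.argmax(prices * ucb))   # maximize optimistic revenue
        sold = rng.random() < true_conv[a]
        counts[a] += 1
        conv_sum[a] += sold
        if sold:
            total_revenue += prices[a]
            budget -= 1               # one unit of resource consumed
```

The shared conversion model is why statistics (`counts`, `conv_sum`) persist across episodes even though each episode restarts with a fresh budget; the paper's oracle-based approach replaces the naive per-price UCB here with a contextual confidence-bound oracle.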

πŸ“ Abstract
We study an online setting, where a decision maker (DM) interacts with contextual bandit-with-knapsack (BwK) instances in repeated episodes. These episodes start with different resource amounts, and the contexts' probability distributions are non-stationary within an episode. All episodes share the same latent conversion model, which governs the random outcome contingent upon a request's context and an allocation decision. Our model captures applications such as dynamic pricing of perishable resources with episodic replenishment, and first-price auctions in repeated episodes with different starting budgets. We design an online algorithm that achieves a regret sub-linear in $T$, the number of episodes, assuming access to a *confidence bound oracle* that achieves an $o(T)$ regret. Such an oracle is readily available from the existing contextual bandit literature. We overcome the technical challenge of arbitrarily many possible contexts, which leads to a reinforcement learning problem with an unbounded state space. Our framework provides improved regret bounds in certain settings when the DM is provided with unlabeled feature data, which is novel in the contextual BwK literature.
Problem

Research questions and friction points this paper is trying to address.

Online decision-making with varying resource amounts and non-stationary contexts
Dynamic pricing and auctions under episodic replenishment and budgets
Regret minimization in contextual bandits with knapsacks and conversion models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Episodic contextual bandits with knapsacks
Sub-linear regret via confidence bound oracle
Handles unbounded state space in RL
πŸ”Ž Similar Papers
2024-07-24arXiv.orgCitations: 4
2024-02-27IEEE Transactions on Information TheoryCitations: 1
Wang Chi Cheung
Department of Industrial Systems Engineering and Management, National University of Singapore
Operations Research · Machine Learning
Zitian Li
Industrial Systems Engineering & Management, National University of Singapore, Engineering Drive 2 Block E1A 06-25 Singapore 117576