Tighter Regret Bounds for Contextual Action-Set Reinforcement Learning

📅 2026-05-15
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work studies episodic reinforcement learning in which the admissible action set changes from episode to episode and is observed at the start of each episode, with the goal of minimizing cumulative regret against the episode-wise optimal value. The paper extends the MVP algorithm to this contextual action-set setting and establishes minimax regret bounds of Õ(√(SAH³K log L)) under adversarial action contexts, where L is the number of possible contexts, and Õ(√(SAH³K)) under stochastic contexts, along with a sample complexity of Õ(SAH³/ε²) for a fixed context distribution. The authors also derive a gap-dependent regret bound that can substantially improve upon the minimax rate when the relevant suboptimality gaps are large.
📝 Abstract
We study episodic reinforcement learning with fixed reward and transition functions, but with episode-dependent admissible action sets that are observed at the start of each episode. Performance is measured by cumulative regret against the episode-wise optimal value, $\sum_{k=1}^K [V^{*,M^k} - V^{\pi^k,M^k}]$, where $M^k$ represents the action context in the $k$-th episode. We show that the MVP algorithm naturally extends to this framework and enjoys strong theoretical guarantees. In particular, we establish a minimax regret bound of $\widetilde{O}(\sqrt{SAH^3K\log L})$ for adversarial contexts, where $L$ denotes the number of possible contexts. This result implies a regret bound of $\widetilde{O}(\sqrt{SAH^3K})$ for stochastic contexts. We further translate the stochastic regret guarantee into a sample complexity bound of $\widetilde{O}(SAH^3/\epsilon^2)$ for a fixed context distribution. In addition, we derive a gap-dependent regret bound of \[ \widetilde O\left( \inf_{p\in [0,1)} \left( \frac{1}{\Delta_{\min}^{p}} + pK\Delta_{\min}^{p} \right)\log K \cdot \mathrm{poly}(S,A,H) \right), \] where $\Delta_{\min}^{p}$ is the global $p$-trimmed positive-gap floor over suboptimal $(h,s,a)$ triples. This bound can substantially improve upon the minimax rate when the relevant suboptimality gaps are large.
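To make the regret definition concrete, the following is a minimal sketch of how $\sum_{k=1}^K [V^{*,M^k} - V^{\pi^k,M^k}]$ could be evaluated on a toy tabular MDP. It is not the MVP algorithm and is not taken from the paper. It assumes, purely for illustration, that each context $M^k$ is a boolean mask of shape (S, A) marking admissible actions per state, that rewards and transitions are known, and that policies are deterministic; the function names `optimal_value`, `policy_value`, and `regret` are hypothetical.

```python
import numpy as np

def optimal_value(P, R, mask, H, s0=0):
    """Optimal H-step value from s0 when only actions with mask[s, a] == True
    are admissible. P has shape (S, A, S), R has shape (S, A).
    Assumes every state has at least one admissible action."""
    V = np.zeros(R.shape[0])
    for _ in range(H):
        Q = R + P @ V                     # (S, A) one-step lookahead
        Q = np.where(mask, Q, -np.inf)    # forbid inadmissible actions
        V = Q.max(axis=1)
    return V[s0]

def policy_value(P, R, policy, H, s0=0):
    """Value from s0 of a deterministic policy given as an (H, S) integer array."""
    S = R.shape[0]
    idx = np.arange(S)
    V = np.zeros(S)
    for h in reversed(range(H)):
        a = policy[h]
        V = R[idx, a] + P[idx, a] @ V
    return V[s0]

def regret(P, R, masks, policies, H, s0=0):
    """Cumulative regret against the episode-wise optimal value."""
    return sum(
        optimal_value(P, R, m, H, s0) - policy_value(P, R, pi, H, s0)
        for m, pi in zip(masks, policies)
    )

# Toy instance (hypothetical): 2 states, 2 actions, horizon 2, self-loops only.
S, A, H = 2, 2, 2
P = np.zeros((S, A, S))
P[0, :, 0] = 1.0
P[1, :, 1] = 1.0
R = np.array([[0.5, 1.0],    # in state 0, action 1 pays more than action 0
              [0.0, 0.0]])

all_actions = np.ones((S, A), dtype=bool)
only_action0 = np.array([[True, False], [True, False]])
always_0 = np.zeros((H, S), dtype=int)   # policy that always plays action 0

# Episode 1 allows both actions, episode 2 allows only action 0.
total = regret(P, R, [all_actions, only_action0], [always_0, always_0], H)
```

In this toy instance the fixed policy is suboptimal only in the episode where action 1 is admissible, which illustrates why the benchmark is the episode-wise optimal value rather than a single optimal value shared across episodes.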
Problem

Research questions and friction points this paper is trying to address.

contextual action-set
reinforcement learning
regret bounds
episodic RL
admissible action sets
Innovation

Methods, ideas, or system contributions that make the work stand out.

contextual action-set reinforcement learning
minimax regret bound
gap-dependent regret
MVP algorithm
sample complexity