Pure Exploration for a Good Policy in Reinforcement Learning with Bandit Feedback

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the Good Policy Identification (GPI) problem in episodic reinforcement learning with bandit feedback. Given a reward threshold $\mu_0$ and a fixed confidence level, the learner must either output a policy whose expected return is at least $\mu_0$ or declare that no such policy exists, while minimizing sample complexity. The authors propose BEE-GPI, a novel algorithm tailored to GPI, built on a pure exploration framework that combines confidence bounds with an adaptive sampling strategy to distinguish positive from negative instances. For positive instances, the coefficient of $\log(1/\delta)$ in the sample complexity upper bound is $O(H^2 / (V^* - \mu_0)^2)$, where $H$ is the episode length and $V^*$ is the optimal expected return in an episode; this coefficient does not otherwise depend on the sizes of the state and action spaces. Lower bounds establish that the $1/(V^* - \mu_0)^2$ dependence is necessary, making the algorithm nearly optimal. Numerical experiments further support its practical efficiency.
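To give a feel for the positive-instance scaling, the following minimal sketch evaluates the leading term $H^2 / (V^* - \mu_0)^2 \cdot \log(1/\delta)$ for illustrative values. It is not the paper's bound: the constants and lower-order terms hidden by the $O(\cdot)$ notation are omitted, and the function name `leading_term` and all numeric inputs are assumptions made for illustration.

```python
import math


def leading_term(H, v_star, mu0, delta):
    """Illustrative leading term H^2 / (V* - mu0)^2 * log(1/delta).

    Hypothetical helper: constants hidden by the O(.) notation are omitted,
    so the result shows scaling only, not an actual episode count.
    """
    gap = v_star - mu0
    if gap <= 0:
        raise ValueError("this term is only meaningful when V* > mu0")
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    return (H ** 2) / (gap ** 2) * math.log(1.0 / delta)


if __name__ == "__main__":
    # Made-up values: episode length 10, threshold 5.0, confidence 0.95.
    for v_star in (6.0, 5.5, 5.25):
        print(v_star, leading_term(H=10, v_star=v_star, mu0=5.0, delta=0.05))
```

Halving the gap $V^* - \mu_0$ multiplies the term by four, while the number of states and actions does not appear as an input at all, which mirrors the contrast with BPI drawn in the summary.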

📝 Abstract

Pure exploration in episodic Reinforcement Learning has primarily focused on Best Policy Identification (BPI), which seeks to identify a (near-)optimal policy with high confidence. Motivated by practical settings where a "good enough" policy suffices, we study an alternative objective, Good Policy Identification (GPI). For a given reward threshold $μ_0$, GPI only requires identifying a policy with expected reward in an episode at least $μ_0$ if such a policy exists (positive instance), or declaring None if no such policy exists (negative instance). We formalize GPI under the fixed-confidence setting: we require the output to be correct with probability $\geq 1-δ$, and seek to minimize the expected sample complexity, which is the expected number of episodes explored before producing the output. We propose a novel algorithm, BEE-GPI, and derive theoretically grounded upper bounds on its sample complexity for positive and negative instances. Notably, for positive instances, the coefficient of $\log(1/δ)$ in our upper bound is $O(H^2/(V^* - μ_0)^2)$, where $H$ is the episode length and $V^*$ is the optimal expected reward in an episode. The coefficient does not otherwise depend on the action and state space sizes, in sharp contrast to the sample complexity in BPI. We further establish lower bound results to show the near-optimality of BEE-GPI and the necessity of the $1/(V^* - μ_0)^2$ term. Numerical experiments further validate the efficiency of our approach.
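The fixed-confidence requirement described in the abstract can be written out as follows. This is a sketch in notation assumed here rather than taken from the paper: $\tau$ denotes the (random) number of episodes explored before stopping, $\hat{\pi}$ the output, and $V^{\pi}$ the expected reward of policy $\pi$ in an episode, so that $V^* = \max_{\pi} V^{\pi}$.

```latex
% Assumed notation: \tau = number of episodes explored before stopping,
% \hat{\pi} = output (a policy or None), V^{\pi} = expected episode reward of \pi.
\begin{align*}
\text{Positive instance } (V^* \ge \mu_0):\quad
  & \Pr\bigl(\hat{\pi} \ne \text{None and } V^{\hat{\pi}} \ge \mu_0\bigr) \ge 1 - \delta, \\
\text{Negative instance } (V^* < \mu_0):\quad
  & \Pr\bigl(\hat{\pi} = \text{None}\bigr) \ge 1 - \delta, \\
\text{Objective:}\quad
  & \text{minimize } \mathbb{E}[\tau] \text{ subject to the two conditions above.}
\end{align*}
```

Under this reading, the learner may stop as soon as any single policy is certified to clear $\mu_0$, without ranking the remaining policies as BPI would require.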

Problem

Research questions and friction points this paper is trying to address.

Good Policy Identification

Pure Exploration

Reinforcement Learning

Bandit Feedback

Sample Complexity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Good Policy Identification

Pure Exploration

Sample Complexity