🤖 AI Summary
This work addresses the challenge of reward-free exploration, aiming to efficiently learn near-optimal policies for any subsequent reward function—either fully arbitrary (reward-free) or drawn from a small finite class (reward-agnostic)—without access to reward signals during exploration. To this end, the authors propose a novel algorithm that combines online learning with a carefully designed intrinsic reward mechanism to collect data and accurately estimate the environment dynamics in the absence of external rewards. Once a reward function is revealed, the algorithm quickly computes an $ε$-optimal policy. Theoretically, the work significantly relaxes the constraint on $ε$ in the reward-agnostic setting while achieving minimax-optimal sample complexity. Moreover, it establishes the first matching upper and lower bounds for reward-free exploration, thereby closing a long-standing gap in the theoretical understanding of this setting.
📝 Abstract
We study reward-free and reward-agnostic exploration in episodic finite-horizon Markov decision processes (MDPs), where an agent explores an unknown environment without observing external rewards. Reward-free exploration aims to enable computing $ε$-optimal policies for any reward function revealed after exploration, while reward-agnostic exploration targets $ε$-optimality for rewards drawn from a small finite class. In the reward-agnostic setting, Li, Yan, Chen, and Fan achieve minimax-optimal sample complexity, but only for a restrictively small accuracy parameter $ε$. We propose a new algorithm that significantly relaxes the requirement on $ε$. Our approach is novel and of independent technical interest. Our algorithm employs an online learning procedure with carefully designed rewards to construct an exploration policy, which is used to gather data sufficient for accurate dynamics estimation and the subsequent computation of an $ε$-optimal policy once the reward is revealed. Finally, we establish a tight lower bound for reward-free exploration, closing the gap between known upper and lower bounds.
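To make the explore-then-plan pipeline concrete, here is a minimal sketch in a tabular finite-horizon MDP: collect reward-free data with an exploration policy, build an empirical transition kernel, and, once a reward is revealed, compute a policy by backward-induction value iteration on the estimated model. All function names and the uniform-random exploration policy are illustrative assumptions; the paper's actual algorithm constructs a far more careful exploration policy via online learning with intrinsic rewards.

```python
import numpy as np

def explore(env_step, env_reset, S, A, H, n_episodes, rng):
    """Collect transition counts without observing any reward.

    NOTE: uniform-random actions are a placeholder; the paper's
    algorithm designs the exploration policy via online learning.
    """
    counts = np.zeros((H, S, A, S))
    for _ in range(n_episodes):
        s = env_reset()
        for h in range(H):
            a = int(rng.integers(A))
            s_next = env_step(h, s, a)
            counts[h, s, a, s_next] += 1
            s = s_next
    return counts

def estimate_dynamics(counts):
    """Empirical transition kernel P_hat[h, s, a, s'] from visit counts;
    unvisited (h, s, a) triples fall back to the uniform distribution."""
    _, S, _, _ = counts.shape[:1] + counts.shape[1:]
    n = counts.sum(axis=-1, keepdims=True)
    return np.where(n > 0, counts / np.maximum(n, 1), 1.0 / S)

def plan(P_hat, r):
    """Once the reward r[h, s, a] is revealed, run backward-induction
    value iteration on the estimated MDP and return a greedy policy."""
    H, S, A, _ = P_hat.shape
    V = np.zeros(S)                      # value at horizon H is zero
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = r[h] + P_hat[h] @ V          # shape (S, A)
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi, V
```

With enough exploration episodes, the empirical model is accurate on all reachable state-action pairs, so the planned policy is near-optimal for whatever reward is later revealed; quantifying exactly how many episodes suffice, uniformly over rewards, is the sample-complexity question the paper resolves.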