🤖 AI Summary
This paper addresses regret minimization in average-reward reinforcement learning, proposing a unified optimistic Q-learning framework applicable to both average-reward and episodic settings. Methodologically, it introduces the span-contractive averaged Bellman operator $\overline{L} v = \frac{1}{H} \sum_{h=1}^H L^h v$, coupled with a span-norm convergence analysis, to bridge episodic and non-episodic regret analyses. Its key contribution is a significantly weaker structural assumption: the existence of a recurrent state $s_0$ with expected first-hitting time at most $H$, relaxing the conventional periodicity or mixing-time assumptions. Under this minimal assumption, the algorithm achieves a regret bound of $\tilde{O}(H^5 S \sqrt{AT})$ over $T$ steps, where $S$ and $A$ denote the numbers of states and actions, respectively—improving upon existing model-free average-reward RL algorithms. The framework unifies analysis techniques across reward settings while attaining tighter regret guarantees under lighter assumptions.
📝 Abstract
We present an optimistic Q-learning algorithm for regret minimization in average reward reinforcement learning under an additional assumption on the underlying MDP: for all policies, the expected time to visit some frequent state $s_0$ is finite and upper bounded by $H$. Our setting strictly generalizes the episodic setting and is significantly less restrictive than the assumption of bounded hitting time *for all states* made by most previous literature on model-free algorithms in average reward settings. We demonstrate a regret bound of $\tilde{O}(H^5 S\sqrt{AT})$, where $S$ and $A$ are the numbers of states and actions, and $T$ is the horizon. A key technical novelty of our work is the introduction of an $\overline{L}$ operator defined as $\overline{L} v = \frac{1}{H} \sum_{h=1}^H L^h v$, where $L$ denotes the Bellman operator. We show that under the given assumption, the $\overline{L}$ operator is a strict contraction (in span) even in the average reward setting. Our algorithm design then uses ideas from episodic Q-learning to estimate and apply this operator iteratively. Therefore, we provide a unified view of regret minimization in episodic and non-episodic settings that may be of independent interest.
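To make the central object concrete, here is a minimal numerical sketch of the averaged operator $\overline{L} v = \frac{1}{H} \sum_{h=1}^H L^h v$ on a small arbitrary MDP. This is purely illustrative (the MDP, the value of $H$, and all names are hypothetical, not from the paper), and it only verifies the span *non-expansion* of $\overline{L}$, which holds for any MDP; the paper's strict span contraction additionally requires the recurrent-state assumption.

```python
import numpy as np

# Hypothetical toy MDP (not from the paper): S states, A actions.
rng = np.random.default_rng(0)
S, A, H = 4, 2, 5
P = rng.dirichlet(np.ones(S), size=(A, S))  # P[a, s, s'] = transition probabilities
r = rng.uniform(size=(S, A))                # r[s, a]   = rewards

def L(v):
    """Bellman optimality operator: (Lv)(s) = max_a [ r(s,a) + sum_s' P(s'|s,a) v(s') ]."""
    q = r + np.stack([P[a] @ v for a in range(A)], axis=1)  # Q-values, shape (S, A)
    return q.max(axis=1)

def Lbar(v):
    """Averaged operator: Lbar v = (1/H) * sum_{h=1}^{H} L^h v."""
    total, cur = np.zeros(S), v.copy()
    for _ in range(H):
        cur = L(cur)       # apply L once more: cur = L^h v
        total += cur
    return total / H

def span(v):
    """Span seminorm: sp(v) = max(v) - min(v)."""
    return v.max() - v.min()

# L (and hence Lbar) never expands the span distance between two value vectors;
# under the paper's assumption, Lbar shrinks it strictly.
v1, v2 = rng.uniform(size=S), rng.uniform(size=S)
assert span(Lbar(v1) - Lbar(v2)) <= span(v1 - v2) + 1e-12
```

The non-expansion check follows from the standard fact that $(Lv - Lu)(s)$ is bounded between $\min_s (v-u)(s)$ and $\max_s (v-u)(s)$, and averaging over $h = 1, \dots, H$ preserves this bound.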