Optimistic Q-learning for average reward and episodic reinforcement learning

📅 2024-07-18
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This paper addresses regret minimization in average-reward reinforcement learning, proposing a unified optimistic Q-learning framework applicable to both average-reward and episodic settings. Methodologically, it introduces the span-contraction average Bellman operator $\overline{L} v = \frac{1}{H} \sum_{h=1}^H L^h v$, coupled with span-norm convergence analysis, to bridge episodic and non-episodic regret analyses. Its key contribution is a significantly weaker structural assumption: existence of a recurrent state $s_0$ with expected first-hitting time at most $H$, relaxing conventional periodicity or mixing-time assumptions. Under this minimal assumption, the algorithm achieves a regret bound of $\tilde{O}(H^5 S \sqrt{AT})$ over $T$ steps, where $S$ and $A$ denote the numbers of states and actions, respectively, improving upon existing model-free average-reward RL algorithms. The framework unifies analysis techniques across reward settings while attaining tighter, assumption-light regret guarantees.

📝 Abstract
We present an optimistic Q-learning algorithm for regret minimization in average reward reinforcement learning under an additional assumption on the underlying MDP that for all policies, the expected time to visit some frequent state $s_0$ is finite and upper bounded by $H$. Our setting strictly generalizes the episodic setting and is significantly less restrictive than the assumption of bounded hitting time {\it for all states} made by most previous literature on model-free algorithms in average reward settings. We demonstrate a regret bound of $\tilde{O}(H^5 S\sqrt{AT})$, where $S$ and $A$ are the numbers of states and actions, and $T$ is the horizon. A key technical novelty of our work is to introduce an $\overline{L}$ operator defined as $\overline{L} v = \frac{1}{H} \sum_{h=1}^H L^h v$ where $L$ denotes the Bellman operator. We show that under the given assumption, the $\overline{L}$ operator has a strict contraction (in span) even in the average reward setting. Our algorithm design then uses ideas from episodic Q-learning to estimate and apply this operator iteratively. Therefore, we provide a unified view of regret minimization in episodic and non-episodic settings that may be of independent interest.
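The span-contraction property of the averaged operator can be checked numerically. The sketch below is an illustration under assumptions, not the paper's algorithm: it uses a small hypothetical 2-state, 2-action MDP (with strongly communicating transitions, so the hitting-time assumption holds trivially) and verifies empirically that $\overline{L} v = \frac{1}{H} \sum_{h=1}^H L^h v$ shrinks the span semi-norm of the difference of two value vectors, while the one-step Bellman operator $L$ is only guaranteed to be nonexpansive in span.

```python
import numpy as np

# Hypothetical toy MDP for illustration: 2 states, 2 actions.
# P[s, a] is the next-state distribution; r[s, a] is the reward.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0
    [[0.7, 0.3], [0.4, 0.6]],   # transitions from state 1
])
r = np.array([[1.0, 0.0],
              [0.5, 0.2]])

def L(v):
    """Bellman optimality operator: (Lv)(s) = max_a [ r(s,a) + P(.|s,a) . v ]."""
    return np.max(r + P @ v, axis=1)

def L_bar(v, H):
    """Averaged operator from the paper: L_bar v = (1/H) * sum_{h=1}^{H} L^h v."""
    total = np.zeros_like(v)
    cur = v.copy()
    for _ in range(H):
        cur = L(cur)
        total += cur
    return total / H

def span(v):
    """Span semi-norm: max(v) - min(v)."""
    return np.max(v) - np.min(v)

# Empirical check of strict span contraction on two arbitrary value vectors.
v1 = np.array([3.0, -1.0])
v2 = np.array([0.0, 2.0])
H = 5
before = span(v1 - v2)
after = span(L_bar(v1, H) - L_bar(v2, H))
print(after < before)  # strict contraction in span on this toy MDP
```

In this toy instance every action's transition rows overlap, so even a single application of $L$ contracts the span; the paper's point is that under only the recurrent-state assumption, it is the $H$-step average $\overline{L}$ that provably contracts.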
Problem

Research questions and friction points this paper is trying to address.

Minimize regret in average reward reinforcement learning
Generalize episodic setting with relaxed MDP assumptions
Introduce novel $\overline{L}$ operator for contraction analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimistic Q-learning for average reward
Novel $\overline{L}$ operator for contraction
Unified episodic and non-episodic regret minimization