Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs via Approximation by Discounted-Reward MDPs

📅 2024-05-23
📈 Citations: 2
✨ Influential: 0
🤖 AI Summary
This paper studies reinforcement learning for infinite-horizon average-reward linear Markov decision processes (MDPs), a setting in which the Bellman operator is not a contraction and the system may lack ergodicity. To avoid strong assumptions such as ergodicity while remaining computationally efficient, we propose an algorithmic framework that combines a discounted-reward approximation of the MDP with optimistic value iteration. Specifically, we construct a discounted surrogate MDP on which value iteration can be run efficiently, and we clip the span of the value function to control the dependence on the effective horizon. Our method achieves, for the first time, a $\widetilde{O}(\sqrt{T})$ regret bound for average-reward linear MDPs in polynomial time without relying on stringent dynamical assumptions (e.g., mixing or ergodicity). The result shows that a regret guarantee of this order does not require giving up computational efficiency.
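
To illustrate the discounted surrogate idea, the following minimal sketch computes the discounted optimal value $V^*_\gamma$ of a toy MDP and compares $(1-\gamma)V^*_\gamma$ with the optimal average reward. Everything in it is an assumption made for illustration: the two-state MDP, the names `P`, `R`, `OPTIMAL_GAIN`, `discounted_optimal_value`, and `gain_gap`, and the tabular known-model setting. It is not the paper's algorithm, which works with linear function approximation and unknown dynamics.

```python
import numpy as np

# Hypothetical 2-state, 2-action deterministic MDP (not from the paper).
# P[a, s, t] is the probability of moving from state s to state t under action a.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # action 0: stay in the current state
    [[0.0, 1.0], [1.0, 0.0]],  # action 1: switch to the other state
])
# R[s, a] is the reward for taking action a in state s, in [0, 1].
R = np.array([[0.5, 0.0],
              [1.0, 0.0]])
# Best long-run average reward: move to state 1 and stay there (reward 1 per step).
OPTIMAL_GAIN = 1.0


def discounted_optimal_value(P, R, gamma, tol=1e-9):
    """Plain value iteration for the discounted-reward MDP."""
    V = np.zeros(R.shape[0])
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_t P[a, s, t] * V[t]
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new


def gain_gap(gamma):
    """Largest gap over states between (1 - gamma) * V_gamma and the optimal gain."""
    V = discounted_optimal_value(P, R, gamma)
    return np.max(np.abs((1.0 - gamma) * V - OPTIMAL_GAIN))


gaps = {g: gain_gap(g) for g in (0.9, 0.99, 0.999)}
for g, gap in gaps.items():
    print(f"gamma={g}: gap={gap:.6f}")
```

In this toy example the gap shrinks in proportion to $1-\gamma$, while the number of value-iteration sweeps grows with the effective horizon $1/(1-\gamma)$. This is the trade-off that the choice of discount factor has to balance.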

📝 Abstract
We study infinite-horizon average-reward reinforcement learning with linear MDPs. Previous approaches either suffer from computational inefficiency or require strong assumptions on the dynamics, such as ergodicity, to achieve a regret bound of $\widetilde{O}(\sqrt{T})$. In this paper, we propose an algorithm that achieves a regret bound of $\widetilde{O}(\sqrt{T})$ and is computationally efficient in the sense that its time complexity is polynomial in the problem parameters. Our algorithm runs optimistic value iteration on a discounted-reward MDP that approximates the average-reward setting. With an appropriately tuned discount factor $\gamma$, the algorithm attains the desired $\widetilde{O}(\sqrt{T})$ regret. The challenge in our approximation approach is to obtain a regret bound with a sharp dependence on the effective horizon $1 / (1 - \gamma)$. We address this challenge by clipping the value function obtained at each value iteration step to limit its span.
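
The clipping step described in the abstract can be sketched as follows. This is a minimal tabular illustration under stated assumptions, not the paper's method: the toy MDP, the span bound `H`, and the helpers `clip_span` and `clipped_value_iteration` are hypothetical, the model is assumed known, and no optimism bonus or linear features are used. Clipping from above at `min(V) + H` is one natural way to bound the span; the paper's exact operator may differ.

```python
import numpy as np

# Hypothetical 2-state, 2-action deterministic MDP (not from the paper).
# P[a, s, t]: probability of moving from s to t under action a.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # action 0: stay
    [[0.0, 1.0], [1.0, 0.0]],  # action 1: switch
])
# R[s, a]: reward for taking action a in state s.
R = np.array([[0.5, 0.0],
              [1.0, 0.0]])


def clip_span(V, H):
    """Clip V from above so that max(V) - min(V) <= H."""
    return np.minimum(V, V.min() + H)


def clipped_value_iteration(P, R, gamma, H, n_iters):
    """Discounted value iteration with span clipping after every sweep."""
    V = np.zeros(R.shape[0])
    for _ in range(n_iters):
        # One Bellman backup: Q[s, a] = R[s, a] + gamma * sum_t P[a, s, t] * V[t]
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        # Keep the span of the new value estimate within H.
        V = clip_span(Q.max(axis=1), H)
    return V


V = clipped_value_iteration(P, R, gamma=0.99, H=0.5, n_iters=200)
print("span after clipping:", V.max() - V.min())
```

The individual values can still grow on the order of $1/(1-\gamma)$, but the span, the difference between the largest and smallest value, stays at most `H` after every sweep.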
Problem

Research questions and friction points this paper is trying to address.

Infinite-horizon average-reward reinforcement learning in linear MDPs.
Algorithm design is harder because the average-reward Bellman operator is not a contraction.
Achieving polynomial time complexity without strong assumptions on the dynamics, such as ergodicity.
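
The non-contraction issue listed above can be seen in a small numerical check. The sketch below is an illustration under assumptions: the toy MDP and the helpers `bellman` and `sup_dist` are hypothetical, and the undiscounted operator is represented by setting `gamma = 1.0` in the Bellman backup (the gain subtraction used in average-reward analyses is omitted, since subtracting a constant does not change distances).

```python
import numpy as np

# Hypothetical 2-state, 2-action deterministic MDP (not from the paper).
# P[a, s, t]: probability of moving from s to t under action a.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # action 0: stay
    [[0.0, 1.0], [1.0, 0.0]],  # action 1: switch
])
# R[s, a]: reward for taking action a in state s.
R = np.array([[0.5, 0.0],
              [1.0, 0.0]])


def bellman(V, gamma):
    """Bellman optimality backup with discount factor gamma."""
    return (R + gamma * np.einsum("ast,t->sa", P, V)).max(axis=1)


def sup_dist(x, y):
    """Sup-norm distance between two value vectors."""
    return np.max(np.abs(x - y))


V = np.array([0.0, 1.0])
W = V + 5.0  # the same values shifted by a constant

# gamma = 1: the shift passes through unchanged, so the distance does not shrink.
dist_undiscounted = sup_dist(bellman(V, 1.0), bellman(W, 1.0))
# gamma < 1: the distance shrinks by a factor of gamma.
dist_discounted = sup_dist(bellman(V, 0.9), bellman(W, 0.9))

print(sup_dist(V, W), dist_undiscounted, dist_discounted)
```

With discounting, repeated backups pull any two value vectors together at rate $\gamma$, which is why value iteration converges there. Without discounting that guarantee is lost, which motivates working with a discounted approximation.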
Innovation

Methods, ideas, or system contributions that make the work stand out.

Approximates average-reward MDPs via discounted MDPs
Uses optimistic value iteration for nonstationary policy planning
Introduces value-function span clipping to sharpen the regret bound's dependence on the effective horizon $1 / (1 - \gamma)$