AI Summary
This paper studies reinforcement learning for infinite-horizon average-reward linear Markov decision processes (MDPs), addressing challenging settings where the Bellman operator is non-contractive and the system may lack ergodicity. To circumvent strong assumptions such as ergodicity while ensuring computational efficiency, we propose a novel algorithmic framework combining discounted approximation MDPs with optimistic value iteration. Specifically, we construct a discounted surrogate MDP to enable efficient optimization and introduce a value-function span clipping mechanism to explicitly control effective horizon dependence. Our method achieves, for the first time, an $\widetilde{O}(\sqrt{T})$ regret bound for average-reward linear MDPs in polynomial time, breaking prior reliance on stringent dynamical assumptions (e.g., mixing or ergodicity). This establishes a critical balance between theoretical rigor and algorithmic practicality.
Abstract
We study infinite-horizon average-reward reinforcement learning with linear MDPs. Previous approaches either suffer from computational inefficiency or require strong assumptions on the dynamics, such as ergodicity, to achieve a regret bound of $\widetilde{O}(\sqrt{T})$. In this paper, we propose an algorithm that achieves a regret bound of $\widetilde{O}(\sqrt{T})$ and is computationally efficient in the sense that its time complexity is polynomial in the problem parameters. Our algorithm runs optimistic value iteration on a discounted-reward MDP that approximates the average-reward setting. With an appropriately tuned discount factor $\gamma$, the algorithm attains the desired $\widetilde{O}(\sqrt{T})$ regret. The challenge in our approximation approach is to obtain a regret bound with a sharp dependence on the effective horizon $1 / (1 - \gamma)$. We address this challenge by clipping the value function obtained at each value iteration step to limit its span.
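The span-clipping idea can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a small tabular MDP with known transitions, and it omits both the optimistic exploration bonus and the linear function approximation. The names `clip_span`, `clipped_discounted_value_iteration`, and `max_span` are hypothetical, and clipping from above relative to the minimum value is one plausible reading of "limit the span".

```python
import numpy as np


def clip_span(v, max_span):
    """Clip v from above so that max(v) - min(v) <= max_span."""
    return np.minimum(v, v.min() + max_span)


def clipped_discounted_value_iteration(P, r, gamma, max_span, n_iters=200):
    """Discounted value iteration with span clipping after every backup.

    P: (S, A, S) array of transition probabilities (assumed known here).
    r: (S, A) array of rewards.
    Returns the clipped value function and a greedy policy.
    """
    n_states = P.shape[0]
    v = np.zeros(n_states)
    q = np.zeros_like(r, dtype=float)
    for _ in range(n_iters):
        # One discounted Bellman backup: Q(s, a) = r(s, a) + gamma * E[V(s')].
        q = r + gamma * (P @ v)
        # Greedy maximization over actions, then limit the span of V.
        v = clip_span(q.max(axis=1), max_span)
    return v, q.argmax(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A = 5, 3
    P = rng.random((S, A, S))
    P /= P.sum(axis=2, keepdims=True)  # normalize rows into distributions
    r = rng.random((S, A))
    v, policy = clipped_discounted_value_iteration(P, r, gamma=0.99, max_span=2.0)
    print("span of V:", v.max() - v.min())
    print("greedy policy:", policy)
```

Without clipping, the values of a discounted MDP with rewards in $[0, 1]$ can be as large as $1 / (1 - \gamma)$, whereas the clipped iterates in this sketch always have span at most `max_span`, which is the quantity the abstract says is being controlled.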