Optimistically Optimistic Exploration for Provably Efficient Infinite-Horizon Reinforcement and Imitation Learning

📅 2025-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies reinforcement learning and imitation learning in infinite-horizon discounted linear Markov decision processes (MDPs). It proposes the first algorithm for this setting that is both computationally efficient and achieves near-optimal regret. The method integrates two optimistic exploration mechanisms—additive reward bonuses and an artificial absorbing state—within a regularized approximate dynamic-programming framework with linear function approximation, enabling robust learning against adversarial reward sequences. Theoretically, the paper establishes a regret upper bound of $\tilde{\mathcal{O}}(\sqrt{d^3 (1-\gamma)^{-7/2} T})$, which significantly improves upon prior results. Moreover, in imitation learning, the approach attains state-of-the-art performance. Overall, this work introduces a novel paradigm for balancing exploration and exploitation and achieving robust policy learning in the linear MDP setting.

📝 Abstract
We study the problem of reinforcement learning in infinite-horizon discounted linear Markov decision processes (MDPs), and propose the first computationally efficient algorithm achieving near-optimal regret guarantees in this setting. Our main idea is to combine two classic techniques for optimistic exploration: additive exploration bonuses applied to the reward function, and artificial transitions made to an absorbing state with maximal return. We show that, combined with a regularized approximate dynamic-programming scheme, the resulting algorithm achieves a regret of order $\tilde{\mathcal{O}}(\sqrt{d^3 (1-\gamma)^{-7/2} T})$, where $T$ is the total number of sample transitions, $\gamma \in (0,1)$ is the discount factor, and $d$ is the feature dimensionality. The results continue to hold against adversarial reward sequences, enabling application of our method to the problem of imitation learning in linear MDPs, where we achieve state-of-the-art results.
Problem

Research questions and friction points this paper is trying to address.

Computationally efficient learning in infinite-horizon discounted linear MDPs
Combining optimistic exploration techniques with provable guarantees
Achieving near-optimal regret, including against adversarial rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimistic exploration bonuses
Artificial absorbing state transitions
Regularized approximate dynamic-programming scheme
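The mechanisms listed above can be illustrated with a minimal sketch of optimistic least-squares value iteration in a linear MDP: a ridge regression over observed transition features, an elliptical exploration bonus added to the value estimate, and clipping at the maximal return $1/(1-\gamma)$, which stands in for the artificial absorbing state. This is a hypothetical simplification, not the paper's algorithm: the function name, the single-action feature setup, and all hyperparameters (`beta`, `lam`, `n_iters`) are illustrative assumptions.

```python
import numpy as np

def optimistic_lsvi(phi, rewards, next_phi, gamma=0.9, beta=1.0,
                    lam=1.0, n_iters=50):
    """Hypothetical sketch: bonus-driven optimistic value iteration
    with linear function approximation and value clipping.

    phi      : (T, d) features of observed state-action pairs
    next_phi : (T, d) features at the next states (here: a single
               fixed action per state, a simplifying assumption)
    """
    T, d = phi.shape
    v_max = 1.0 / (1.0 - gamma)  # maximal discounted return

    # Ridge-regularized feature covariance from observed data.
    Lambda = lam * np.eye(d) + phi.T @ phi
    Lambda_inv = np.linalg.inv(Lambda)

    V = np.zeros(T)  # value estimates at the sampled next states
    for _ in range(n_iters):
        # Regression target: reward plus discounted next-state value.
        y = rewards + gamma * V
        w = Lambda_inv @ phi.T @ y

        # Elliptical bonus beta * ||phi(s')||_{Lambda^{-1}} makes the
        # estimate optimistic in poorly explored directions.
        bonus = beta * np.sqrt(
            np.einsum('td,de,te->t', next_phi, Lambda_inv, next_phi))

        # Clip at v_max: the role played by the absorbing state with
        # maximal return in the paper's construction.
        V = np.clip(next_phi @ w + bonus, 0.0, v_max)
    return w, V
```

The clipping step is what keeps the optimistic values from diverging: without the cap at `v_max`, the bonus would compound through the discounted backup.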