Breaking the Computational Barrier: Provably Efficient Actor-Critic for Low-Rank MDPs

📅 2026-05-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a key limitation of sample-efficient reinforcement learning algorithms under function approximation: their reliance on computationally intractable planning or optimization oracles. Using supervised learning as a computational proxy, the authors establish a hierarchy of common RL oracles for low-rank MDPs and show that policy evaluation is the cheapest of them, provided supervised learning can be solved efficiently. They then propose an optimistic actor-critic algorithm that requires only a policy evaluation oracle. The algorithm improves on existing sample complexity guarantees for low-rank MDPs, extends to approximately low-rank MDPs, and is validated with experiments on several standard Gym environments.

📝 Abstract

Reinforcement learning (RL) is a fundamental framework for sequential decision-making, in which an agent learns an optimal policy through interactions with an unknown environment. In settings with function approximation, many existing RL algorithms achieve favorable sample complexity, but often rely on computationally intractable oracles. In this paper, we use supervised learning as a computational proxy to establish a clear hierarchy of commonly adopted RL oracles under low-rank Markov Decision Processes (MDPs). This hierarchy shows that policy evaluation is the most computationally efficient oracle, provided that supervised learning can be solved efficiently. Motivated by this observation, we propose a novel optimistic actor-critic algorithm that relies solely on the policy evaluation oracle. We prove that our algorithm improves on existing sample complexity guarantees for low-rank MDPs while avoiding the computationally expensive planning or optimization oracles commonly assumed in prior work. We further extend our theoretical results to approximately low-rank MDPs and demonstrate that this setting captures a broad class of real-world environments. Finally, we validate our theoretical results with experiments on several standard Gym environments.
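
To make the idea of policy evaluation as a supervised learning problem concrete, here is a minimal sketch. It is not the paper's algorithm: it has no optimism bonus and no low-rank representation learning, and it assumes a tiny tabular, finite-horizon MDP where least-squares regression reduces to a per-(state, action) average. All names (fit_mean_regressor, evaluate_policy, toy_dataset, always_one) are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def fit_mean_regressor(inputs, targets):
    """Least squares over one-hot (tabular) features: the mean target per input."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in zip(inputs, targets):
        sums[x] += y
        counts[x] += 1
    return {x: sums[x] / counts[x] for x in sums}


def evaluate_policy(dataset, policy, horizon):
    """Estimate Q_h of a fixed policy by backward regression.

    dataset[h] is a list of (state, action, reward, next_state) tuples
    observed at step h; policy(h, state) returns the action taken at step h.
    """
    q = [dict() for _ in range(horizon + 1)]  # q[horizon] stays empty (value 0)
    for h in reversed(range(horizon)):
        inputs, targets = [], []
        for s, a, r, s_next in dataset[h]:
            # Regression target: reward plus the policy's value at the next step.
            next_value = 0.0
            if h + 1 < horizon:
                next_action = policy(h + 1, s_next)
                next_value = q[h + 1].get((s_next, next_action), 0.0)
            inputs.append((s, a))
            targets.append(r + next_value)
        q[h] = fit_mean_regressor(inputs, targets)
    return q[:horizon]


# Assumed toy MDP: taking action a moves to state a; reward 1 only for (s=1, a=1).
toy_dataset = [
    [(s, a, 1.0 if (s == 1 and a == 1) else 0.0, a) for s in (0, 1) for a in (0, 1)]
    for _ in range(2)
]


def always_one(h, state):
    """Hypothetical policy that always picks action 1."""
    return 1


q_hat = evaluate_policy(toy_dataset, always_one, horizon=2)
print(q_hat[0][(1, 1)])  # 2.0
```

Each backward step is an ordinary regression from (state, action) pairs to scalar targets, which is the sense in which a policy evaluation oracle can be reduced to supervised learning. With function approximation, the per-cell average would be replaced by whatever regressor fits the chosen function class.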

Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning

Computational Efficiency

Actor-Critic

Low-Rank MDPs

Oracle Complexity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-Rank MDPs

Actor-Critic

Policy Evaluation