Actor-Critics Can Achieve Optimal Sample Efficiency

📅 2025-05-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses two open challenges in actor-critic algorithms with function approximation: low sample efficiency under strategic exploration, and the lack of theoretical guarantees for offline data utilization. It proposes the first actor-critic framework that integrates optimism, off-policy estimation of the optimal Q-function, and rare-switching policy resets. The algorithm achieves the optimal trajectory complexity $O(1/\varepsilon^2)$ under active exploration, with a sample complexity bound of $O(dH^5 \log|\mathcal{A}|/\varepsilon^2 + dH^4 \log|\mathcal{F}|/\varepsilon^2)$ and $\sqrt{T}$ regret. Moreover, initializing the critic with offline data eliminates the need for optimism: convergence is guaranteed whenever $N_{\text{off}} \geq c \cdot dH^4/\varepsilon^2$, where $c$ involves only the single-policy concentrability coefficient. This yields the first theoretically grounded, non-optimistic, and computationally efficient actor-critic algorithm for hybrid RL. The analysis is formalized via the Bellman eluder dimension.
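To make the three ingredients above concrete, the following is a minimal, runnable toy sketch, not the authors' implementation: it is tabular rather than using general function approximation, and all environment details, constants, and variable names are hypothetical. It combines an optimism bonus on the critic, off-policy estimation of the optimal Q-function from all data collected so far, and rare-switching policy resets with an incremental softmax actor update (the policy is reset only when the dataset grows by a constant factor, i.e. roughly logarithmically often).

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 5, 3, 4                                   # toy sizes: states, actions, horizon
P_true = rng.dirichlet(np.ones(S), size=(H, S, A))  # true dynamics P_h(s' | s, a)
R_true = rng.uniform(size=(H, S, A))                # true rewards r_h(s, a)

counts = np.zeros((H, S, A))          # visit counts N_h(s, a)
p_sum = np.zeros((H, S, A, S))        # empirical transition counts
r_sum = np.zeros((H, S, A))           # empirical reward sums
q_hat = np.zeros((H + 1, S, A))       # critic estimate of the optimal Q-function
logits = np.zeros((H, S, A))          # softmax actor parameters
eta, reset_factor = 0.5, 2.0          # actor step size, rare-switching factor
transitions_at_last_reset, total_transitions = 1, 0

def policy_probs(logits):
    z = logits - logits.max(axis=2, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=2, keepdims=True)

for episode in range(2000):
    # Off-policy critic: estimate the optimal Q-function from *all* data
    # collected so far (backward induction on the empirical model), with an
    # optimism bonus that shrinks as 1 / sqrt(N_h(s, a)).
    for h in reversed(range(H)):
        n = np.maximum(counts[h], 1)
        p_hat = p_sum[h] / n[:, :, None]
        r_hat = r_sum[h] / n
        bonus = H / np.sqrt(n)
        v_next = q_hat[h + 1].max(axis=1)
        q_hat[h] = np.clip(r_hat + p_hat @ v_next + bonus, 0.0, H)

    # Rare-switching reset: restart the actor only once the dataset has grown
    # by a constant factor, so the policy is reset only O(log T) times overall.
    if total_transitions >= reset_factor * transitions_at_last_reset:
        logits[:] = 0.0
        transitions_at_last_reset = total_transitions

    # Actor: one mirror-ascent (softmax) step on the optimistic critic,
    # rather than a greedy argmax policy.
    logits += eta * q_hat[:H]

    # Roll out one H-step trajectory with the current policy and log the data.
    probs = policy_probs(logits)
    s = 0
    for h in range(H):
        a = rng.choice(A, p=probs[h, s])
        s_next = rng.choice(S, p=P_true[h, s, a])
        counts[h, s, a] += 1
        p_sum[h, s, a, s_next] += 1
        r_sum[h, s, a] += R_true[h, s, a]
        total_transitions += 1
        s = s_next
```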

📝 Abstract
Actor-critic algorithms have become a cornerstone in reinforcement learning (RL), leveraging the strengths of both policy-based and value-based methods. Despite recent progress in understanding their statistical efficiency, no existing work has successfully learned an $\epsilon$-optimal policy with a sample complexity of $O(1/\epsilon^2)$ trajectories with general function approximation when strategic exploration is necessary. We address this open problem by introducing a novel actor-critic algorithm that attains a sample complexity of $O(dH^5 \log|\mathcal{A}|/\epsilon^2 + dH^4 \log|\mathcal{F}|/\epsilon^2)$ trajectories, and accompanying $\sqrt{T}$ regret when the Bellman eluder dimension $d$ does not increase with $T$ at more than a $\log T$ rate. Here, $\mathcal{F}$ is the critic function class, $\mathcal{A}$ is the action space, and $H$ is the horizon in the finite-horizon MDP setting. Our algorithm integrates optimism, off-policy critic estimation targeting the optimal Q-function, and rare-switching policy resets. We extend this to the setting of hybrid RL, showing that initializing the critic with offline data yields sample efficiency gains compared to purely offline or online RL. Further, utilizing access to offline data, we provide a \textit{non-optimistic} provably efficient actor-critic algorithm that only additionally requires $N_{\text{off}} \geq c_{\text{off}}^* dH^4/\epsilon^2$ in exchange for omitting optimism, where $c_{\text{off}}^*$ is the single-policy concentrability coefficient and $N_{\text{off}}$ is the number of offline samples. This addresses another open problem in the literature. We further provide numerical experiments to support our theoretical findings.
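To give a sense of scale for the offline-data condition $N_{\text{off}} \geq c_{\text{off}}^* dH^4/\epsilon^2$ stated above, here is a hedged back-of-the-envelope calculation. The formula follows the abstract (up to constants and log factors); the plugged-in values are purely hypothetical and not taken from the paper.

```python
# Illustrative reading of the offline-sample condition N_off >= c_off^* * d * H^4 / eps^2.
# All numbers below are hypothetical and only show how the requirement scales.

def required_offline_samples(c_off: float, d: float, H: int, eps: float) -> float:
    """Offline samples sufficient to drop optimism, per the stated bound (up to constants)."""
    return c_off * d * H**4 / eps**2

# e.g. single-policy concentrability 2, Bellman eluder dimension 10,
# horizon 20, target accuracy eps = 0.1:
print(f"{required_offline_samples(2, 10, 20, 0.1):.2e}")   # ~3.20e+08 samples

# Halving eps quadruples the requirement; doubling H multiplies it by 16.
print(f"{required_offline_samples(2, 10, 20, 0.05):.2e}")  # ~1.28e+09
print(f"{required_offline_samples(2, 10, 40, 0.1):.2e}")   # ~5.12e+09
```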
Problem

Research questions and friction points this paper is trying to address.

Achieving optimal sample efficiency in actor-critic RL algorithms that must explore strategically
Obtaining $O(1/\epsilon^2)$ trajectory complexity under general function approximation
Integrating offline data to obtain provable efficiency gains in hybrid RL
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel actor-critic algorithm with optimal sample complexity
Integrates optimism, off-policy critic estimation targeting the optimal Q-function, and rare-switching policy resets
Hybrid RL variant: offline critic initialization yields efficiency gains and removes the need for optimism
Kevin Tan
Department of Statistics and Data Science, The Wharton School, University of Pennsylvania
Wei Fan
Department of Statistics and Data Science, The Wharton School, University of Pennsylvania
Yuting Wei
Department of Statistics and Data Science, The Wharton School, University of Pennsylvania
High-dimensional statistics, nonparametric statistics, reinforcement learning, diffusion models