🤖 AI Summary
Addressing two open challenges in actor-critic algorithms with general function approximation—low sample efficiency under strategic exploration and a lack of theoretical guarantees for offline data utilization—this paper proposes the first actor-critic framework integrating optimism, off-policy estimation of the optimal Q-function, and rare-switching policy resets. It achieves the optimal trajectory complexity $O(1/\varepsilon^2)$ under active exploration, with a sample complexity bound of $O(dH^5 \log|\mathcal{A}|/\varepsilon^2 + dH^4 \log|\mathcal{F}|/\varepsilon^2)$ and $\sqrt{T}$ regret. Moreover, initializing the critic with offline data eliminates the need for optimism: convergence is guaranteed when $N_{\text{off}} \geq c_{\text{off}}^* \cdot dH^4/\varepsilon^2$, where $c_{\text{off}}^*$ is the single-policy concentrability coefficient. Crucially, the method requires only single-policy concentrability, yielding the first theoretically grounded, non-optimistic, and computationally efficient algorithm for hybrid RL. The analysis is rigorously formalized via the Bellman eluder dimension.
📝 Abstract
Actor-critic algorithms have become a cornerstone of reinforcement learning (RL), leveraging the strengths of both policy-based and value-based methods. Despite recent progress in understanding their statistical efficiency, no existing work has successfully learned an $\epsilon$-optimal policy with a sample complexity of $O(1/\epsilon^2)$ trajectories under general function approximation when strategic exploration is necessary. We address this open problem by introducing a novel actor-critic algorithm that attains a sample complexity of $O(dH^5 \log|\mathcal{A}|/\epsilon^2 + dH^4 \log|\mathcal{F}|/\epsilon^2)$ trajectories, with accompanying $\sqrt{T}$ regret when the Bellman eluder dimension $d$ grows with $T$ at no more than a $\log T$ rate. Here, $\mathcal{F}$ is the critic function class, $\mathcal{A}$ is the action space, and $H$ is the horizon in the finite-horizon MDP setting. Our algorithm integrates optimism, off-policy critic estimation targeting the optimal Q-function, and rare-switching policy resets. We extend this approach to the hybrid RL setting, showing that initializing the critic with offline data yields sample-efficiency gains over purely offline or purely online RL. Moreover, utilizing access to offline data, we provide a *non-optimistic* provably efficient actor-critic algorithm that, in exchange for omitting optimism, additionally requires only $N_{\text{off}} \geq c_{\text{off}}^* dH^4/\epsilon^2$, where $c_{\text{off}}^*$ is the single-policy concentrability coefficient and $N_{\text{off}}$ is the number of offline samples. This addresses another open problem in the literature. Finally, we provide numerical experiments that support our theoretical findings.
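The three ingredients named in the abstract—an optimistic off-policy critic regressing toward the optimal Q-function, a softmax actor, and rare-switching policy resets—can be sketched on a toy finite-horizon tabular MDP. Everything below (the environment, the count-based bonus, the doubling-trick switching rule, and the step sizes) is an illustrative assumption for the demo, not the paper's actual construction or function classes:

```python
import numpy as np

rng = np.random.default_rng(0)
H, S, A = 3, 2, 2  # horizon, states, actions (toy sizes, assumed)

# Toy deterministic MDP (assumed): in state 0, action 1 yields reward 1
# and stays in state 0; anything else moves to the zero-reward state 1.
def step(s, a):
    if s == 0 and a == 1:
        return 0, 1.0
    return 1, 0.0

N = np.zeros((H, S, A))              # visit counts
Qhat = np.full((H, S, A), float(H))  # optimistic critic, init at max return
Qbar = Qhat.copy()                   # frozen critic the actor sees
logits = np.zeros((H, S, A))         # actor parameters (softmax policy)
last_switch = np.ones((H, S, A))     # visit counts at the last policy reset
eta, bonus_c = 0.5, 1.0              # actor step size, bonus scale (assumed)

def policy(h, s):
    z = np.exp(logits[h, s] - logits[h, s].max())
    return z / z.sum()

for ep in range(300):
    # Roll out one trajectory with the current actor.
    s, traj = 0, []
    for h in range(H):
        a = rng.choice(A, p=policy(h, s))
        s2, r = step(s, a)
        traj.append((h, s, a, r, s2))
        s = s2

    # Off-policy optimistic critic: regress toward the *optimal* Q-function
    # (max over next-step actions) plus a count-based exploration bonus.
    for h, s_, a, r, s2 in traj:
        N[h, s_, a] += 1
        v_next = 0.0 if h == H - 1 else Qhat[h + 1, s2].max()
        target = r + v_next + bonus_c / np.sqrt(N[h, s_, a])
        alpha = 1.0 / N[h, s_, a]
        Qhat[h, s_, a] = min(float(H), (1 - alpha) * Qhat[h, s_, a] + alpha * target)

    # Rare switching: refresh the actor's critic and reset the policy only
    # when some visit count has doubled, so resets happen rarely.
    if np.any(N >= 2 * last_switch):
        Qbar = Qhat.copy()
        logits[:] = 0.0
        last_switch = np.maximum(N, 1.0)

    # Actor: exponential-weights (mirror ascent) step on the optimistic Q.
    logits += eta * Qbar

print(policy(0, 0))  # distribution over actions at h=0, state 0
```

The doubling-trick reset stands in for the paper's rare-switching condition: the actor only ever sees a critic frozen at one of the $O(\log T)$ switch times, which is what keeps the number of policy changes small.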