🤖 AI Summary
Policy gradient (PG) methods suffer from low sample efficiency and slow convergence in continuous control due to their reliance on freshly sampled on-policy trajectories. This work provides the first rigorous theoretical evidence that reusing past off-policy trajectories can achieve the best-known convergence rate of Õ(ε⁻¹). To this end, we introduce a power-mean correction to the multiple importance weighting estimator and design the Retrospective Policy Gradient (RPG) algorithm, which combines old and new trajectories for policy updates. Theoretically, RPG attains state-of-the-art sample complexity for PG methods. Empirically, it outperforms existing PG algorithms with state-of-the-art rates on benchmark continuous-control tasks. Our core contributions are: (i) establishing the first rigorous convergence guarantee for trajectory reuse in PG; and (ii) introducing a bias-correction mechanism that is simultaneously theoretically sound and practically effective.
📝 Abstract
Policy gradient (PG) methods are a class of effective reinforcement learning algorithms, particularly when dealing with continuous control problems. These methods learn the parameters of parametric policies via stochastic gradient ascent, typically using on-policy trajectory data to estimate the policy gradient. However, such reliance on fresh data makes them sample-inefficient. Indeed, vanilla PG methods require $O(\epsilon^{-2})$ trajectories to reach an $\epsilon$-approximate stationary point. A common strategy to improve efficiency is to reuse off-policy information from past iterations, such as previous gradients or trajectories. While gradient reuse has received substantial theoretical attention, leading to improved rates of $O(\epsilon^{-3/2})$, the reuse of past trajectories remains largely unexplored from a theoretical perspective. In this work, we provide the first rigorous theoretical evidence that extensive reuse of past off-policy trajectories can significantly accelerate convergence in PG methods. We introduce a power mean correction to the multiple importance weighting estimator and propose RPG (Retrospective Policy Gradient), a PG algorithm that combines old and new trajectories for policy updates. Through a novel analysis, we show that, under established assumptions, RPG achieves a sample complexity of $\widetilde{O}(\epsilon^{-1})$, the best known rate in the literature. We further validate our approach empirically against PG methods with state-of-the-art rates.
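The abstract does not spell out the estimator's construction, so the following is only an illustrative sketch of the general idea: a multiple importance weighting (balance-heuristic) estimator in which the arithmetic-mean mixture denominator is replaced by a power mean over the behavior densities. All function names, the exponent `s`, and the toy Gaussian setup are assumptions for illustration, not the paper's actual estimator or setting. With `s = 1` the power mean reduces to the arithmetic mean and the standard unbiased balance heuristic is recovered; `s > 1` inflates the denominator, bounding the importance weights at the cost of a controlled bias.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def power_mean(vals, s, axis=0):
    """Generalized (power) mean with exponent s; s=1 is the arithmetic mean,
    larger s moves the mean toward the largest entry."""
    return np.mean(vals ** s, axis=axis) ** (1.0 / s)

def pm_mis_estimate(samples, f_vals, behavior_pdfs, target_pdf, s=1.0):
    """Estimate E_target[f] from samples pooled across several behavior
    distributions, using importance weights whose mixture denominator is
    a power mean of the behavior densities (hypothetical construction)."""
    dens = np.stack([pdf(samples) for pdf in behavior_pdfs])  # shape (K, N)
    denom = power_mean(dens, s, axis=0)                       # shape (N,)
    weights = target_pdf(samples) / denom
    return np.mean(weights * f_vals)

# Toy check: two Gaussian "behavior policies", one Gaussian "target".
rng = np.random.default_rng(0)
n_per = 100_000
xs = np.concatenate([rng.normal(0.0, 1.0, n_per),   # samples from N(0, 1)
                     rng.normal(1.0, 1.0, n_per)])  # samples from N(1, 1)
behavior = [lambda x: normal_pdf(x, 0.0, 1.0),
            lambda x: normal_pdf(x, 1.0, 1.0)]
target = lambda x: normal_pdf(x, 0.5, 1.0)

# s=1: standard balance heuristic, unbiased for E_target[x] = 0.5.
est_unbiased = pm_mis_estimate(xs, xs, behavior, target, s=1.0)
# s=2: power-mean denominator bounds the weights (introduces some bias).
est_corrected = pm_mis_estimate(xs, xs, behavior, target, s=2.0)
```

In a PG context the densities above would be trajectory likelihoods under old and current policies rather than Gaussian pdfs, but the weighting mechanics are the same.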