🤖 AI Summary
Deep reinforcement learning (DRL) suffers from low sample efficiency, training instability, and high sensitivity to hyperparameters. The authors' prior work addressed these challenges with Reliable Policy Iteration (RPI), a policy optimization framework that restores the monotonicity of value estimates, previously established only in tabular policy iteration, to the neural-network function approximation setting; its value update mechanism yields monotonically improving value estimates during policy evaluation without relying on target networks or strong regularization. This paper assesses the robustness of RPI's empirical performance on the CartPole and Inverted Pendulum benchmarks under perturbations to network architecture and environment dynamics. Relative to DQN, Double DQN, DDPG, TD3, and PPO, RPI converges earlier to near-optimal performance, maintains stable training throughout, and remains robust under these perturbations. These properties mitigate common DRL issues, including training oscillation, sample inefficiency, and tedious hyperparameter tuning.
📝 Abstract
In a recent work, we proposed Reliable Policy Iteration (RPI), which restores policy iteration's monotonicity-of-value-estimates property to the function approximation setting. Here, we assess the robustness of RPI's empirical performance on two classical control tasks -- CartPole and Inverted Pendulum -- under changes to neural-network and environment parameters. Relative to DQN, Double DQN, DDPG, TD3, and PPO, RPI reaches near-optimal performance early and sustains this policy as training proceeds. Because deep RL methods are often hampered by sample inefficiency, training instability, and hyperparameter sensitivity, our results highlight RPI's promise as a more reliable alternative.
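The monotonicity property RPI aims to restore is easiest to see in the tabular setting. A minimal sketch below, on a hypothetical 2-state MDP (not from the paper), runs classical policy iteration with exact policy evaluation and checks that each iteration's value estimate dominates the previous one, elementwise; this is the tabular guarantee that function approximation normally breaks.

```python
import numpy as np

# Illustrative deterministic MDP (hypothetical, not from the paper):
# P[s, a] = next state, R[s, a] = reward; 2 states, 2 actions.
P = np.array([[0, 1],
              [0, 1]])
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])
gamma = 0.9

def evaluate(policy):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = R_pi."""
    n = len(policy)
    P_pi = np.zeros((n, n))
    R_pi = np.zeros(n)
    for s, a in enumerate(policy):
        P_pi[s, P[s, a]] = 1.0
        R_pi[s] = R[s, a]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

def improve(V):
    """Greedy improvement w.r.t. the current value estimate."""
    Q = R + gamma * V[P]  # Q[s, a] = R[s, a] + gamma * V[next state]
    return Q.argmax(axis=1)

policy = np.zeros(2, dtype=int)  # start from the all-zeros policy
values = [evaluate(policy)]
for _ in range(5):
    policy = improve(values[-1])
    values.append(evaluate(policy))

# Monotonicity of value estimates: V_{k+1} >= V_k at every state.
for Vk, Vk1 in zip(values, values[1:]):
    assert np.all(Vk1 >= Vk - 1e-9)
print(values[-1])  # converged values of the greedy policy
```

With neural-network value functions this elementwise guarantee is lost, which is why the paper's restoration of it without target networks is the point of interest.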