Reliable Policy Iteration: Performance Robustness Across Architecture and Environment Perturbations

📅 2025-12-12
🤖 AI Summary
Deep reinforcement learning (DRL) suffers from low sample efficiency, training instability, and high sensitivity to hyperparameters. To address these challenges, this paper proposes Reliable Policy Iteration (RPI), a policy optimization framework that restores the monotonicity of value estimates, previously guaranteed only in tabular policy iteration, to the neural network function approximation setting. Crucially, RPI achieves this without target networks or strong regularization constraints; instead, it introduces a reliable value update mechanism that keeps value estimates monotonically convergent during policy evaluation. Empirically, RPI significantly outperforms DQN, TD3, and PPO on the CartPole and Inverted Pendulum benchmarks: it converges early to near-optimal performance, maintains stable training throughout, and is robust to perturbations in network architecture and environment dynamics. These results suggest that RPI mitigates common DRL issues, including training oscillation, sample inefficiency, and tedious hyperparameter tuning.
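The monotonicity that RPI carries over to function approximation can be illustrated with classical tabular policy iteration. The sketch below runs evaluate-then-improve rounds on a hypothetical 2-state, 2-action MDP (the MDP, constants, and function names are illustrative, not from the paper) and checks that the value estimates never decrease between rounds:

```python
# Illustrative only: a tiny hypothetical MDP sketching the tabular
# policy-iteration monotonicity property that RPI extends to neural
# networks. P[s][a] = list of (prob, next_state, reward) transitions.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}
GAMMA = 0.9  # discount factor

def evaluate(policy, tol=1e-10):
    """Policy evaluation: iterate the Bellman expectation backup to convergence."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

def improve(V):
    """Greedy policy improvement with respect to the value estimates V."""
    def q(s, a):
        return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
    return {s: max(P[s], key=lambda a: q(s, a)) for s in P}

policy = {0: 0, 1: 0}  # deliberately suboptimal initial policy
values = []
for _ in range(3):     # three rounds of evaluate-then-improve
    V = evaluate(policy)
    values.append(V)
    policy = improve(V)

# Monotonicity: value estimates never decrease between rounds.
assert all(values[i + 1][s] >= values[i][s] - 1e-9
           for i in range(len(values) - 1) for s in P)
```

In the tabular setting this guarantee follows from the policy improvement theorem; the paper's contribution, per the summary above, is preserving it once values are represented by neural networks rather than tables.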

📝 Abstract
In a recent work, we proposed Reliable Policy Iteration (RPI), which restores policy iteration's monotonicity-of-value-estimates property to the function approximation setting. Here, we assess the robustness of RPI's empirical performance on two classical control tasks -- CartPole and Inverted Pendulum -- under changes to neural network and environmental parameters. Relative to DQN, Double DQN, DDPG, TD3, and PPO, RPI reaches near-optimal performance early and sustains this policy as training proceeds. Because deep RL methods are often hampered by sample inefficiency, training instability, and hyperparameter sensitivity, our results highlight RPI's promise as a more reliable alternative.
Problem

Research questions and friction points this paper is trying to address.

Restoring policy iteration's monotonicity of value estimates under function approximation
Assessing robustness to changes in neural network and environmental parameters
Mitigating sample inefficiency, training instability, and hyperparameter sensitivity in deep RL
Innovation

Methods, ideas, or system contributions that make the work stand out.

Restores policy iteration's monotonicity of value estimates under function approximation
Reaches near-optimal performance early in training and sustains it
Demonstrates robustness to perturbations in network architecture and environment dynamics