AI Summary
In reinforcement learning, a safe policy update requires a value function that is accurate under the state-visitation distribution of the updated policy, which is unknown before the update and therefore cannot be sampled. This work proposes Approximate Next Policy Sampling (ANPS), which shifts the training data distribution toward that of the next policy instead of relying on conventional conservative updates that shrink the policy step. Building on ANPS, the authors develop the Stable Value Approximate Policy Iteration (SV-API) framework, which holds the target policy fixed while an iteratively updated behavioral policy gathers experience, and commits to a new policy only once a convergence criterion is met. Applying SV-API to PPO yields SV-PPO. Experiments show that SV-PPO matches or improves on baseline performance across Atari and continuous control benchmarks while executing substantially larger target policy updates.
Abstract
We revisit a classic "chicken-and-egg" problem in reinforcement learning: to safely improve a policy, the value function must be accurate on the state-visitation distribution of the updated policy. That distribution over states is unknown and cannot be sampled for the purposes of training the value function. Conservative updates solve this problem, but at the cost of shrinking the policy update. This paper explores an alternative solution, Approximate Next Policy Sampling (ANPS), which addresses the problem by modifying the training distribution rather than constraining the policy update. ANPS is satisfied if the distribution of the training data approximates that of the next policy. To demonstrate the feasibility and efficacy of ANPS, we introduce Stable Value Approximate Policy Iteration (SV-API). SV-API modifies the standard approximate policy iteration loop to hold the target policy fixed while an iteratively updated behavioral policy gathers relevant experience. It only commits to a new policy once a convergence criterion has been met. If certain stability criteria are met, the update is guaranteed to be safe; otherwise, it remains no less safe than standard approximate policy iteration. Applying SV-API to PPO yields Stable Value PPO (SV-PPO), which matches or improves performance on high-dimensional discrete (Atari) and continuous control benchmarks while executing substantially larger target policy updates. These results demonstrate the viability of ANPS as a new solution to this classic challenge in RL.
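The control flow the abstract describes for SV-API can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `sv_api_outer_step`, the callables `collect`, `fit_value`, `improve`, and `distance`, the tolerance `tol`, and the fallback of keeping the old target when the inner loop does not converge are all assumptions made for illustration. The abstract does not specify the convergence criterion or how the behavioral policy is updated, so the sketch simply treats the latest candidate policy as the next behavioral policy and commits once consecutive candidates stop changing.

```python
from typing import Any, Callable, Tuple


def sv_api_outer_step(
    target: Any,
    collect: Callable[[Any], Any],        # hypothetical: behavior policy -> training data
    fit_value: Callable[[Any], Any],      # hypothetical: training data -> value estimate
    improve: Callable[[Any, Any], Any],   # hypothetical: (target, value) -> candidate policy
    distance: Callable[[Any, Any], float],  # hypothetical: gap between two policies
    tol: float = 1e-3,
    max_inner: int = 50,
) -> Tuple[Any, bool, int]:
    """One outer iteration. Returns (policy, committed, inner_iterations)."""
    behavior = target  # first round of data comes from the fixed target policy
    for i in range(1, max_inner + 1):
        data = collect(behavior)               # experience under the behavior policy
        value_fn = fit_value(data)             # value estimate trained on that data
        candidate = improve(target, value_fn)  # improvement always starts from the FIXED target
        if distance(candidate, behavior) < tol:
            # The data was gathered by (approximately) the policy we are about
            # to adopt, so the training distribution approximates the next policy's.
            return candidate, True, i
        behavior = candidate                   # otherwise resample under the new candidate
    return target, False, max_inner            # assumed fallback: keep the old target


# Toy usage with scalar "policies" (purely illustrative, no real RL involved).
toy_collect = lambda b: b                            # data summarised by the behavior parameter
toy_fit_value = lambda data: data                    # "value estimate" is just that summary
toy_improve = lambda t, v: t + 1.0 + 0.5 * (v - t)   # candidate depends on where data came from
toy_distance = lambda a, b: abs(a - b)

policy, committed, n_inner = sv_api_outer_step(
    0.0, toy_collect, toy_fit_value, toy_improve, toy_distance
)
```

In the toy example the candidate policy depends on which behavior policy produced the data, so the inner loop iterates until the candidate and the data-gathering policy agree to within `tol`. Only then is the new policy committed, which mirrors the idea of aligning the training distribution with the next policy rather than constraining the size of the update.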