Anchored Policy Optimization: Mitigating Exploration Collapse Via Support-Constrained Rectification

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses exploration collapse in reinforcement learning, where the combined dynamics of positive sharpening and negative squeezing induce Recursive Space Contraction (RSC): an irreversible decay in the probability of sampling valid alternative actions. To mitigate this, the authors propose Anchored Policy Optimization (APO), which constructs a safe manifold from the high-confidence support set of a reference model. By shifting the policy constraint from global distribution matching to support-set coverage, APO provides an elastic recovery mechanism that permits aggressive policy sharpening while preserving resilience against catastrophic collapse, breaking the inherent trade-off between accuracy and diversity. Empirical results show that APO significantly improves Pass@1 accuracy on mathematical reasoning tasks and recovers the Pass@K diversity lost by standard policy gradient methods.

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) is increasingly viewed as a tree pruning mechanism. However, we identify a systemic pathology termed Recursive Space Contraction (RSC), an irreversible collapse driven by the combined dynamics of positive sharpening and negative squeezing, where the sampling probability of valid alternatives vanishes. While Kullback-Leibler (KL) regularization aims to mitigate this, it imposes a rigid Shape Matching constraint that forces the policy to mimic the reference model's full density, creating a gradient conflict with the sharpening required for correctness. We propose Anchored Policy Optimization (APO), shifting the paradigm from global Shape Matching to Support Coverage. By defining a Safe Manifold based on the reference model's high-confidence support, APO permits aggressive sharpening for efficiency while selectively invoking a restorative force during error correction to prevent collapse. We theoretically derive that APO serves as a gradient-aligned mechanism to maximize support coverage, enabling an Elastic Recovery that re-inflates valid branches. Empirical evaluations on mathematical benchmarks demonstrate that APO breaks the accuracy-diversity trade-off, significantly improving Pass@1 while restoring the Pass@K diversity typically lost by standard policy gradient methods.
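The core contrast in the abstract, replacing a global KL "shape matching" penalty with a penalty that only asks the policy to retain mass on the reference model's high-confidence support, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the threshold `tau` defining the safe manifold, the function names, and the single-step distributions are all assumptions for the example.

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over an action/token distribution
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def apo_penalty(policy_logits, ref_logits, tau=0.05):
    """Hypothetical support-coverage penalty (APO-style).

    Instead of matching the reference density everywhere, only require
    the policy to keep probability mass on the safe manifold
    S = {a : ref(a) >= tau}, i.e. the reference model's high-confidence
    support. The penalty is near zero as long as S is covered, no matter
    how sharply the policy concentrates within S.
    """
    pi = softmax(policy_logits)
    ref = softmax(ref_logits)
    support = ref >= tau              # assumed thresholded support set
    coverage = pi[support].sum()      # policy mass retained on S
    return -np.log(coverage + 1e-12)  # restorative force only off-support

def kl_penalty(policy_logits, ref_logits):
    """Standard global shape-matching penalty KL(pi || ref), for contrast."""
    pi = softmax(policy_logits)
    ref = softmax(ref_logits)
    return float(np.sum(pi * (np.log(pi + 1e-12) - np.log(ref + 1e-12))))
```

A policy that sharpens aggressively onto one in-support action incurs almost no support-coverage penalty but a large KL penalty, which is the gradient conflict the abstract describes; a policy whose mass has drifted off the safe manifold incurs a large coverage penalty, supplying the selective restorative force.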
Problem

Research questions and friction points this paper is trying to address.

Recursive Space Contraction
Exploration Collapse
Reinforcement Learning with Verifiable Rewards
Support Coverage
Policy Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Anchored Policy Optimization
Support Coverage
Recursive Space Contraction
Safe Manifold
Elastic Recovery