Think Outside the Policy: In-Context Steered Policy Optimization

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing verifiable-reward RL methods (e.g., GRPO) suffer from limited exploration and low trajectory diversity due to their reliance on on-policy sampling; while incorporating strong expert models can alleviate this, it incurs prohibitive computational cost and requires scarce, high-quality expert demonstrations. To address these limitations, we propose In-Context Steered Policy Optimization (ICPO), a novel framework that leverages the in-context learning capability of large reasoning models (LRMs). ICPO integrates hybrid policy initialization, implicit expert guidance, expert-region rejection sampling, and annealed expert reward shaping—enabling substantial policy coverage expansion and training stabilization without requiring strong expert trajectories. Crucially, ICPO supports off-policy optimization, enhancing both sample efficiency and robustness. Empirical evaluation on mathematical reasoning benchmarks demonstrates significant improvements in performance and training stability, validating ICPO's scalability and effectiveness for LRM-based RL.

📝 Abstract
Existing Reinforcement Learning from Verifiable Rewards (RLVR) methods, such as Group Relative Policy Optimization (GRPO), have achieved remarkable progress in improving the reasoning capabilities of Large Reasoning Models (LRMs). However, they exhibit limited exploration due to their reliance on on-policy rollouts confined to the current policy's distribution, resulting in narrow trajectory diversity. Recent approaches attempt to expand policy coverage by incorporating trajectories generated from stronger expert models, yet this reliance increases computational cost, and such advanced models are often inaccessible. To address these issues, we propose In-Context Steered Policy Optimization (ICPO), a unified framework that leverages the inherent in-context learning capability of LRMs to provide expert guidance using existing datasets. ICPO introduces Mixed-Policy GRPO with Implicit Expert Forcing, which expands exploration beyond the current policy distribution without requiring advanced LRM trajectories. To further stabilize optimization, ICPO integrates Expert Region Reject Sampling to filter unreliable off-policy trajectories and Annealed Expert-Bonus Reward Shaping to balance early expert guidance with later autonomous improvement. Results demonstrate that ICPO consistently enhances reinforcement learning performance and training stability on mathematical reasoning benchmarks, revealing a scalable and effective RLVR paradigm for LRMs.
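The abstract's Annealed Expert-Bonus Reward Shaping can be pictured as adding a small bonus to the verifiable reward for trajectories that fall in the expert-guided region, with the bonus decayed over training so the model transitions from expert guidance to autonomous improvement. The sketch below is a minimal illustration of that idea, not the paper's implementation; the linear decay schedule, the `bonus_scale` parameter, and the function name are all assumptions.

```python
def shaped_reward(verifiable_reward: float, in_expert_region: bool,
                  step: int, total_steps: int, bonus_scale: float = 0.5) -> float:
    """Hypothetical annealed expert-bonus shaping.

    Early in training (step ~ 0) an expert-region trajectory receives the
    full bonus; the bonus decays linearly to zero by the end of training,
    leaving only the verifiable reward (the annealing described in ICPO's
    abstract, with an assumed linear schedule).
    """
    anneal = max(0.0, 1.0 - step / total_steps)  # 1.0 -> 0.0 over training
    bonus = bonus_scale * anneal if in_expert_region else 0.0
    return verifiable_reward + bonus
```

Under this assumed schedule, an expert-region trajectory with verifiable reward 1.0 is shaped to 1.5 at step 0 but back to 1.0 by the final step, while non-expert trajectories are never altered.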
Problem

Research questions and friction points this paper is trying to address.

Expanding limited policy exploration in reinforcement learning
Reducing reliance on inaccessible expert models for guidance
Enhancing training stability and trajectory diversity in RLVR
Innovation

Methods, ideas, or system contributions that make the work stand out.

In-context learning provides expert guidance from datasets
Mixed-policy optimization expands exploration beyond current policy
Reject sampling and reward shaping stabilize training process
Hsiu-Yuan Huang
National Key Laboratory for Multimedia Information Processing, Peking University
Chenming Tang
National Key Laboratory for Multimedia Information Processing, Peking University
Weijie Liu
Nankai University
System Security, Virtualization, Binary Analysis, Image Fusion
Saiyong Yang
LLM Department, Tencent
Yunfang Wu
Peking University
NLP