SimKO: Simple Pass@K Policy Optimization

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reinforcement learning with verifiable rewards (RLVR) methods suffer from insufficient exploration: pass@1 improves while pass@K (K>1) degrades, primarily because verification rewards drive excessive token-level probability concentration on the top-1 candidate. Method: We propose SimKO, a reinforcement-learning optimization framework for multi-path reasoning. Through token-level probability analysis, SimKO identifies this over-concentration effect and introduces an asymmetric gradient update strategy: at high-entropy tokens, it boosts the probabilities of the top-K candidates for verified-correct responses while applying stronger penalties to the top-1 candidate for verified-incorrect responses. It requires no architectural modifications or reward-function redesign, relying solely on verifiable reward signals and control of the token probability distribution. Contribution/Results: SimKO consistently improves pass@K (K=5, 10, 20) across multiple mathematical and logical-reasoning benchmarks, offering a simple and broadly applicable way to enhance exploration within the verifiable-reasoning paradigm.

📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has advanced the reasoning capabilities of large language models (LLMs). However, prevailing RLVR methods exhibit a systematic bias toward exploitation over exploration, as evidenced by improved pass@1 but reduced pass@K (K>1) performance. To understand this issue, we analyze training dynamics of RLVR methods by tracking the token-level probability distributions over vocabulary candidates. Our analysis reveals a consistent probability concentration effect where the top-1 candidate increasingly accumulates probability mass and suppresses that of other candidates. More importantly, stronger over-concentration correlates with worse pass@K performance. Inspired by this finding, we propose Simple Pass@K Optimization (SimKO), a method designed to mitigate the over-concentration issue, thereby encouraging exploration. SimKO operates in an asymmetrical manner. For verified-correct responses, it boosts the probabilities of the top-K candidates. For verified-incorrect responses, it applies stronger penalties to the top-1 candidate. We observe that this asymmetric design is particularly effective at mitigating over-concentration when applied at tokens with high entropy. Across various math and logical-reasoning benchmarks, SimKO consistently yields higher pass@K for a wide range of K, providing a simple way to improve RLVR's exploration.
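The asymmetric, entropy-gated update described in the abstract can be sketched as a per-token weighting rule. This is a minimal illustration under assumptions, not the paper's implementation: the function name `simko_weight`, the entropy threshold, and the `boost`/`penalty` coefficients are hypothetical, and the actual method operates inside a full RLVR training loop rather than on a single token.

```python
import math

def simko_weight(logits, sampled, is_correct, k=2,
                 entropy_threshold=1.0, boost=1.0, penalty=1.5):
    """Sketch of SimKO-style asymmetric token updates (coefficients assumed).

    Returns a surrogate objective term for one token:
    - low-entropy tokens: standard signed log-prob of the sampled token;
    - high-entropy tokens, correct response: boost the top-K candidates;
    - high-entropy tokens, incorrect response: penalize the top-1 candidate.
    """
    # Softmax over vocabulary candidates (max-subtracted for stability).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    logp = [math.log(p) for p in probs]

    # Shannon entropy of the token distribution gates the asymmetric rule.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    if entropy < entropy_threshold:
        return (1.0 if is_correct else -1.0) * logp[sampled]

    if is_correct:
        # Verified-correct: spread credit over the top-K candidates,
        # not only the sampled token, to counter over-concentration.
        topk = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
        return boost * sum(logp[i] for i in topk)

    # Verified-incorrect: apply a stronger penalty to the top-1 candidate.
    top1 = max(range(len(probs)), key=probs.__getitem__)
    return -penalty * logp[top1]
```

At a high-entropy token, a correct rollout yields a positive-gradient term on several leading candidates, while an incorrect rollout pushes probability mass away from the dominant candidate specifically, which is the over-concentration mechanism the paper identifies.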
Problem

Research questions and friction points this paper is trying to address.

RLVR methods bias toward exploitation over exploration
Probability concentration effect suppresses candidate diversity
SimKO mitigates over-concentration to improve pass@K performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asymmetric probability adjustment for top candidates
Mitigates over-concentration in token distributions
Boosts top-K probabilities for correct responses