🤖 AI Summary
Existing reinforcement learning with verifiable rewards (RLVR) methods suffer from insufficient exploration: pass@1 improves while pass@K (K>1) degrades, primarily because verifiable rewards drive excessive token-level concentration of probability mass on the top-1 candidate.
Method: We propose SimKO (Simple Pass@K Optimization), a reinforcement learning method that mitigates probability over-concentration to improve multi-path reasoning. Through token-level analysis of probability distributions over vocabulary candidates, SimKO links over-concentration on the top-1 candidate to degraded pass@K and introduces an asymmetric update strategy: at high-entropy tokens, it boosts the probabilities of the top-K candidates in verified-correct responses while applying stronger penalties to the top-1 candidate in verified-incorrect responses. It requires no architectural modifications or reward-function redesign, relying solely on verifiable reward signals and token-level distribution control.
Contribution/Results: SimKO consistently improves pass@K (e.g., K=5, 10, 20) across multiple mathematical and logical reasoning benchmarks, systematically enhancing exploration within the verifiable reasoning paradigm while remaining simple and broadly applicable.
📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has advanced the reasoning capabilities of large language models (LLMs). However, prevailing RLVR methods exhibit a systematic bias toward exploitation over exploration, as evidenced by improved pass@1 but reduced pass@K (K>1) performance. To understand this issue, we analyze training dynamics of RLVR methods by tracking the token-level probability distributions over vocabulary candidates. Our analysis reveals a consistent probability concentration effect where the top-1 candidate increasingly accumulates probability mass and suppresses that of other candidates. More importantly, stronger over-concentration correlates with worse pass@K performance. Inspired by this finding, we propose Simple Pass@K Optimization (SimKO), a method designed to mitigate the over-concentration issue, thereby encouraging exploration. SimKO operates in an asymmetrical manner. For verified-correct responses, it boosts the probabilities of the top-K candidates. For verified-incorrect responses, it applies stronger penalties to the top-1 candidate. We observe that this asymmetric design is particularly effective at mitigating over-concentration when applied at tokens with high entropy. Across various math and logical-reasoning benchmarks, SimKO consistently yields higher pass@K for a wide range of K, providing a simple way to improve RLVR's exploration.
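The asymmetric update described in the abstract can be sketched as a toy token-level rule. This is a minimal illustration, not the paper's actual gradient formulation: the function name and the hyperparameters `k`, `boost`, `penalty`, and the entropy threshold `tau` are assumptions chosen for clarity.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a probability vector (nats)."""
    return -np.sum(p * np.log(p + 1e-12))

def simko_style_update(logits, correct, k=3, boost=0.5, penalty=1.0, tau=1.0):
    """Illustrative asymmetric logit update (hypothetical parameters).

    For a verified-correct response, boost the top-k candidates to spread
    probability mass; for a verified-incorrect response, penalize only the
    top-1 candidate, which carries the over-concentrated mass. The update
    is applied only at high-entropy tokens.
    """
    # Softmax to get the token's probability distribution
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Skip low-entropy (already near-deterministic) tokens
    if entropy(probs) < tau:
        return logits

    order = np.argsort(probs)[::-1]  # candidate indices, most probable first
    new_logits = logits.copy()
    if correct:
        new_logits[order[:k]] += boost   # reward the top-k candidates
    else:
        new_logits[order[0]] -= penalty  # hit only the dominant wrong candidate
    return new_logits
```

The asymmetry is the key design choice: correct responses reinforce several plausible continuations (preserving diversity for pass@K), while incorrect responses concentrate the penalty on the single candidate most responsible for over-concentration.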