🤖 AI Summary
This work addresses a key limitation of Reinforcement Learning with Verifiable Rewards (RLVR): gains in single-sample accuracy (Pass@1) often come at the cost of reduced multi-sample coverage (Pass@K), a sign of diversity collapse. We trace the collapse to a structural cause: common RLVR objectives such as GRPO are indifferent to how probability mass is distributed among correct solutions, so stochastic training dynamics concentrate mass on a narrow subset of them. We formalize this mechanism and show that the “uniform-correct policy”, which assigns equal probability mass to all verified correct answers, is the unique optimum under both robustness and entropy-regularized optimality criteria. Building on this analysis, we propose Uniform-Correct Policy Optimization (UCPO), a modification to GRPO that adds a conditional uniformity penalty and thereby redistributes gradient signal toward underrepresented correct responses. Across three models (1.5B–7B parameters) and five mathematical reasoning benchmarks, UCPO maintains competitive Pass@1 while improving Pass@K (up to +10% absolute at Pass@64 on AIME24) and diversity within the correct set (up to 45% higher equation-level diversity).
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has achieved substantial gains in single-attempt accuracy (Pass@1) on reasoning tasks, yet often suffers from reduced multi-sample coverage (Pass@K), indicating diversity collapse. We identify a structural cause for this degradation: common RLVR objectives, such as GRPO, are indifferent to how probability mass is distributed among correct solutions. Combined with stochastic training dynamics, this indifference induces a self-reinforcing collapse, in which probability mass concentrates on a narrow subset of correct outputs while alternative valid solutions are suppressed. We formalize this collapse mechanism and further characterize the optimal policy structure under two complementary criteria: robustness and entropy-regularized optimality, which identify the Uniform-Correct Policy as uniquely optimal. Motivated by this analysis, we propose Uniform-Correct Policy Optimization (UCPO), a modification to GRPO that adds a conditional uniformity penalty on the policy's distribution over correct solutions. The penalty redistributes gradient signal toward underrepresented correct responses, encouraging uniform allocation of probability mass within the correct set. Across three models (1.5B–7B parameters) and five mathematical reasoning benchmarks, UCPO improves Pass@K and diversity while maintaining competitive Pass@1, achieving up to +10% absolute improvement on AIME24 at Pass@64 and up to 45% higher equation-level diversity within the correct set. The code is available at https://github.com/AnamikaLochab/UCPO.
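The indifference described in the abstract can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes a small finite answer space with hypothetical probabilities, and it uses the KL divergence from the policy's conditional distribution over correct answers to the uniform distribution as one plausible way to measure non-uniformity (the exact form of the UCPO penalty is not stated in the abstract). The function names `correct_mass` and `kl_to_uniform_over_correct` are illustrative only.

```python
import math

def correct_mass(probs, correct):
    """Total probability placed on verified-correct answers (expected Pass@1)."""
    return sum(p for p, ok in zip(probs, correct) if ok)

def kl_to_uniform_over_correct(probs, correct):
    """KL(pi(. | correct) || Uniform(correct set)).

    Zero when mass is spread evenly over the correct answers, and larger
    the more it concentrates on a few of them. Assumed measure, not
    necessarily the penalty used in the paper.
    """
    mass = correct_mass(probs, correct)
    cond = [p / mass for p, ok in zip(probs, correct) if ok]
    n = len(cond)
    return sum(q * math.log(q * n) for q in cond if q > 0)

# Toy answer space: three correct solutions followed by one incorrect answer.
correct = [True, True, True, False]
collapsed = [0.88, 0.01, 0.01, 0.10]        # hypothetical collapsed policy
uniform_correct = [0.30, 0.30, 0.30, 0.10]  # hypothetical uniform-correct policy

for name, probs in [("collapsed", collapsed), ("uniform-correct", uniform_correct)]:
    print(name,
          "correct mass:", round(correct_mass(probs, correct), 3),
          "KL to uniform:", round(kl_to_uniform_over_correct(probs, correct), 3))
```

Both toy policies place the same total mass on correct answers, so an objective that rewards only correctness cannot tell them apart; only the conditional-uniformity measure separates them.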