JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR

📅 2026-04-28
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the training instability in label-free reinforcement learning with verifiable rewards (RLVR) caused by false-positive rewards from majority voting or LLM-based judging. The authors propose JURY-RL, a framework that decouples answer proposal from reward assignment: a candidate answer is first proposed by a plurality vote over model rollouts, then checked by a formal verifier (Lean) to decide whether it may receive positive reward. When verification is inconclusive, the ResZero (Residual-Zero) mechanism discards the unverified plurality answer and redistributes a zero-mean, variance-preserving reward over the remaining answers, which keeps gradient updates stable without human annotations and avoids reinforcing unverifiable consensus. Across three backbone models trained on mathematical data, JURY-RL consistently outperforms other label-free baselines on mathematical reasoning benchmarks, reaches pass@1 performance comparable to supervised ground-truth training, and generalizes better as measured by pass@k and response diversity, while transferring competitively to code generation and general benchmarks.
πŸ“ Abstract
Reinforcement learning with verifiable rewards (RLVR) enhances the reasoning of large language models (LLMs), but standard RLVR often depends on human-annotated answers or carefully curated reward specifications. In machine-checkable domains, label-free alternatives such as majority voting or LLM-as-a-judge remove annotation cost but can introduce false positives that destabilize training. We introduce JURY-RL, a label-free RLVR framework that decouples answer proposal from reward disposal: votes from model rollouts propose a candidate answer, and a formal verifier determines whether that candidate can receive positive reward. Concretely, only rollouts matching the plurality-voted answer are rewarded when that answer is successfully verified in Lean. When verification is inconclusive, we invoke ResZero (Residual-Zero), a fallback reward that discards the unverified plurality proposal and redistributes a zero-mean, variance-preserving signal over the residual answers. This design maintains a stable optimization gradient without reinforcing unverifiable consensus. Across three backbone models trained on mathematical data, JURY-RL consistently outperforms other label-free baselines on mathematical reasoning benchmarks and transfers competitively to code generation and general benchmarks. It attains pass@1 performance comparable to supervised ground-truth training, with superior generalization demonstrated by higher pass@k and response diversity.
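The propose-then-dispose rule described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `jury_rewards` and the `verify` callback (standing in for a Lean check) are hypothetical, and the fallback branch uses one simple zero-mean construction (vote share among residual answers, centered) because the abstract does not give the ResZero formula. In particular, the sketch does not attempt the variance-preserving scaling.

```python
from collections import Counter
from typing import Callable, List


def jury_rewards(answers: List[str], verify: Callable[[str], bool]) -> List[float]:
    """Assign one reward per rollout from its final answer (illustrative only)."""
    counts = Counter(answers)
    # Votes propose: the plurality answer is the only candidate for positive reward.
    proposal, _ = counts.most_common(1)[0]

    # Proofs dispose: reward matching rollouts only if the verifier accepts.
    # `verify` is a hypothetical stand-in for a formal check such as Lean.
    if verify(proposal):
        return [1.0 if a == proposal else 0.0 for a in answers]

    # Fallback (assumed form, not the paper's formula): drop the unverified
    # proposal and score residual rollouts by their vote share among the
    # residuals, centered so the rewards sum to zero.
    residual = [a for a in answers if a != proposal]
    if not residual:
        return [0.0] * len(answers)
    share = {a: c / len(residual) for a, c in counts.items() if a != proposal}
    mean = sum(share[a] for a in residual) / len(residual)
    return [0.0 if a == proposal else share[a] - mean for a in answers]
```

With six rollouts answering `["4", "4", "4", "5", "5", "6"]`, a verifier that accepts `"4"` rewards only the first three rollouts. If the verifier rejects `"4"`, those rollouts receive zero and the remaining three receive a centered signal that sums to zero, so the unverified consensus is never reinforced.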
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning with Verifiable Rewards
Label-Free RL
False Positives
Training Stability
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

label-free RLVR
formal verification
majority voting
ResZero
reward decoupling