JudgeRLVR: Judge First, Generate Second for Efficient Reasoning

📅 2026-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models trained with reinforcement learning for reasoning often fall into inefficient, lengthy trial-and-error because they are optimized solely for final-answer correctness, compromising both efficiency and verifiability. To address this, this work proposes a two-stage "discriminate-then-generate" paradigm: first training the model to judge the quality of candidate solutions against verifiable answers, then using this discriminative capability as a prior to initialize and guide the generative model during reinforcement learning. Combining discriminative fine-tuning with Reinforcement Learning with Verifiable Rewards (RLVR), the approach yields substantial improvements on Qwen3-30B-A3B: a 3.7-point average accuracy gain with a 42% reduction in generation length on in-domain mathematical reasoning, plus a 4.5-point average accuracy improvement on out-of-domain benchmarks, demonstrating enhanced reasoning efficiency, accuracy, and cross-domain generalization.

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for reasoning in Large Language Models. However, optimizing solely for final-answer correctness often drives models into aimless, verbose exploration, where they rely on exhaustive trial-and-error tactics rather than structured planning to reach solutions. While heuristic constraints such as length penalties can reduce verbosity, they often truncate essential reasoning steps, creating a difficult trade-off between efficiency and verification. In this paper, we argue that discriminative capability is a prerequisite for efficient generation: by learning to distinguish valid solutions, a model can internalize a guidance signal that prunes the search space. We propose JudgeRLVR, a two-stage judge-then-generate paradigm. In the first stage, we train the model to judge solution responses against verifiable answers. In the second stage, we fine-tune the same model with vanilla generative RLVR, initialized from the judge. Compared to vanilla RLVR using the same math-domain training data, JudgeRLVR achieves a better quality-efficiency trade-off on Qwen3-30B-A3B: on in-domain math, it delivers about +3.7 points average accuracy with -42% average generation length; on out-of-domain benchmarks, it delivers about +4.5 points average accuracy, demonstrating enhanced generalization.
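The abstract describes the recipe only at a high level, so the toy Python below makes the two-stage structure concrete. It is a minimal runnable sketch under stated assumptions: the MockPolicy class, the 0/1 reward shapes, and all function names are hypothetical stand-ins for illustration, not the paper's implementation or API.

```python
# Hypothetical sketch of the two-stage "judge-then-generate" recipe.
# Stage 1 rewards correct discrimination of candidate solutions against
# verifiable answers; Stage 2 runs vanilla generative RLVR on final-answer
# correctness, initialized from the Stage-1 judge checkpoint.
import random
from dataclasses import dataclass

@dataclass
class Example:
    problem: str
    candidate: str  # a sampled solution to be judged in Stage 1
    answer: str     # verifiable ground-truth final answer

class MockPolicy:
    """Stand-in for the LLM policy; real training would update weights."""
    def judge(self, problem: str, candidate: str) -> bool:
        return random.random() < 0.5           # placeholder verdict
    def generate(self, problem: str) -> str:
        return random.choice(["42", "7"])      # placeholder final answer
    def update(self, reward: float) -> None:
        pass                                   # placeholder policy-gradient step

def stage1_judge_rlvr(policy, data, steps=3):
    """Stage 1: discriminative fine-tuning with a verifiable judge reward."""
    for _ in range(steps):
        for ex in data:
            verdict = policy.judge(ex.problem, ex.candidate)
            label = ex.candidate.strip().endswith(ex.answer)  # verifiable label
            policy.update(1.0 if verdict == label else 0.0)
    return policy

def stage2_generate_rlvr(policy, data, steps=3):
    """Stage 2: vanilla generative RLVR, initialized from the judge."""
    for _ in range(steps):
        for ex in data:
            answer = policy.generate(ex.problem)
            policy.update(1.0 if answer.strip() == ex.answer else 0.0)
    return policy

if __name__ == "__main__":
    data = [Example("What is 1 + 6?", "1 + 6 = 7", "7")]
    judge = stage1_judge_rlvr(MockPolicy(), data)    # judge first
    generator = stage2_generate_rlvr(judge, data)    # generate second
```

The key design point the sketch captures is the initialization: Stage 2 is ordinary RLVR, and the claimed gains come from starting it from the judge rather than from the base model.
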
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning with Verifiable Rewards
reasoning efficiency
Large Language Models
solution verification
verbosity-reasoning trade-off
Innovation

Methods, ideas, or system contributions that make the work stand out.

JudgeRLVR
verifiable rewards
two-stage reasoning
discriminative guidance
efficient generation

👥 Authors
Jiangshan Duo
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; LLM-Core Xiaomi
Hanyu Li
CFCS, School of Computer Science, Peking University; LLM-Core Xiaomi
Hailin Zhang
LLM-Core Xiaomi
Yudong Wang
Peking University (NLP, LLM, deep learning, machine learning)
Sujian Li
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; LLM-Core Xiaomi
Liang Zhao
StepFun (MLLM, LLM)