🤖 AI Summary
Large language models trained with reinforcement learning for reasoning often fall into inefficient, lengthy trial-and-error because they are optimized solely for final-answer correctness, which compromises both efficiency and verifiability. To address this, this work proposes a two-stage "judge-then-generate" paradigm: first train the model to judge the quality of solutions with verifiable answers, then use this discriminative capability as a prior to initialize and guide the generative model during reinforcement learning. Combining judge training with Reinforcement Learning with Verifiable Rewards (RLVR), the approach yields substantial gains for Qwen3-30B-A3B: a 3.7-point average accuracy gain with a 42% reduction in generation length on mathematical reasoning tasks, plus a 4.5-point average accuracy improvement on out-of-domain benchmarks, demonstrating enhanced reasoning efficiency, accuracy, and cross-domain generalization.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for reasoning in Large Language Models. However, optimizing solely for final-answer correctness often drives models into aimless, verbose exploration, where they rely on exhaustive trial-and-error tactics rather than structured planning to reach solutions. While heuristic constraints such as length penalties can reduce verbosity, they often truncate essential reasoning steps, creating a difficult trade-off between efficiency and verifiability. In this paper, we argue that discriminative capability is a prerequisite for efficient generation: by learning to distinguish valid solutions, a model internalizes a guidance signal that prunes the search space. We propose JudgeRLVR, a two-stage judge-then-generate paradigm. In the first stage, we train the model to judge solution responses against verifiable answers. In the second stage, we fine-tune the same model with vanilla generative RLVR, initialized from the judge. Compared to vanilla RLVR trained on the same math-domain data, JudgeRLVR achieves a better quality–efficiency trade-off for Qwen3-30B-A3B: on in-domain math, it delivers about +3.7 points average accuracy with −42% average generation length; on out-of-domain benchmarks, it delivers about +4.5 points average accuracy, demonstrating enhanced generalization.
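The judge-then-generate idea can be illustrated with a minimal toy sketch. Everything below is an illustrative stand-in, not the paper's implementation: `verifiable_reward` mimics the verifiable-answer check used for stage-1 judge training, `judge` mimics a learned discriminator that also prefers concise solutions, and `select_rollout` mimics how that judge prior could steer stage-2 RLVR toward correct, shorter rollouts.

```python
# Toy sketch of JudgeRLVR's two stages (illustrative only; the paper trains
# a single LLM, first as a judge, then as a generator via RLVR).

def verifiable_reward(solution: str, answer: str) -> float:
    """Stage 1 supervision: a verifiable check on the final answer.
    Here we simply test whether the solution ends with the answer."""
    return 1.0 if solution.strip().endswith(answer) else 0.0

def judge(solution: str, answer: str) -> float:
    """Stand-in for the learned judge: scores solution quality.
    It mirrors the verifiable reward plus a small brevity bonus,
    mimicking a prior that prunes verbose trial-and-error."""
    correct = verifiable_reward(solution, answer)
    brevity = 1.0 / (1.0 + len(solution.split()))
    return correct + 0.1 * brevity

def select_rollout(candidates: list[str], answer: str) -> str:
    """Stage 2 intuition: among sampled rollouts, the judge prior
    favors solutions that are both correct and concise."""
    return max(candidates, key=lambda s: judge(s, answer))

candidates = [
    "try 5, no; try 6, no; try 7, yes, so the answer is 7",  # verbose trial-and-error
    "2x + 1 = 15, so x = 7",                                 # structured and short
]
best = select_rollout(candidates, "7")
```

Both candidates are "correct" under the toy verifier, so the judge's brevity term is what breaks the tie, which is the intuition behind the reported shorter generations at equal or better accuracy.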