🤖 AI Summary
Large language models trained with reinforcement learning for reasoning often fall into inefficient, lengthy trial-and-error because they are optimized solely for final-answer correctness, which compromises both efficiency and verifiability. To address this, this work proposes a two-stage "judge-then-generate" paradigm: first train the model to judge the quality of solutions with verifiable answers, then use this discriminative capability as a prior to initialize and guide the generative model during reinforcement learning. Combining judge training with Reinforcement Learning with Verifiable Rewards (RLVR), the approach yields substantial gains for Qwen3-30B-A3B: a 3.7-point average accuracy gain with a 42% reduction in generation length on mathematical reasoning tasks, plus a 4.5-point average accuracy improvement on out-of-domain benchmarks, demonstrating enhanced reasoning efficiency, accuracy, and cross-domain generalization.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for reasoning in Large Language Models. However, optimizing solely for final-answer correctness often drives models into aimless, verbose exploration, where they rely on exhaustive trial-and-error tactics rather than structured planning to reach solutions. While heuristic constraints such as length penalties can reduce verbosity, they often truncate essential reasoning steps, creating a difficult trade-off between efficiency and verifiability. In this paper, we argue that discriminative capability is a prerequisite for efficient generation: by learning to distinguish valid solutions, a model internalizes a guidance signal that prunes the search space. We propose JudgeRLVR, a two-stage judge-then-generate paradigm. In the first stage, we train the model to judge solution responses against verifiable answers. In the second stage, we fine-tune the same model with vanilla generative RLVR, initialized from the judge. Compared to vanilla RLVR trained on the same math-domain data, JudgeRLVR achieves a better quality–efficiency trade-off for Qwen3-30B-A3B: on in-domain math, it delivers about +3.7 points average accuracy with −42% average generation length; on out-of-domain benchmarks, it delivers about +4.5 points average accuracy, demonstrating enhanced generalization.
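The judge-then-generate idea can be illustrated with a minimal toy sketch. Everything below is an illustrative stand-in, not the paper's implementation: `verifiable_reward` mimics the verifiable-answer check used for stage-1 judge training, `judge` mimics a learned discriminator that also prefers concise solutions, and `select_rollout` mimics how that judge prior could steer stage-2 RLVR toward correct, shorter rollouts.

```python
# Toy sketch of JudgeRLVR's two stages (illustrative only; the paper trains
# a single LLM, first as a judge, then as a generator via RLVR).

def verifiable_reward(solution: str, answer: str) -> float:
    """Stage 1 supervision: a verifiable check on the final answer.
    Here we simply test whether the solution ends with the answer."""
    return 1.0 if solution.strip().endswith(answer) else 0.0

def judge(solution: str, answer: str) -> float:
    """Stand-in for the learned judge: scores solution quality.
    It mirrors the verifiable reward plus a small brevity bonus,
    mimicking a prior that prunes verbose trial-and-error."""
    correct = verifiable_reward(solution, answer)
    brevity = 1.0 / (1.0 + len(solution.split()))
    return correct + 0.1 * brevity

def select_rollout(candidates: list[str], answer: str) -> str:
    """Stage 2 intuition: among sampled rollouts, the judge prior
    favors solutions that are both correct and concise."""
    return max(candidates, key=lambda s: judge(s, answer))

candidates = [
    "try 5, no; try 6, no; try 7, yes, so the answer is 7",  # verbose trial-and-error
    "2x + 1 = 15, so x = 7",                                 # structured and short
]
best = select_rollout(candidates, "7")
```

Both candidates are "correct" under the toy verifier, so the judge's brevity term is what breaks the tie, which is the intuition behind the reported shorter generations at equal or better accuracy.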