🤖 AI Summary
This work addresses the limited self-improvement capability of current reasoning models, which the authors attribute to a bottleneck shared by test-time and training-time methods: verifiers that fail to catch the model's own self-generated errors. The authors propose self-trained verification (STV), which exploits an asymmetry: a model that cannot catch these errors unaided can catch them when shown the reference solution. That gap becomes a supervision target, so the verifier is trained to imitate a more informed version of itself. They further introduce verifier-in-the-loop training (ViL), which trains the generator with reinforcement learning using the STV verifier's feedback inside the verification-refinement loop. At test time, STV roughly doubles accuracy on hard math and raises accuracy on scientific reasoning tasks from 1.5% to 21% (a 14-fold improvement), whereas alternatives such as SFT, RL on verifier scores, and meta-verifiers do not. Starting from an RL-converged generator, ViL yields a further 33% gain in pass@1, and the generator's standalone pass@1, with no verifier at test time, rises 30% in relative terms.
📝 Abstract
Self-improvement at scale has been a longstanding goal for reasoning models, and there are two natural places to do it: at test time, through verification-refinement (V-R) loops; and at training time, through self-training methods. Both are gated by the same bottleneck: the verifier. V-R loops stall when verifier scores inflate while accuracy stagnates, and when feedback is too generic to act on; self-training fails similarly when bad self-generated data are added to training. Better verification would unlock both, but the capability we want to train, i.e., catching self-generated errors, lacks a training signal. To address this challenge, we propose self-trained verification (STV). Our key observation is that, while a model cannot catch these errors alone, it can when shown the reference solution. We turn this asymmetry into a supervision target and train the verifier to imitate a more informed version of itself. At test time, STV substantially improves V-R loops on hard problems, while alternatives (e.g., SFT, RL on verifier scores, and even meta-verifiers) do not. STV roughly doubles accuracy on hard math and lifts it 14x on scientific reasoning tasks (1.5% to 21%). At training time, we additionally train the generator using RL with the STV verifier's feedback inside the V-R loop, a procedure we call verifier-in-the-loop training (ViL). Starting from an RL-converged generator, ViL yields a further 33% gain in pass@1. More notably, the generator's standalone pass@1, with no verifier at test time, climbs 30% relative past where standard RL had converged. Hence, the next frontier in reasoning on hard problems may lie in how we train for and with verification.
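To make the test-time setting concrete, the following is a minimal sketch of a generic verification-refinement loop. It is not the authors' implementation: the function names, the `(score, feedback)` verdict shape, `max_rounds`, and `accept_threshold` are all assumptions chosen for illustration, and the toy callables stand in for model calls.

```python
from typing import Callable, List, Tuple

# Assumed verdict shape: (score in [0, 1], textual feedback).
Verdict = Tuple[float, str]


def verify_refine(
    problem: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], Verdict],
    refine: Callable[[str, str, str], str],
    max_rounds: int = 4,            # hypothetical budget, not from the paper
    accept_threshold: float = 0.9,  # hypothetical acceptance cutoff
) -> Tuple[str, List[float]]:
    """Run a generic V-R loop and return the final solution and score trace."""
    solution = generate(problem)
    scores: List[float] = []
    for _ in range(max_rounds):
        score, feedback = verify(problem, solution)
        scores.append(score)
        if score >= accept_threshold:
            break  # the verifier accepts the current solution
        # Otherwise revise the solution using the verifier's feedback.
        # If the budget runs out here, the last revision is returned unverified.
        solution = refine(problem, solution, feedback)
    return solution, scores


# Toy stand-ins for the three model calls (illustrative only).
def toy_generate(problem: str) -> str:
    return "6"  # deliberately wrong first attempt for "2+3"


def toy_verify(problem: str, solution: str) -> Verdict:
    a, b = (int(t) for t in problem.split("+"))
    guess = int(solution)
    if guess == a + b:
        return (1.0, "correct")
    return (0.0, "too high" if guess > a + b else "too low")


def toy_refine(problem: str, solution: str, feedback: str) -> str:
    step = -1 if feedback == "too high" else 1
    return str(int(solution) + step)
```

In this sketch the loop only makes progress because the toy verifier's feedback is specific enough to act on; a verifier that returned inflated scores or generic feedback would stall the loop in the two ways the abstract describes.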