Self-Trained Verification for Training- and Test-Time Self-Improvement

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a bottleneck in the self-improvement of reasoning models: at both training and test time, the verifier struggles to catch errors in self-generated solutions. The authors propose Self-Trained Verification (STV), which builds on the observation that a model that misses these errors on its own can catch them when shown the reference solution. The reference-informed judgment becomes the supervision target, so the verifier learns to imitate a more informed version of itself. The authors also introduce Verifier-in-the-Loop (ViL) training, which trains the generator with reinforcement learning using the STV verifier's feedback inside the verification-refinement loop. In experiments, STV raises accuracy on scientific reasoning tasks from 1.5% to 21% (a 14-fold improvement), and ViL lifts the generator's standalone pass@1, measured with no verifier at test time, by 30% relative to where standard RL had converged.
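The supervision idea can be illustrated with a minimal sketch. This is not the authors' implementation: the `verifier` callable, the `VerifierExample` record, the prompt layout, and the choice to keep only cases where the blind and reference-informed judgments differ are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: verifier(problem, candidate, reference) -> critique text.
# Passing reference=None means the verifier judges the candidate on its own.
Verifier = Callable[[str, str, Optional[str]], str]


@dataclass
class VerifierExample:
    prompt: str  # what the trained verifier will see (never contains the reference)
    target: str  # critique the same verifier produced when shown the reference


def build_stv_examples(
    verifier: Verifier,
    items: List[Tuple[str, str, str]],  # (problem, candidate, reference)
    keep_only_disagreements: bool = True,  # assumption, not taken from the paper
) -> List[VerifierExample]:
    """Pair a reference-free prompt with the reference-informed critique."""
    examples = []
    for problem, candidate, reference in items:
        blind = verifier(problem, candidate, None)
        informed = verifier(problem, candidate, reference)
        if keep_only_disagreements and blind == informed:
            continue  # the blind verifier already matches its informed self
        prompt = f"Problem: {problem}\nCandidate: {candidate}"
        examples.append(VerifierExample(prompt=prompt, target=informed))
    return examples


def toy_verifier(problem: str, candidate: str, reference: Optional[str]) -> str:
    """Toy stand-in: it only notices a mistake when the reference is visible."""
    if reference is None or candidate == reference:
        return "looks correct"
    return f"incorrect: expected {reference}, got {candidate}"


items = [("2+2", "5", "4"), ("1+1", "2", "2")]
examples = build_stv_examples(toy_verifier, items)
```

In this toy run only the first item yields a training example, because it is the only one where seeing the reference changes the verdict. The resulting pairs could then feed an ordinary supervised fine-tuning step for the verifier.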

📝 Abstract

Self-improvement at scale has been a longstanding goal for reasoning models, and there are two natural places to do it: at test time, through verification-refinement (V-R) loops; and at training time, through self-training methods. Both are gated by the same bottleneck: the verifier. V-R loops stall when verifier scores inflate while accuracy stagnates, and when feedback is too generic to act on; self-training fails similarly when bad self-generated data are added to training. Better verification would unlock both, but the capability we want to train, i.e., catching self-generated errors, lacks training signal. To address this challenge, we propose self-trained verification (STV). Our key observation is that, while a model cannot catch these errors alone, it can when shown the reference solution. We turn this asymmetry into a supervision target and train the verifier to imitate a more informed version of itself. At test time, STV substantially improves V-R loops on hard problems, while alternatives (e.g., SFT, RL on verifier scores, and even meta-verifiers) do not. STV roughly doubles accuracy on hard math and lifts it 14x on scientific reasoning tasks (1.5% to 21%). At training time, we additionally train the generator using RL with the STV verifier's feedback inside the V-R loop, a procedure we call verifier-in-the-loop training (ViL). Starting from an RL-converged generator, ViL yields a further 33% gain in pass@1. More notably, the generator's standalone pass@1, with no verifier at test time, climbs 30% relative past where standard RL had converged. Hence, the next frontier in reasoning on hard problems may lie in how we train for and with verification.
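A verification-refinement loop of the kind the abstract describes might look like the sketch below. The `generate` and `verify` callables, the score scale, the stopping threshold, and the round limit are hypothetical; the abstract does not specify them.

```python
from typing import Callable, Tuple

# Hypothetical interfaces, assumed for illustration only.
Generator = Callable[[str, str], str]               # (problem, feedback) -> solution
Verifier = Callable[[str, str], Tuple[float, str]]  # (problem, solution) -> (score, feedback)


def verify_refine(
    problem: str,
    generate: Generator,
    verify: Verifier,
    max_rounds: int = 4,
    accept_at: float = 1.0,
) -> Tuple[str, float]:
    """Alternate generation and verification, returning the best-scored solution."""
    feedback = ""
    best_solution, best_score = "", float("-inf")
    for _ in range(max_rounds):
        solution = generate(problem, feedback)
        score, feedback = verify(problem, solution)
        if score > best_score:
            best_solution, best_score = solution, score
        if score >= accept_at:
            break  # the verifier accepts this solution
    return best_solution, best_score


def toy_generate(problem: str, feedback: str) -> str:
    """Toy generator: it only changes its answer when the feedback is actionable."""
    return "4" if "try 4" in feedback else "5"


def specific_verifier(problem: str, solution: str) -> Tuple[float, str]:
    if solution == "4":
        return 1.0, "ok"
    return 0.0, "arithmetic error, try 4"


def generic_verifier(problem: str, solution: str) -> Tuple[float, str]:
    if solution == "4":
        return 1.0, "ok"
    return 0.0, "something is wrong"


fixed = verify_refine("2+2", toy_generate, specific_verifier)
stalled = verify_refine("2+2", toy_generate, generic_verifier)
```

With the specific verifier the toy loop reaches the correct answer on the second round, while the generic verifier leaves the generator stuck on its first attempt, which mirrors the abstract's point that feedback too generic to act on stalls the loop.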

Problem

Research questions and friction points this paper is trying to address.

self-improvement

verifier

self-training

verification-refinement

reasoning models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Trained Verification

Verification-Refinement Loop

Verifier-in-the-Loop Training