🤖 AI Summary
In test-time scaling, reward models struggle to identify sparse correct answers, leaving performance bottlenecked at the level of majority voting. Method: This paper proposes Mirror-Critique, a framework that uses an instruction-tuned small language model to generate high-quality, multi-perspective critique signals by contrasting true and false solutions; combines rejection sampling with reinforcement learning with verifiable rewards (RLVR) to train a critique-capable Mirror-Verifier; and introduces multi-critique aggregation scoring with selective abstention to improve detection of rare correct solutions and model honesty. Contribution/Results: Experiments show substantial improvements over majority voting across multiple reasoning benchmarks, with significant gains in both answer accuracy and reliable identification of capability boundaries, particularly for low-frequency correct outputs.
📝 Abstract
Test-time scaling via solution sampling and aggregation has become a key paradigm for improving the reasoning performance of Large Language Models (LLMs). While reward model selection is commonly employed in this approach, it often fails to identify minority-yet-correct answers, which limits its effectiveness beyond that of simple majority voting. We argue that this limitation stems from a lack of informative critique signals during verifier training. To bridge this gap, we introduce Mirror-Critique, a framework that trains a verifier with informative critiques. Our key insight is to leverage the rich critique signal obtained by contrasting model-generated solutions with ground-truth solutions. We deploy a small instruction-tuned model with rejection sampling to synthesize high-quality critique data that teaches the verifier not only what is wrong, but also why. The synthetic data is used to cold-start the LLMs in the RLVR process to further improve their verification ability. The resulting Mirror-Verifier evaluates candidate solutions by generating multiple critiques per solution and aggregating them into a verification score used for weighted voting or selective abstention. The experimental results show that our Mirror-Verifier significantly outperforms majority voting in solution accuracy and also improves the solver's honesty, enabling it to recognize and abstain from answering questions beyond its capability boundary.
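The aggregation step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the mean-of-verdicts scoring, and the abstention threshold `tau` are all illustrative assumptions; the abstract only states that multiple critiques per solution are aggregated into a verification score used for weighted voting or selective abstention.

```python
from collections import defaultdict

def aggregate_score(critique_verdicts):
    # Illustrative assumption: each critique yields a binary verdict
    # (1 = judged correct), and the verification score is their mean.
    return sum(critique_verdicts) / len(critique_verdicts)

def weighted_vote(candidates, tau=0.5):
    """candidates: list of (answer, critique_verdicts) pairs, one per
    sampled solution. Sums verification scores per distinct answer
    (score-weighted voting) and abstains (returns None) if even the
    best answer's total score falls below the threshold tau."""
    totals = defaultdict(float)
    for answer, verdicts in candidates:
        totals[answer] += aggregate_score(verdicts)
    best_answer, best_score = max(totals.items(), key=lambda kv: kv[1])
    if best_score < tau:
        return None  # selective abstention: no answer is trusted enough
    return best_answer
```

Under this sketch, a minority answer can still win if the verifier's critiques score it highly, which is exactly the failure mode of plain majority voting that the framework targets.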