Critique to Verify: Accurate and Honest Test-Time Scaling with RL-Trained Verifiers

📅 2025-09-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
In test-time scaling, reward models struggle to identify sparse correct answers, leaving performance bottlenecked at the level of majority voting. Method: This paper proposes Mirror-Critique, a framework that uses a small instruction-tuned language model to synthesize high-quality, multi-perspective critiques by contrasting model-generated solutions with ground-truth solutions; filters this data with rejection sampling and uses it to cold-start reinforcement learning with verifiable rewards (RLVR), yielding a critique-capable Mirror-Verifier; and introduces multi-critique aggregation scoring plus selective abstention to better detect rare correct solutions and improve model honesty. Contribution/Results: Experiments across multiple reasoning benchmarks show substantial gains over majority voting in both answer accuracy and reliable identification of capability boundaries, particularly for low-frequency correct outputs.

📝 Abstract
Test-time scaling via solution sampling and aggregation has become a key paradigm for improving the reasoning performance of Large Language Models (LLMs). While reward model selection is commonly employed in this approach, it often fails to identify minority-yet-correct answers, which limits its effectiveness beyond that of simple majority voting. We argue that this limitation stems from a lack of informative critique signals during verifier training. To bridge this gap, we introduce Mirror-Critique, a framework that trains a verifier with informative critiques. Our key insight is to leverage the rich critique signal obtained by contrasting model-generated solutions with ground-truth solutions. We deploy a small instruction-tuned model to synthesize high-quality critique data via rejection sampling, teaching the verifier not only what is wrong, but also why. The synthetic data is used to cold-start the LLMs in the RLVR process to further improve their verification ability. The resulting Mirror-Verifier is deployed to evaluate candidate solutions by generating multiple critiques per solution and aggregating them into a verification score used for weighted voting or selective abstention. The experimental results show that our Mirror-Verifier significantly outperforms majority voting in terms of solution accuracy and also improves the solver's honesty, enabling it to recognize and abstain from questions beyond its capability boundaries.
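As the abstract describes, the Mirror-Verifier generates multiple critiques per candidate solution and aggregates them into a verification score used for weighted voting or selective abstention. A minimal sketch of that aggregation step (the binary-verdict format, the fraction-of-critiques score, and the `abstain_threshold` cutoff are illustrative assumptions, not the paper's exact formulation):

```python
from collections import defaultdict

def aggregate_with_verifier(candidates, abstain_threshold=0.5):
    """Score candidates by multi-critique agreement, then weighted-vote.

    candidates: list of (final_answer, verdicts) pairs, where verdicts are
    the binary correct/incorrect judgments from several verifier critiques
    of one sampled solution. Returns the winning answer, or None to abstain.
    """
    weights = defaultdict(float)
    best_score = 0.0
    for answer, verdicts in candidates:
        # Verification score: fraction of critiques judging the solution correct.
        score = sum(verdicts) / len(verdicts)
        weights[answer] += score  # weighted voting over final answers
        best_score = max(best_score, score)
    # Selective abstention: answer only if some candidate earns enough
    # verifier confidence; otherwise the question likely lies beyond the
    # solver's capability boundary.
    if best_score < abstain_threshold:
        return None
    return max(weights, key=weights.get)
```

Under this scheme a minority-yet-correct answer can win: several low-scoring copies of a wrong majority answer are outweighed by one high-scoring correct solution, which is exactly the failure mode of plain majority voting that the paper targets.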
Problem

Research questions and friction points this paper is trying to address.

Improving reasoning accuracy beyond majority voting limitations
Training verifiers with informative critique signals for solutions
Enhancing model honesty through selective abstention mechanisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trains verifier using contrastive critique signals
Synthesizes critique data via rejection sampling
Aggregates multiple critiques for weighted voting
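The rejection-sampling step in the second bullet can be sketched as follows: sample several critiques of a solution from a small instruction-tuned critic and keep only those whose verdict agrees with the ground-truth correctness label, so the retained critiques explain why a wrong solution is wrong (or a correct one correct). The `critic` callable and its `(text, verdict)` return format are assumptions for illustration:

```python
def rejection_sample_critiques(solution, is_correct, critic, n_samples=8):
    """Filter sampled critiques against the ground-truth label.

    solution: a model-generated solution string.
    is_correct: ground-truth label for that solution (True/False).
    critic: assumed callable returning (critique_text, verdict), where
            verdict is the critic's own correct/incorrect judgment.
    Returns the critiques whose verdict matches the ground truth; these
    form the synthetic data used to cold-start verifier training.
    """
    kept = []
    for _ in range(n_samples):
        text, verdict = critic(solution)
        if verdict == is_correct:  # reject critiques that misjudge the solution
            kept.append(text)
    return kept
```

The filter discards critiques that reach the wrong verdict, so only internally consistent explanations survive into the training set.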
Authors

Zhicheng Yang
The Hong Kong University of Science and Technology (Guangzhou)

Zhijiang Guo
HKUST (GZ) | HKUST
Natural Language Processing, Machine Learning, Large Language Models

Yinya Huang
Postdoc Fellow at ETH AI Center, ETH Zürich; prev. CityU Hong Kong, SYSU
AI for Math, AI for Science, Reliable Machine Learning, LLMs, NLP

Yongxin Wang
Sun Yat-sen University, MBZUAI

Yiwei Wang
University of California, Merced

Xiaodan Liang
Professor of Computer Science, Sun Yat-sen University, MBZUAI, CMU, NUS
Computer Vision, Embodied AI, Machine Learning

Jing Tang
The Hong Kong University of Science and Technology (Guangzhou), The Hong Kong University of Science and Technology