🤖 AI Summary
In test-time scaling, reward models struggle to identify sparse correct answers, leaving performance bottlenecked at the level of majority voting. Method: This paper proposes Mirror-Critique, a framework that uses an instruction-tuned small language model to generate high-quality, multi-perspective critique signals by contrasting true and false solutions; combines rejection sampling with reinforcement learning with verifiable rewards (RLVR) to train a critique-capable Mirror-Verifier; and introduces multi-critique aggregation scoring with selective abstention to improve detection of rare correct solutions and model honesty. Contribution/Results: Experiments show substantial improvements over majority voting across multiple reasoning benchmarks, with significant gains in both answer accuracy and reliable identification of capability boundaries, particularly for low-frequency correct outputs.
📝 Abstract
Test-time scaling via solution sampling and aggregation has become a key paradigm for improving the reasoning performance of Large Language Models (LLMs). While reward model selection is commonly employed in this approach, it often fails to identify minority-yet-correct answers, which limits its effectiveness beyond that of simple majority voting. We argue that this limitation stems from a lack of informative critique signals during verifier training. To bridge this gap, we introduce Mirror-Critique, a framework that trains a verifier with informative critiques. Our key insight is to leverage the rich critique signal obtained by contrasting model-generated solutions with ground-truth solutions. We deploy a small instruction-tuned model with rejection sampling to synthesize high-quality critique data that teaches the verifier not only what is wrong, but also why. The synthetic data is used to cold-start the LLMs in the RLVR process to further improve their verification ability. The resulting Mirror-Verifier evaluates candidate solutions by generating multiple critiques per solution and aggregating them into a verification score used for weighted voting or selective abstention. The experimental results show that our Mirror-Verifier significantly outperforms majority voting in solution accuracy and also improves the solver's honesty, enabling it to recognize and abstain from answering questions beyond its capability boundary.
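The aggregation step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the mean-of-verdicts scoring, and the abstention threshold `tau` are all illustrative assumptions; the abstract only states that multiple critiques per solution are aggregated into a verification score used for weighted voting or selective abstention.

```python
from collections import defaultdict

def aggregate_score(critique_verdicts):
    # Illustrative assumption: each critique yields a binary verdict
    # (1 = judged correct), and the verification score is their mean.
    return sum(critique_verdicts) / len(critique_verdicts)

def weighted_vote(candidates, tau=0.5):
    """candidates: list of (answer, critique_verdicts) pairs, one per
    sampled solution. Sums verification scores per distinct answer
    (score-weighted voting) and abstains (returns None) if even the
    best answer's total score falls below the threshold tau."""
    totals = defaultdict(float)
    for answer, verdicts in candidates:
        totals[answer] += aggregate_score(verdicts)
    best_answer, best_score = max(totals.items(), key=lambda kv: kv[1])
    if best_score < tau:
        return None  # selective abstention: no answer is trusted enough
    return best_answer
```

Under this sketch, a minority answer can still win if the verifier's critiques score it highly, which is exactly the failure mode of plain majority voting that the framework targets.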