Learning to Self-Verify Makes Language Models Better Reasoners

πŸ“… 2026-02-07
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Large language models excel at generating reasoning paths but are notably weak at verifying their own answers, leaving a significant imbalance between generation and self-verification. To address this, the work proposes a multi-task reinforcement learning framework that jointly optimizes reasoning generation and self-verification as complementary objectives, explicitly integrating self-verification into the training process for the first time. Experiments show that the approach not only strengthens the model's self-verification capability but also reciprocally improves the accuracy of reasoning generation, consistently outperforming conventional generation-only training across multiple benchmarks and model architectures.

πŸ“ Abstract
Recent large language models (LLMs) achieve strong performance in generating promising reasoning paths for complex tasks. However, despite powerful generation ability, LLMs remain weak at verifying their own answers, revealing a persistent capability asymmetry between generation and self-verification. In this work, we conduct an in-depth investigation of this asymmetry throughout training evolution and show that, even on the same task, improving generation does not lead to corresponding improvements in self-verification. Interestingly, we find that the reverse direction of this asymmetry behaves differently: learning to self-verify can effectively improve generation performance, achieving accuracy comparable to standard generation training while yielding more efficient and effective reasoning traces. Building on this observation, we further explore integrating self-verification into generation training by formulating a multi-task reinforcement learning framework, where generation and self-verification are optimized as two independent but complementary objectives. Extensive experiments across benchmarks and models demonstrate performance gains over generation-only training in both generation and verification capabilities.
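The multi-task setup described in the abstract, where generation and self-verification are optimized as two independent but complementary objectives under one policy, can be sketched as a reward router. This is a minimal illustration only: the function names, task labels, and exact-match reward shapes below are assumptions for exposition, not the paper's actual implementation.

```python
# Hypothetical sketch of a multi-task RL reward: each sampled rollout is
# scored either as a generation episode (did the final answer match the
# reference?) or as a self-verification episode (did the model's verdict
# on a candidate solution match the ground truth?). The two objectives
# stay independent but share the same policy during RL updates.

def generation_reward(predicted_answer: str, gold_answer: str) -> float:
    """1.0 if the generated final answer matches the reference, else 0.0."""
    return 1.0 if predicted_answer.strip() == gold_answer.strip() else 0.0

def verification_reward(predicted_verdict: bool, solution_is_correct: bool) -> float:
    """1.0 if the model's correct/incorrect verdict matches ground truth."""
    return 1.0 if predicted_verdict == solution_is_correct else 0.0

def multitask_reward(task: str, **kwargs) -> float:
    """Route a rollout to the objective it was sampled for."""
    if task == "generate":
        return generation_reward(kwargs["predicted_answer"], kwargs["gold_answer"])
    if task == "verify":
        return verification_reward(kwargs["predicted_verdict"],
                                   kwargs["solution_is_correct"])
    raise ValueError(f"unknown task: {task}")
```

In a framework like this, verification episodes supply a training signal even when the model's own generations are wrong, which is one plausible mechanism for the reported gains in both capabilities.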
Problem

Research questions and friction points this paper is trying to address.

self-verification
reasoning
large language models
capability asymmetry
generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-verification
reasoning
multi-task reinforcement learning
large language models
capability asymmetry
πŸ”Ž Similar Papers
No similar papers found.