Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) frequently exhibit "superficial self-reflection" in mathematical reasoning, generating answers without substantive post-hoc verification. To address this, we propose RISE, an online reinforcement learning framework that jointly optimizes solution generation and self-verification via dual trajectories. RISE trains a single policy end-to-end under PPO, guided by verifiable reward signals that simultaneously optimize both the reasoning path and the self-evaluation path. Crucially, policy updates are driven by a result-based automatic verifier, tightly coupling verification into the RL loop. Evaluated on multiple mathematical reasoning benchmarks, RISE achieves significant accuracy gains while substantially increasing both the frequency and correctness of self-verification. Although verification incurs modest computational overhead, it yields sustained performance improvements. The core contribution is the deep integration of self-verification into the RL policy update, which directly mitigates the "pseudo-introspection" problem endemic to current LLM reasoning systems.

📝 Abstract
Large Language Models (LLMs) show great promise in complex reasoning, with Reinforcement Learning with Verifiable Rewards (RLVR) being a key enhancement strategy. However, a prevalent issue is "superficial self-reflection", where models fail to robustly verify their own outputs. We introduce RISE (Reinforcing Reasoning with Self-Verification), a novel online RL framework designed to tackle this. RISE explicitly and simultaneously trains an LLM to improve both its problem-solving and self-verification abilities within a single, integrated RL process. The core mechanism involves leveraging verifiable rewards from an outcome verifier to provide on-the-fly feedback for both solution generation and self-verification tasks. In each iteration, the model generates solutions, then critiques its own on-policy generated solutions, with both trajectories contributing to the policy update. Extensive experiments on diverse mathematical reasoning benchmarks show that RISE consistently improves the model's problem-solving accuracy while concurrently fostering strong self-verification skills. Our analyses highlight the advantages of online verification and the benefits of increased verification compute. Additionally, RISE models exhibit more frequent and accurate self-verification behaviors during reasoning. These advantages reinforce RISE as a flexible and effective path towards developing more robust and self-aware reasoners.
Problem

Research questions and friction points this paper is trying to address.

Addresses superficial self-reflection in LLMs during reasoning
Enhances problem-solving and self-verification in a unified RL process
Improves accuracy and robustness via online feedback from verifiable rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrated RL framework for problem-solving and verification
On-the-fly feedback from verifiable rewards
Simultaneous training for solution and critique generation
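The dual-trajectory loop described above can be sketched in miniature. This is an illustrative toy, not the paper's implementation: `ToyPolicy`, `rise_iteration`, and both reward functions are hypothetical stand-ins for what RISE does with an LLM under PPO. The sketch only shows the reward flow: the outcome verifier scores the generated solution, the self-verification verdict is rewarded for agreeing with that verifier, and both trajectories feed the policy update.

```python
def outcome_verifier(solution, gold_answer):
    # Result-based reward: 1 if the final answer matches the reference, else 0.
    return 1.0 if solution["answer"] == gold_answer else 0.0

def verification_reward(critique_verdict, solution_reward):
    # The verification trajectory is rewarded when its verdict (correct /
    # incorrect) agrees with the outcome verifier's judgment.
    return 1.0 if critique_verdict == (solution_reward == 1.0) else 0.0

def rise_iteration(policy, problem, gold_answer):
    # 1) Generation trajectory: the policy solves the problem.
    solution = policy.generate(problem)
    r_gen = outcome_verifier(solution, gold_answer)
    # 2) Verification trajectory: the same policy critiques its own
    #    on-policy solution.
    verdict = policy.verify(problem, solution)
    r_ver = verification_reward(verdict, r_gen)
    # 3) Both (trajectory, reward) pairs contribute to the policy update
    #    (here simply returned; RISE would pass them to a PPO step).
    return [(solution["tokens"], r_gen), (("verify", verdict), r_ver)]

class ToyPolicy:
    """Stand-in for the LLM policy: sums the inputs and self-checks."""
    def generate(self, problem):
        return {"answer": sum(problem), "tokens": ("sum", *problem)}
    def verify(self, problem, solution):
        return solution["answer"] == sum(problem)  # toy self-verification

trajectories = rise_iteration(ToyPolicy(), problem=(2, 3), gold_answer=5)
```

Here both trajectories earn a reward of 1.0, since the toy solution is correct and the self-check agrees with the verifier; in RISE these rewards would drive a joint PPO update over both rollouts.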