Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning

📅 2026-04-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

It remains unclear whether the reasoning chains produced by reinforcement learning from verifiable rewards (RLVR) are causally important to, or sufficient for, a model's answers. This work proposes two quantitative metrics, Causal Importance of Reasoning (CIR) and Sufficiency of Reasoning (SR), to evaluate the role that reasoning chains actually play under RLVR. Experiments on the Qwen2.5 model series and ReasoningGym tasks show that RLVR improves task accuracy but does not reliably improve CIR or SR. Two simple changes address this: a small amount of supervised fine-tuning (SFT) before RLVR can remedy low CIR and SR, and adding auxiliary CIR/SR rewards on top of the outcome-based reward improves both metrics even without SFT, while matching the accuracy of RLVR alone.

📝 Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) on chain-of-thought reasoning has become a standard part of language model post-training recipes. A common assumption is that the reasoning chains trained through RLVR reliably represent how a model gets to its answer. In this paper, we develop two metrics for critically examining this assumption: Causal Importance of Reasoning (CIR), which measures the cumulative effect of reasoning tokens on the final answer, and Sufficiency of Reasoning (SR), which measures whether a verifier can arrive at an unambiguous answer based on the reasoning alone. Through experiments with the Qwen2.5 model series and ReasoningGym tasks, we find that: (1) while RLVR does improve task accuracy, it does not reliably improve CIR or SR, calling the role of reasoning in model performance into question; (2) a small amount of SFT before RLVR can be a remedy for low CIR and SR; and (3) CIR and SR can be improved even without SFT by applying auxiliary CIR/SR rewards on top of the outcome-based reward. This joint reward matches the accuracy of RLVR while also leading to causally important and sufficient reasoning. These results show that RLVR does not always lead models to rely on reasoning in the way that is commonly thought, but this issue can be remedied with simple modifications to the post-training procedure.
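The joint reward described in finding (3) can be pictured with a minimal sketch. The abstract does not give the exact form of the combination, so everything below is an assumption made for illustration: the additive weighting, the `RewardWeights` and `joint_reward` names, the default weights, and the convention that CIR and SR scores are already normalized to the range [0, 1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical weights for the auxiliary terms (not taken from the paper)."""
    cir: float = 0.5
    sr: float = 0.5


def joint_reward(
    outcome_correct: bool,
    cir_score: float,
    sr_score: float,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Combine an outcome reward with auxiliary CIR and SR terms.

    Assumption: cir_score and sr_score are precomputed and lie in [0, 1].
    How they are computed is outside the scope of this sketch.
    """
    for name, value in (("cir_score", cir_score), ("sr_score", sr_score)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")

    # Outcome-based reward: 1.0 for a verified-correct answer, else 0.0.
    outcome = 1.0 if outcome_correct else 0.0

    # Auxiliary terms are added on top of the outcome reward.
    return outcome + weights.cir * cir_score + weights.sr * sr_score


if __name__ == "__main__":
    # A correct answer whose reasoning scores moderately on CIR and low on SR.
    print(joint_reward(True, cir_score=0.5, sr_score=0.0))
```

With this shape, a rollout that reaches the right answer through reasoning that is neither causally important nor sufficient earns only the outcome term, which is the behavior the auxiliary rewards are meant to discourage.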

Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning from Verifiable Rewards

Chain-of-Thought Reasoning

Causal Importance

Sufficiency of Reasoning

Language Model Post-Training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Causal Importance of Reasoning

Sufficiency of Reasoning

Reinforcement Learning from Verifiable Rewards