Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning

📅 2026-04-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Reinforcement learning for reasoning tasks suffers from sparse outcome supervision and from difficult credit assignment across intermediate steps, while existing process supervision depends on externally constructed annotations that are costly and hard to scale. This work proposes a paradigm termed “supervision internalization”: the model uses self-reflection to identify and correct its own failed reasoning trajectories, and derives fine-grained, process-level supervision signals from outcome feedback alone, with no external annotations. The approach is intended to enable more precise credit assignment and to improve both policy training efficiency and reasoning performance, offering a scalable path toward fine-grained reinforcement learning for complex reasoning tasks.

📝 Abstract

The central challenge of reinforcement learning for reasoning lies not only in the sparsity of outcome-level supervision, but more fundamentally in how to transform feedback provided only at the end of a sequence into fine-grained learning signals that can guide intermediate reasoning steps. Existing approaches either rely on outcome-level rewards for sequence-level optimization, which makes precise credit assignment difficult, or depend on externally constructed process supervision, which is costly and difficult to scale sustainably. To address this, we propose a new perspective: reinforcement learning for reasoning can be understood as the problem of internalizing outcome supervision into process supervision. From this perspective, we introduce a supervision-internalization method for reinforcement learning for reasoning, enabling the model to automatically extract process-level learning signals through identifying, correcting, and reusing failed reasoning trajectories, thereby achieving finer-grained policy optimization under outcome-only supervision. We further abstract this idea into a new training paradigm, in which the model continually generates and refines its own internal process supervision during reinforcement learning, opening a new path for fine-grained credit assignment in reinforcement learning for reasoning that differs from externally provided process supervision.
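
The abstract does not spell out how a failed trajectory and its correction are turned into step-level signals, so the following is only a minimal illustrative sketch of one way that could work, not the authors' method. It assumes a trajectory is a list of step strings, that a corrected trajectory reaching the right answer is already available, and that blame is placed on the first step the correction changed. The names `first_divergence`, `internalized_step_rewards`, and `penalty` are hypothetical.

```python
from typing import List


def first_divergence(failed: List[str], corrected: List[str]) -> int:
    """Return the index of the first step at which the two trajectories differ."""
    for i, (a, b) in enumerate(zip(failed, corrected)):
        if a != b:
            return i
    # One trajectory is a prefix of the other (or they are identical).
    return min(len(failed), len(corrected))


def internalized_step_rewards(
    failed: List[str], corrected: List[str], penalty: float = -1.0
) -> List[float]:
    """Assign a per-step reward to a failed trajectory (illustrative assumption).

    Steps shared with the corrected trajectory are not blamed (0.0), the first
    step that the correction changed receives `penalty`, and later steps are
    left at 0.0 because this sketch has no evidence about them.
    """
    k = first_divergence(failed, corrected)
    return [penalty if i == k else 0.0 for i in range(len(failed))]


# Toy example: the outcome signal only says the final answer was wrong,
# but comparing against the correction localizes the error to step 1.
failed = ["x = 3 + 4", "x = 8", "answer = 2 * x = 16"]
corrected = ["x = 3 + 4", "x = 7", "answer = 2 * x = 14"]

rewards = internalized_step_rewards(failed, corrected)
print(rewards)  # [0.0, -1.0, 0.0]
```

In this toy case a sequence-level reward would penalize all three steps equally, whereas the step-level signal singles out the arithmetic slip. How the correction is produced and how such signals enter the policy update are left open here, since the abstract does not specify them.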

Problem

Research questions and friction points this paper is trying to address.

reinforcement learning

reasoning

outcome supervision

process supervision

credit assignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

process supervision

outcome supervision

reinforcement learning for reasoning