Do We Need to Verify Step by Step? Rethinking Process Supervision from a Theoretical Perspective

📅 2025-02-14
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
A long-standing debate in reinforcement learning for large language models concerns the relative statistical difficulty of process supervision versus outcome supervision, with conventional wisdom favoring the former. Method: The paper analyzes the statistical complexity of both paradigms under standard data coverage assumptions and introduces the Change of Trajectory Measure Lemma, a technical tool that links return-based trajectory measures to step-level distribution shift. It further shows that, given access to a verifier or rollout capability, any policy's advantage function can serve as an optimal process reward model. Results: The work establishes that outcome supervision is no more statistically difficult than process supervision, up to polynomial factors in the horizon, suggesting that any observed performance gap stems from algorithmic limitations rather than inherent statistical hardness. By unifying the theoretical foundations of both paradigms, it provides justification for lightweight outcome supervision and challenges the prevailing reliance on fine-grained process annotations.
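For concreteness, the advantage-as-process-reward claim can be written in standard finite-horizon MDP notation. The sketch below is an informal rendering under the usual definitions of the value, action-value, and advantage of a policy; the symbols and conditions are assumptions of this sketch, and the paper's formal statement should be taken from the paper itself.

```latex
% Informal sketch (assumed standard finite-horizon MDP notation; not the
% paper's exact formalism): value, action-value, and advantage of a policy pi.
\begin{align*}
  Q^{\pi}_h(s, a) &= \mathbb{E}^{\pi}\Big[\textstyle\sum_{t=h}^{H} r_t \,\Big|\, s_h = s,\ a_h = a\Big], \\
  V^{\pi}_h(s)    &= \mathbb{E}_{a \sim \pi_h(\cdot \mid s)}\big[Q^{\pi}_h(s, a)\big], \\
  A^{\pi}_h(s, a) &= Q^{\pi}_h(s, a) - V^{\pi}_h(s).
\end{align*}
% The claim, informally: defining the step-level reward as the advantage,
\[
  r^{\mathrm{proc}}_h(s, a) \;:=\; A^{\pi}_h(s, a),
\]
% is claimed to give an optimal process reward model, in the paper's sense,
% when a verifier or rollout capability is available to estimate it.
```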

📝 Abstract
As large language models have evolved, it has become crucial to distinguish between process supervision and outcome supervision -- two key reinforcement learning approaches to complex reasoning tasks. While process supervision offers intuitive advantages for long-term credit assignment, the precise relationship between these paradigms has remained an open question. Conventional wisdom suggests that outcome supervision is fundamentally more challenging due to the trajectory-level coverage problem, leading to significant investment in collecting fine-grained process supervision data. In this paper, we take steps towards resolving this debate. Our main theorem shows that, under standard data coverage assumptions, reinforcement learning through outcome supervision is no more statistically difficult than through process supervision, up to polynomial factors in horizon. At the core of this result lies the novel Change of Trajectory Measure Lemma -- a technical tool that bridges return-based trajectory measure and step-level distribution shift. Furthermore, for settings with access to a verifier or a rollout capability, we prove that any policy's advantage function can serve as an optimal process reward model, providing a direct connection between outcome and process supervision. These findings suggest that the empirically observed performance gap -- if any -- between outcome and process supervision likely stems from algorithmic limitations rather than inherent statistical difficulties, potentially transforming how we approach data collection and algorithm design for reinforcement learning.
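To make the rollout-based reading of this result concrete, here is a minimal Monte-Carlo sketch: the advantage of a candidate reasoning step under a fixed policy is estimated by comparing outcome-verified rollouts that continue from the prefix plus the step against rollouts from the prefix alone. The functions `rollout` and `verify_outcome` are hypothetical stand-ins (not from the paper or any specific library), and this is an illustrative estimator rather than the paper's algorithm.

```python
from typing import Callable, List


def mc_process_reward(
    prefix: List[str],
    step: str,
    rollout: Callable[[List[str]], List[str]],    # hypothetical: samples a completion from the policy
    verify_outcome: Callable[[List[str]], float],  # hypothetical: 1.0 if the final answer is correct, else 0.0
    n_samples: int = 16,
) -> float:
    """Monte-Carlo estimate of the policy's advantage for `step`, used as a process reward.

    advantage ~= E[outcome | prefix + step] - E[outcome | prefix],
    with both expectations taken over continuations sampled from the same policy.
    """
    def mean_outcome(context: List[str]) -> float:
        total = 0.0
        for _ in range(n_samples):
            completion = rollout(context)
            total += verify_outcome(context + completion)
        return total / n_samples

    q_estimate = mean_outcome(prefix + [step])  # estimated value after committing to this step
    v_estimate = mean_outcome(prefix)           # baseline value of the prefix alone
    return q_estimate - v_estimate
```

In this reading, only an outcome verifier and the ability to sample rollouts are needed to obtain step-level signal, which is one way to interpret the claim that outcome supervision is not statistically harder than process supervision.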
Problem

Research questions and friction points this paper is trying to address.

Distinguishing process and outcome supervision in RL
Comparing statistical difficulty of supervision methods
Linking policy advantage to optimal process reward
Innovation

Methods, ideas, or system contributions that make the work stand out.

Outcome supervision statistically no harder than process supervision (up to polynomial horizon factors)
Change of Trajectory Measure Lemma (see the sketch after this list)
Policy's advantage as process reward model
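The following is not the paper's lemma, but a standard identity that illustrates why trajectory-level coverage looks harder than step-level coverage at first glance: for Markov policies acting in the same environment, the trajectory density ratio factorizes into per-step action-probability ratios, so a naive trajectory-level bound can grow exponentially in the horizon. The Change of Trajectory Measure Lemma is described as bridging return-based trajectory measure and step-level distribution shift, which is what removes this apparent gap up to polynomial factors; the exact statement should be taken from the paper.

```latex
% Standard factorization (not the paper's lemma): for a trajectory
% tau = (s_1, a_1, ..., s_H, a_H) generated by policies pi and mu in the
% same MDP, the dynamics terms cancel and
\[
  \frac{\mathrm{d}\,\mathbb{P}^{\pi}(\tau)}{\mathrm{d}\,\mathbb{P}^{\mu}(\tau)}
  \;=\;
  \prod_{h=1}^{H} \frac{\pi_h(a_h \mid s_h)}{\mu_h(a_h \mid s_h)},
\]
% so a naive trajectory-level concentrability coefficient can be
% exponential in H even when every per-step ratio is bounded.
```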