Survive or Collapse: The Asymmetric Roles of Data Gating and Reward Grounding in Self-Play RL

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the persistent instability and frequent training collapse in self-play reinforcement learning, which is usually attributed to reward design without a clear mechanistic account. By decoupling the data-level gate from the reward signal, the work shows that gating is the binding constraint on stability: under a strict gate, training stays stable for every reward variant tested, including a self-consistency reward with no access to ground-truth labels, whereas no tested reward variant prevents collapse once the gate is removed. The paper introduces the “Grounded Proposer Paradox” to illustrate the asymmetry: a proposer with ground-truth access, paired with a self-consistency solver, collapses faster than an ungrounded one. Controlled experiments on a Python output-prediction task and a deterministic-DSL twin task support these findings, and replacing the binary gate with a continuous strictness parameter ε reveals a two-stage phase transition in which training-side metrics decouple at low ε while validation accuracy holds until ε is much higher.

📝 Abstract

Self-play reinforcement learning trains language models on their own generated tasks, co-evolving a proposer and solver without human labels. Recent systems report strong reasoning gains, but collapse and instability are widely observed and poorly understood. The dominant response treats this as a reward-design problem. We argue instead that self-play stability is governed by two distinct levers: a data-level gate that decides which proposer-generated tasks enter the training pool, and the reward signal that updates the policy on tasks already admitted. Through controlled experiments on a Python output-prediction task and a deterministic-DSL twin task that strips pretraining priors, output ambiguity, and executor noise, we find the two levers are asymmetric. A strict gate is sufficient for stability under every reward variant we test, including a self-consistency reward with no access to ground truth, whereas no reward variant is sufficient once the gate is removed. This asymmetry exposes a counter-intuitive coupling we call the Grounded Proposer Paradox: a proposer with ground-truth access drives collapse faster than an ungrounded one when paired with a self-consistency solver, by concentrating training on clean tasks that form the fastest path to a spurious self-consistent attractor. Replacing the binary gate with a continuous strictness parameter $\varepsilon$ further reveals a two-stage phase transition: training-side metrics decouple at low $\varepsilon$, while validation accuracy holds until $\varepsilon$ is much higher. Data-level gating, not reward calibration, is the binding constraint on self-play stability.
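
The two levers described in the abstract can be made concrete with a small sketch. The code below is illustrative only and is not the paper's implementation: the function names (`passes_gate`, `admit`, `grounded_rewards`, `self_consistency_rewards`), the validity check (the task must execute and return the same output twice), and the reading of ε as the probability that an invalid task leaks into the pool are all assumptions made for this example. The self-consistency reward is written as a plain majority vote over solver samples, which is one common way to define it.

```python
import random
from collections import Counter
from typing import Callable, List, Optional

# Hypothetical executor type: returns the task's output, or None if it fails.
Executor = Callable[[str], Optional[str]]


def passes_gate(task: str, executor: Executor) -> bool:
    """Assumed validity check: the task runs and gives the same output twice."""
    first = executor(task)
    second = executor(task)
    return first is not None and first == second


def admit(task: str, executor: Executor, epsilon: float, rng: random.Random) -> bool:
    """Data-level gate with an assumed leak rate epsilon.

    Valid tasks are always admitted. Invalid tasks are admitted with
    probability epsilon, so epsilon = 0.0 is a strict gate and
    epsilon = 1.0 is no gate at all.
    """
    if passes_gate(task, executor):
        return True
    return rng.random() < epsilon


def grounded_rewards(answers: List[str], truth: str) -> List[float]:
    """Reward 1.0 for each solver answer that matches the executor's output."""
    return [1.0 if a == truth else 0.0 for a in answers]


def self_consistency_rewards(answers: List[str]) -> List[float]:
    """Reward 1.0 for each answer that matches the majority answer.

    No ground truth is consulted, so a confidently wrong majority is rewarded.
    """
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


# Toy executor for demonstration: only "print(1+2)" is a valid task.
def toy_executor(task: str) -> Optional[str]:
    return {"print(1+2)": "3"}.get(task)


rng = random.Random(0)
strict_pool = [t for t in ["print(1+2)", "garbage"] if admit(t, toy_executor, 0.0, rng)]
open_pool = [t for t in ["print(1+2)", "garbage"] if admit(t, toy_executor, 1.0, rng)]

samples = ["7", "7", "3"]  # two of three solver samples agree on a wrong answer
sc = self_consistency_rewards(samples)
gr = grounded_rewards(samples, "3")
```

In this toy setting the strict gate keeps only the executable task, while the fully open gate admits both. The reward lists also differ on the same samples: the self-consistency reward pays the two agreeing but wrong answers, which is the kind of spurious self-consistent attractor the abstract refers to, while the grounded reward pays only the correct one.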

Problem

Research questions and friction points this paper is trying to address.

self-play reinforcement learning

training collapse

stability

data gating

reward grounding

Innovation

Methods, ideas, or system contributions that make the work stand out.

data gating

reward grounding

self-play reinforcement learning