A Predictive Law for On-Policy Self-Distillation From World Feedback

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work examines how to make reinforcement learning-based post-training more scalable by replacing scalar rewards with richer world feedback. It studies On-Policy Self-Distillation (OPSD) and reports a consistent linear relationship between the initial performance gap separating the student from its self-teacher and the final performance improvement. This empirical relationship holds across context types and model families, and it persists across model scale, which suggests a possible basis for new empirical scaling laws. As a result, the outcome of an OPSD configuration can be predicted without running the full training procedure, offering a principled way to treat world feedback as a first-class component of post-training pipelines.

📝 Abstract

Moving beyond simple scalar rewards toward richer world feedback is a natural path to more scalable RL post-training. On-policy self-distillation (OPSD) is a promising recent approach that uses arbitrary feedback as a learning signal, yet its reliability compared to established methods, such as GRPO, remains unclear. We identify a strikingly consistent linear correlation between the initial student-self-teacher performance gap and the final performance improvement in OPSD. This relationship holds across context types and model families, providing a powerful predictive law for anticipating the outcome of an OPSD configuration without running the full training procedure. Interestingly, we show that this linear predictability holds with model scale, suggesting a potential basis for new empirical scaling laws on larger models with stronger in-context learning capabilities. In essence, our findings show that OPSD performance can be predicted and tuned before training, offering a principled way to incorporate world feedback as a first-class component of the post-training pipeline.
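To make the idea of a linear predictive law concrete, the sketch below fits a straight line to a handful of (initial gap, final improvement) pairs and uses it to estimate the improvement for a new configuration. It is only an illustration of the general technique, ordinary least squares on one variable. The abstract does not give the fitting procedure, coefficients, or data, so the function names `fit_line` and `predict_gain` and all numbers are hypothetical placeholders, not results from the paper.

```python
from statistics import mean


def fit_line(gaps, gains):
    """Ordinary least-squares fit of gains ~ slope * gaps + intercept.

    gaps  : initial student/self-teacher performance gaps (hypothetical inputs)
    gains : final performance improvements after OPSD (hypothetical inputs)
    """
    if len(gaps) != len(gains) or len(gaps) < 2:
        raise ValueError("need at least two paired observations")
    gap_mean, gain_mean = mean(gaps), mean(gains)
    # Sum of squared deviations of the predictor.
    sxx = sum((g - gap_mean) ** 2 for g in gaps)
    if sxx == 0:
        raise ValueError("gaps must not all be identical")
    # Sum of cross-deviations between predictor and response.
    sxy = sum((g - gap_mean) * (d - gain_mean) for g, d in zip(gaps, gains))
    slope = sxy / sxx
    intercept = gain_mean - slope * gap_mean
    return slope, intercept


def predict_gain(gap, slope, intercept):
    """Estimate the final improvement for a configuration with the given initial gap."""
    return slope * gap + intercept


# Placeholder values for illustration only; these are NOT numbers from the paper.
pilot_gaps = [0.05, 0.10, 0.20, 0.30]
pilot_gains = [0.02, 0.04, 0.08, 0.12]

slope, intercept = fit_line(pilot_gaps, pilot_gains)
estimated_gain = predict_gain(0.15, slope, intercept)
print(f"slope={slope:.3f} intercept={intercept:.3f} estimated gain={estimated_gain:.3f}")
```

In practice, the initial gap would presumably be measured by evaluating the student with and without the feedback context before any training, and the fitted line would then be used to rank candidate OPSD configurations. That workflow is an assumption drawn from the abstract's description rather than a documented procedure.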

Problem

Research questions and friction points this paper is trying to address.

on-policy self-distillation

world feedback

reinforcement learning

performance prediction

post-training

Innovation

Methods, ideas, or system contributions that make the work stand out.

on-policy self-distillation

world feedback

predictive law