Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR

📅 2026-05-11
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses a key limitation of conventional self-distillation: it suppresses a student model's autonomous reasoning by overriding its outputs even when they are correct. To overcome this, the authors propose RLRT, a method that treats information asymmetry as a design principle. By inverting teacher signals along correct reasoning paths, RLRT reinforces the student's self-generated correct tokens, encouraging valuable exploration rather than indiscriminate diversity. The approach combines GRPO-based reinforcement learning, self-distillation, and a reversed teacher signal for post-training large language models. Experiments across base, instruction-tuned, and thinking-tuned Qwen3 checkpoints show that RLRT outperforms standard self-distillation and existing exploration-based baselines.
📝 Abstract
Self-distillation has emerged as a powerful framework for post-training LLMs, where a teacher conditioned on extra information guides a student without it, both from the same model. While this guidance is useful when the student has failed, on successful rollouts the same mechanism instead overwrites the student's choices and suppresses its own reasoning. Therefore, we propose reading the original self-distillation signal in reverse: when the student succeeds along a path the teacher would not have predicted, these tokens reflect its self-driven reasoning. Building on this, we propose RLRT (RLVR with Reversed Teacher), which augments GRPO by reinforcing these tokens on correct rollouts. We interpret this as a new form of exploration in RLVR: not uniform diversity, but valuable exploration grounded in the student's own success. Across base, instruction-tuned, and thinking-tuned Qwen3 checkpoints, RLRT substantially outperforms self-distillation and exploration-based baselines, establishing information asymmetry as a new, principled design axis for RLVR.
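The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formula, so the bonus form below (a clipped log-probability gap between student and teacher, scaled by a hypothetical coefficient `beta`) and all function names are assumptions made for illustration. It only shows the general shape: group-normalized GRPO advantages, with extra weight on tokens of correct rollouts that the teacher found less likely than the student did.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages in the style of GRPO: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def reversed_teacher_weights(student_logps, teacher_logps, beta=0.5):
    """Assumed weighting: 1 + beta * max(0, student_logp - teacher_logp) per token.

    A token gets extra weight only when the teacher assigned it a lower
    log-probability than the student did (the "reversed" distillation signal).
    """
    return [1.0 + beta * max(0.0, s - t)
            for s, t in zip(student_logps, teacher_logps)]


def token_advantages(rewards, student_logps, teacher_logps, beta=0.5):
    """Per-token advantages: plain GRPO on failures, reweighted on successes."""
    advantages = grpo_advantages(rewards)
    result = []
    for r, adv, s_lp, t_lp in zip(rewards, advantages, student_logps, teacher_logps):
        if r > 0:
            # Correct rollout: amplify tokens the teacher would not have predicted.
            weights = reversed_teacher_weights(s_lp, t_lp, beta)
        else:
            # Incorrect rollout: leave the GRPO advantage unchanged.
            weights = [1.0] * len(s_lp)
        result.append([adv * w for w in weights])
    return result


if __name__ == "__main__":
    rewards = [1.0, 0.0]                      # one correct, one incorrect rollout
    student = [[-0.1, -0.2], [-0.3]]          # per-token student log-probs
    teacher = [[-1.1, -0.1], [-0.5]]          # per-token teacher log-probs
    print(token_advantages(rewards, student, teacher))
```

In this toy group, the first token of the correct rollout is one the teacher rated much less likely than the student, so it receives a larger advantage than the second token; the incorrect rollout keeps its ordinary negative advantage.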
Problem

Research questions and friction points this paper is trying to address.

self-distillation
reasoning suppression
teacher-student asymmetry
LLM post-training
autonomous reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-distillation
reinforcement learning
reasoning exploration
information asymmetry
RLVR
🔎 Similar Papers