AI Summary
This work addresses a key limitation of conventional self-distillation, which suppresses a student model's autonomous reasoning by forcibly overriding its outputs even when they are correct. To overcome this, the authors propose RLRT, a novel method that leverages information asymmetry as a design principle. By inverting teacher signals along correct reasoning paths, RLRT reinforces the student's self-generated correct tokens, thereby encouraging valuable exploration rather than indiscriminate diversity. The approach integrates GRPO-based reinforcement learning, self-distillation, and a reversed teacher signal mechanism for post-training optimization of large language models. Extensive experiments across multiple checkpoints of the Qwen3 series demonstrate that RLRT consistently outperforms standard self-distillation and existing exploration strategies, confirming its effectiveness and generalizability.
Abstract
Self-distillation has emerged as a powerful framework for post-training LLMs, in which a teacher conditioned on extra information guides a student that lacks it, with both roles played by the same model. While this guidance is useful when the student has failed, on successful rollouts the same mechanism instead overwrites the student's choices and suppresses its own reasoning. Therefore, we propose reading the original self-distillation signal in reverse: when the student succeeds along a path the teacher would not have predicted, these tokens reflect its self-driven reasoning. Building on this, we propose RLRT (RLVR with Reversed Teacher), which augments GRPO by reinforcing these tokens on correct rollouts. We interpret this as a new form of exploration in RLVR: not uniform diversity, but valuable exploration grounded in the student's own success. Across base, instruction-tuned, and thinking-tuned Qwen3 checkpoints, RLRT substantially outperforms self-distillation and exploration-based baselines, establishing information asymmetry as a new, principled design axis for RLVR.
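To make the reversed-teacher idea concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not give the exact formula, so everything specific here is an assumption: the bonus is taken to be the positive part of the student-minus-teacher log-probability gap on each sampled token, it is added to a standard group-normalized GRPO advantage only on correct rollouts, and the names `reversed_teacher_bonus`, `rlrt_token_advantages`, and the weight `beta` are hypothetical.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def reversed_teacher_bonus(student_logps, teacher_logps):
    """Per-token bonus (assumed form): positive where the student's sampled
    token was less likely under the teacher than under the student."""
    return [max(0.0, s - t) for s, t in zip(student_logps, teacher_logps)]


def rlrt_token_advantages(rewards, student_logps, teacher_logps, beta=0.1):
    """Token-level advantages for one group of rollouts.

    rewards:        one verifiable reward per rollout (assumed > 0 means correct)
    student_logps:  per-rollout lists of log-probs of the sampled tokens
                    under the student (no extra information)
    teacher_logps:  the same tokens scored by the teacher (same model,
                    conditioned on extra information)
    beta:           hypothetical weight on the reversed-teacher bonus
    """
    advantages = grpo_advantages(rewards)
    token_advantages = []
    for r, adv, s_lp, t_lp in zip(rewards, advantages, student_logps, teacher_logps):
        if r > 0:
            # Correct rollout: reinforce tokens the teacher would not have predicted.
            bonus = reversed_teacher_bonus(s_lp, t_lp)
            token_advantages.append([adv + beta * b for b in bonus])
        else:
            # Failed rollout: plain GRPO advantage, no reversed signal.
            token_advantages.append([adv] * len(s_lp))
    return token_advantages


# Toy group: rollout 0 is correct, rollout 1 is not.
rewards = [1.0, 0.0]
student_logps = [[-0.1, -2.0], [-0.5, -0.5]]
teacher_logps = [[-0.1, -4.0], [-3.0, -0.5]]
token_adv = rlrt_token_advantages(rewards, student_logps, teacher_logps)
```

In this toy group, only the second token of the correct rollout receives an extra boost, because it is the only token on a successful path that the student favored more than the teacher did. The failed rollout is left to the ordinary GRPO signal, which is one way to read "reinforcing these tokens on correct rollouts".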