Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a key limitation of conventional self-distillation: on rollouts where the student is already correct, the teacher signal still overrides the student's outputs and suppresses its autonomous reasoning. To overcome this, the authors propose RLRT (RLVR with Reversed Teacher), a method that treats the information asymmetry between teacher and student as a design principle. By inverting the teacher signal along correct reasoning paths, RLRT reinforces the tokens the student generated correctly on its own, encouraging valuable exploration rather than indiscriminate diversity. The approach combines GRPO-based reinforcement learning, self-distillation, and the reversed teacher signal for post-training of large language models. In experiments across base, instruction-tuned, and thinking-tuned Qwen3 checkpoints, RLRT outperforms standard self-distillation and existing exploration-based baselines.

📝 Abstract

Self-distillation has emerged as a powerful framework for post-training LLMs, where a teacher conditioned on extra information guides a student without it, both from the same model. While this guidance is useful when the student has failed, on successful rollouts, the same mechanism instead overwrites the student's choices and suppresses its own reasoning. Therefore, we propose reading the original self-distillation signal in reverse: when the student succeeds along a path the teacher would not have predicted, these tokens reflect its self-driven reasoning. Building on this, we propose RLRT (RLVR with Reversed Teacher), which augments GRPO by reinforcing these tokens on correct rollouts. We interpret this as a new form of exploration in RLVR: not uniform diversity, but valuable exploration grounded in the student's own success. Across base, instruction-tuned, and thinking-tuned Qwen3 checkpoints, RLRT substantially outperforms self-distillation and exploration-based baselines, establishing information asymmetry as a new, principled design axis for RLVR.
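
The idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact objective, so everything here is an assumption made for illustration: the function names (`group_advantages`, `reversed_teacher_bonus`, `token_advantages`), the weight `beta`, the rule that a rollout counts as correct when its reward is positive, and the choice to clip the student-minus-teacher log-probability gap at zero. It shows only the general shape: GRPO-style group-normalized advantages, plus a per-token bonus on correct rollouts for tokens the student favored more than the teacher did.

```python
import math
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: standardize rewards within one prompt's rollout group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def reversed_teacher_bonus(
    student_logp: List[float],
    teacher_logp: List[float],
    correct: bool,
    beta: float = 0.1,
) -> List[float]:
    """Hypothetical per-token bonus (an assumption, not the paper's formula).

    Ordinary self-distillation pulls the student toward the teacher. Here the
    sign is reversed: on correct rollouts, tokens the student assigned a higher
    log-probability than the teacher get a positive bonus. Incorrect rollouts
    get no bonus.
    """
    if not correct:
        return [0.0] * len(student_logp)
    return [beta * max(0.0, s - t) for s, t in zip(student_logp, teacher_logp)]


def token_advantages(
    rewards: List[float],
    student_logps: List[List[float]],
    teacher_logps: List[List[float]],
    beta: float = 0.1,
) -> List[List[float]]:
    """Combine the sequence-level advantage with the per-token reversed-teacher bonus."""
    advs = group_advantages(rewards)
    out = []
    for adv, r, s, t in zip(advs, rewards, student_logps, teacher_logps):
        # Assumption: a rollout is "correct" when its verifiable reward is positive.
        bonus = reversed_teacher_bonus(s, t, correct=(r > 0), beta=beta)
        out.append([adv + b for b in bonus])
    return out


if __name__ == "__main__":
    rewards = [1.0, 0.0]  # one correct and one incorrect rollout
    student = [[-0.1, -2.0], [-0.5, -0.5]]
    teacher = [[-1.1, -0.5], [-3.0, -3.0]]
    for row in token_advantages(rewards, student, teacher):
        print(row)
```

In this toy input, only the first token of the correct rollout receives a bonus, because it is the only token on a successful path where the student was more confident than the teacher. The incorrect rollout keeps its plain group-normalized advantage.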

Problem

Research questions and friction points this paper is trying to address.

self-distillation

reasoning suppression

teacher-student asymmetry

LLM post-training

autonomous reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-distillation

reinforcement learning

reasoning exploration