🤖 AI Summary
This work addresses learning in sparse-reward environments, where large language models struggle to acquire effective policies through environmental interaction. Existing self-distillation approaches rely too heavily on successful trajectories and neglect the information carried by failure feedback. To overcome this limitation, the authors propose Reflection-Enhanced Self-Distillation (RESD), a framework that turns failed trajectories into token-level supervision signals. RESD generates retrospective reflections to diagnose local errors and maintains a persistent global playbook that carries reusable lessons across training steps, so the self-teacher can provide fine-grained supervision even when no rollout succeeds. Experiments on multiple continual learning tasks show that RESD substantially outperforms standard self-distillation baselines. With only a single rollout per prompt, it also improves markedly faster in early training than GRPO given 8× as many samples.
📝 Abstract
Enabling Large Language Models (LLMs) to continuously improve from environmental interactions is a central challenge in post-training. While on-policy self-distillation offers a promising paradigm, existing methods predominantly treat environmental feedback as a passive conditioning signal. Consequently, they rely heavily on successful demonstrations and struggle to learn in rare-success regimes. To bridge this gap, we introduce Reflection-Enhanced Self-Distillation (RESD), a framework that transforms raw failure feedback into an active source of corrective supervision. Instead of passively appending feedback, RESD interprets failed trajectories by generating retrospective reflections to diagnose local errors, and curates a persistent global playbook to preserve reusable lessons across training steps. The enriched context enables the self-teacher to provide actionable token-level supervision even in the absence of successful rollouts. Empirical evaluations on multiple continual learning tasks demonstrate that RESD substantially outperforms standard self-distillation baselines. Furthermore, using only a single rollout per prompt, RESD achieves significantly faster early-stage improvement than GRPO trained with $8\times$ as many samples, highlighting its superior interaction efficiency.
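
The pipeline the abstract describes (reflection on a failed rollout, a persistent playbook, and token-level supervision from a self-teacher that sees the enriched context) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the names `Playbook`, `build_teacher_context`, and `token_level_kl` are hypothetical, the prompt layout is invented, and the use of a per-token KL(teacher || student) as the distillation signal is an assumption, since the abstract does not specify the loss.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Playbook:
    """Hypothetical persistent store of reusable lessons (assumed design)."""
    lessons: List[str] = field(default_factory=list)
    max_size: int = 8  # assumed cap; the abstract gives no size

    def add(self, lesson: str) -> None:
        # Skip empty or duplicate lessons, then keep only the newest entries.
        if lesson and lesson not in self.lessons:
            self.lessons.append(lesson)
            self.lessons = self.lessons[-self.max_size:]


def build_teacher_context(prompt: str, feedback: str,
                          reflection: str, playbook: Playbook) -> str:
    """Assemble the enriched context the self-teacher conditions on.

    The section layout here is an assumption made for illustration.
    """
    lessons = "\n".join(f"- {item}" for item in playbook.lessons)
    return (
        f"Task:\n{prompt}\n\n"
        f"Environment feedback on the failed attempt:\n{feedback}\n\n"
        f"Reflection (local error diagnosis):\n{reflection}\n\n"
        f"Playbook (lessons from earlier steps):\n{lessons}"
    )


def token_level_kl(teacher_probs: List[List[float]],
                   student_probs: List[List[float]]) -> List[float]:
    """Per-token KL(teacher || student) over a shared vocabulary.

    Each inner list is one token position's probability distribution.
    Student probabilities must be strictly positive wherever the teacher's are.
    """
    return [
        sum(p * math.log(p / q) for p, q in zip(t, s) if p > 0.0)
        for t, s in zip(teacher_probs, student_probs)
    ]


if __name__ == "__main__":
    playbook = Playbook()
    playbook.add("Check the tool's argument order before calling it.")
    context = build_teacher_context(
        prompt="Sort the list using the provided tool.",
        feedback="Error: unexpected argument 'key'.",
        reflection="The call passed 'key' but the tool only accepts 'items'.",
        playbook=playbook,
    )
    print(context)
    # Toy distributions over a two-token vocabulary at two positions.
    print(token_level_kl([[0.9, 0.1], [0.5, 0.5]], [[0.6, 0.4], [0.5, 0.5]]))
```

In this sketch the student sees only the original prompt, while the teacher is the same model conditioned on the enriched context, so the per-token divergence is nonzero even when no rollout in the batch succeeded.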