Self-Distilled Agentic Reinforcement Learning

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of reinforcement learning in long-horizon, multi-turn agent tasks: rewards are given only at the trajectory level, which offers little guidance for fine-grained, token-level decisions. Applying existing self-distillation approaches to this setting runs into two problems: unreliable negative teacher rejections and instability that compounds across turns during training. To overcome these limitations, the authors propose the SDAR framework, which adds on-policy self-distillation as a gated auxiliary objective on top of the GRPO algorithm. A sigmoid gate, computed from detached token-level signals, modulates the strength of token-level distillation, amplifying reliable positive signals while softly suppressing unreliable negative ones. Evaluated on ALFWorld, WebShop, and Search-QA, SDAR outperforms the GRPO baseline by 9.4%, 10.2%, and 7.0%, respectively, avoids the training instability of naive GRPO+OPSD, and consistently surpasses existing hybrid RL–OPSD methods across model scales.

📝 Abstract

Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher branch augmented with privileged context. However, transferring OPSD to multi-turn agents proves problematic: compounding multi-turn instability destabilizes supervision, while skill-conditioned privileged guidance requires asymmetric treatment, because negative teacher rejections may arise from imperfect skill retrieval or utilization. We introduce SDAR (Self-Distilled Agentic Reinforcement Learning), which treats OPSD as a gated auxiliary objective while keeping RL as the primary optimization backbone. SDAR maps detached token-level signals into a sigmoid gate, strengthening distillation on teacher-endorsed positive-gap tokens and softly attenuating negative teacher rejections. Across the Qwen2.5 and Qwen3 families on ALFWorld, WebShop, and Search-QA, SDAR substantially improves over GRPO (+9.4% on ALFWorld, +7.0% on Search-QA, +10.2% on WebShop-Acc), avoids the instability of naive GRPO+OPSD, and consistently outperforms hybrid RL–OPSD baselines across model scales.
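
The gating idea described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not give the exact gate input or loss form, so the sketch assumes the token-level signal is the log-probability gap between teacher and student on each sampled token, and that the distillation term is a policy-gradient-style surrogate. The names `gated_distill_weights`, `combined_loss`, `sharpness`, and `lam` are all hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_distill_weights(student_logp, teacher_logp, sharpness=1.0):
    """Per-token distillation weights under an ASSUMED gate form.

    student_logp / teacher_logp: log-probabilities each branch assigns to
    the tokens the student actually sampled. In a real training loop both
    the gap and the gate would be detached (treated as constants).
    """
    weights = []
    for s, t in zip(student_logp, teacher_logp):
        gap = t - s                      # > 0: teacher endorses the token
        gate = sigmoid(sharpness * gap)  # near 1 for positive gaps, near 0 for negative
        weights.append(gate * gap)       # negative rejections are softly attenuated
    return weights


def combined_loss(rl_loss, student_logp, teacher_logp, lam=0.1):
    """RL loss stays primary; gated distillation is an auxiliary term.

    `rl_loss` stands in for a GRPO loss computed elsewhere; `lam` is a
    hypothetical mixing coefficient, not a value from the paper.
    """
    weights = gated_distill_weights(student_logp, teacher_logp)
    # Surrogate: raise log-prob of tokens with positive weight, lower the rest.
    distill = -sum(w * s for w, s in zip(weights, student_logp)) / len(weights)
    return rl_loss + lam * distill
```

With a gap of +2 and a gap of -2, the gate leaves the first weight large and positive while shrinking the second toward zero, which is the asymmetric treatment the abstract describes: endorsed tokens are reinforced strongly, and teacher rejections only weakly penalized.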

Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning

Self-Distillation

Multi-turn Agents

Token-level Supervision

Training Instability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Distillation

Reinforcement Learning

Token-level Supervision