Self-Distilled Agentic Reinforcement Learning

📅 2026-05-14

🤖 AI Summary
This work addresses a limitation of reinforcement learning in long-horizon, multi-turn agent tasks: rewards are assigned only at the trajectory level, which gives little guidance for fine-grained, token-level decisions. Existing self-distillation approaches, when applied to such agents, suffer from negative teacher rejections and from instability that compounds across turns. The authors propose SDAR, a framework that adds on-policy self-distillation as a gated auxiliary objective on top of GRPO. A sigmoid gate computed from detached token-level signals modulates the strength of token-level distillation, strengthening it on teacher-endorsed positive-gap tokens and softly attenuating negative teacher rejections. On ALFWorld, WebShop, and Search-QA, SDAR improves over the GRPO baseline by 9.4%, 10.2% (WebShop-Acc), and 7.0%, respectively, avoids the instability of naive GRPO+OPSD, and consistently outperforms existing hybrid RL-OPSD methods across model scales.
📝 Abstract
Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher branch augmented with privileged context. However, transferring OPSD to multi-turn agents proves problematic: compounding multi-turn instability destabilizes supervision, while skill-conditioned privileged guidance requires asymmetric treatment, because negative teacher rejections may arise from imperfect skill retrieval or utilization. We introduce SDAR (Self-Distilled Agentic Reinforcement Learning), which treats OPSD as a gated auxiliary objective while keeping RL as the primary optimization backbone. SDAR maps detached token-level signals into a sigmoid gate, strengthening distillation on teacher-endorsed positive-gap tokens and softly attenuating negative teacher rejections. Across the Qwen2.5 and Qwen3 families on ALFWorld, WebShop, and Search-QA, SDAR substantially improves over GRPO (+9.4% on ALFWorld, +7.0% on Search-QA, +10.2% on WebShop-Acc), avoids the instability of naive GRPO+OPSD, and consistently outperforms hybrid RL-OPSD baselines across model scales.
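
To make the gating idea in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes the "token-level signal" is the gap between the teacher's and the student's log-probability on each sampled token, and that the gated distillation term is added to the RL loss with a fixed weight. The names token_gates, gated_objective, beta, and lam are hypothetical and do not come from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def token_gates(teacher_logp, student_logp, beta=1.0):
    """Map per-token gaps to gates in (0, 1).

    Assumption: gap = teacher log-prob minus student log-prob on the
    sampled token. In a real training loop the gap would be detached,
    so the gate acts as a constant weight and receives no gradient.
    """
    if len(teacher_logp) != len(student_logp):
        raise ValueError("teacher and student sequences must align")
    return [sigmoid(beta * (t - s)) for t, s in zip(teacher_logp, student_logp)]


def gated_objective(rl_loss, distill_terms, gates, lam=0.1):
    """RL loss plus a gate-weighted mean of per-token distillation terms.

    Assumption: `distill_terms` holds one non-negative distillation loss
    per token (for example a per-token divergence); `lam` is a
    hypothetical mixing weight that keeps RL as the primary objective.
    """
    if not gates or len(gates) != len(distill_terms):
        raise ValueError("need one gate per distillation term")
    gated = sum(g * d for g, d in zip(gates, distill_terms)) / len(gates)
    return rl_loss + lam * gated


if __name__ == "__main__":
    teacher = [-0.2, -1.5, -3.0]   # teacher log-probs on sampled tokens
    student = [-1.0, -1.5, -0.5]   # student log-probs on the same tokens
    gates = token_gates(teacher, student, beta=2.0)
    print(gates)
    print(gated_objective(rl_loss=1.0, distill_terms=[0.8, 0.0, 2.5], gates=gates))
```

Under these assumptions, a token the teacher endorses more strongly than the student (positive gap) gets a gate above 0.5, while a token the teacher rejects (negative gap) is attenuated toward 0 rather than dropped outright, which mirrors the soft asymmetric treatment described above.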
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Self-Distillation
Multi-turn Agents
Token-level Supervision
Training Instability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Distillation
Reinforcement Learning
Token-level Supervision
Gated Auxiliary Objective
Agentic AI
Zhengxi Lu
Zhejiang University
MLLM, Agent
Zhiyuan Yao
Ph.D. in Financial Engineering, Stevens Institute of Technology
Reinforcement Learning, Machine Learning, ML/RL in Financial Trading
Zhuowen Han
Meituan
Zi-Han Wang
Tsinghua University
Jinyang Wu
Tsinghua University
Qi Gu
Meituan
Xunliang Cai
Meituan
Weiming Lu
Zhejiang University
Natural Language Processing, Large Language Models, AGI
Jun Xiao
Zhejiang University
Yueting Zhuang
Zhejiang University
Yongliang Shen
Zhejiang University