Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents

📅 2026-04-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the low sample efficiency of reinforcement learning for multi-turn large language model agents, which stems from sparse rewards and long-horizon dependencies. It also targets two limitations of existing self-distillation methods: they struggle to preserve policy diversity and often suffer from training collapse. The authors propose Skill-SD, a framework that dynamically distills agent trajectories into natural language skill descriptions, which serve as privileged information for a teacher model. A student model then internalizes this guidance via skill-conditioned self-distillation while operating under the original task prompt. The approach introduces an importance-weighted reverse-KL distillation loss and a mechanism that dynamically synchronizes the teacher with the student, enhancing policy diversity and stabilizing training. Evaluated on the AppWorld and Sokoban benchmarks, Skill-SD achieves improvements of 14.0%/10.9% over GRPO and 42.1%/40.6% over OPD.
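The asymmetry between teacher and student inputs can be illustrated with a minimal Python sketch. It is not the authors' implementation: `SkillBank`, `student_prompt`, `teacher_prompt`, the `max_skills` cutoff, and the prompt template are all hypothetical names and choices made for illustration. The only property taken from the summary is that skills condition the teacher alone, while the student sees the plain task prompt.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SkillBank:
    """Hypothetical store of natural-language skills summarized from finished trajectories."""
    skills: List[str] = field(default_factory=list)

    def add(self, skill: str) -> None:
        self.skills.append(skill)


def student_prompt(task: str) -> str:
    # The student always acts under the plain task prompt.
    return task


def teacher_prompt(task: str, bank: SkillBank, max_skills: int = 3) -> str:
    # Assumption: the teacher sees the most recent skills prepended to the task.
    recent = bank.skills[-max_skills:]
    if not recent:
        return task
    skill_block = "\n".join(f"- {s}" for s in recent)
    return f"Skills from earlier attempts:\n{skill_block}\n\nTask:\n{task}"


bank = SkillBank()
bank.add("Check the API docs before calling an unfamiliar endpoint.")
print(teacher_prompt("Book a table for two.", bank))
```

Because the skill text never enters the student's input, anything the student gains from it has to arrive through the distillation signal rather than through the prompt.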

📝 Abstract

Reinforcement learning (RL) has been widely used to train LLM agents for multi-turn interactive tasks, but its sample efficiency is severely limited by sparse rewards and long horizons. On-policy self-distillation (OPSD) alleviates this by providing dense token-level supervision from a privileged teacher that has access to ground-truth answers. However, such fixed privileged information cannot capture the diverse valid strategies in agent tasks, and naively combining OPSD with RL often leads to training collapse. To address these limitations, we introduce Skill-SD, a framework that turns the agent's own trajectories into dynamic training-only supervision. Completed trajectories are summarized into compact natural language skills that describe successful behaviors, mistakes, and workflows. These skills serve as dynamic privileged information conditioning only the teacher, while the student always acts under the plain task prompt and learns to internalize the guidance through distillation. To stabilize the training, we derive an importance-weighted reverse-KL loss to provide gradient-correct token-level distillation, and dynamically synchronize the teacher with the improving student. Experimental results on agentic benchmarks demonstrate that Skill-SD substantially outperforms the standard RL baseline, improving both vanilla GRPO (+14.0%/+10.9% on AppWorld/Sokoban) and vanilla OPD (+42.1%/+40.6%). Project page: https://k1xe.github.io/skill-sd/
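The abstract does not give the loss in closed form, so the following is only a rough Python sketch of what a token-level, importance-weighted reverse-KL term could look like. The function names, the truncated ratio, and the `clip` value are assumptions for illustration, not the paper's derivation. Reverse KL here means KL(student || teacher) over the vocabulary at each token position, and the weight is the ratio between the current policy and the policy that sampled the token.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position."""
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t))


def importance_weight(logp_current, logp_behavior, clip=2.0):
    """Ratio pi_current / pi_behavior for the sampled token, truncated at `clip` (assumption)."""
    return min(math.exp(logp_current - logp_behavior), clip)


def weighted_distill_loss(student_logits, teacher_logits, logp_current, logp_behavior):
    """Mean over token positions of importance weight times reverse KL."""
    terms = [
        importance_weight(c, b) * reverse_kl(s, t)
        for s, t, c, b in zip(student_logits, teacher_logits, logp_current, logp_behavior)
    ]
    return sum(terms) / len(terms)


# Two token positions over a toy 3-word vocabulary.
student = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
teacher = [[0.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
loss = weighted_distill_loss(student, teacher, [-0.3, -0.5], [-0.3, -0.5])
print(loss)
```

In this toy input only the first position contributes, since the student and teacher agree exactly at the second. In a real training loop the teacher logits would come from the skill-conditioned forward pass and be treated as constants when taking gradients.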

Problem

Research questions and friction points this paper is trying to address.

reinforcement learning

sample efficiency

self-distillation

multi-turn LLM agents

sparse rewards

Innovation

Methods, ideas, or system contributions that make the work stand out.

Skill-Conditioned Self-Distillation

Multi-turn LLM Agents

Dynamic Privileged Information