From Correction to Mastery: Reinforced Distillation of Large Language Model Agents

📅 2025-09-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language model (LLM) agents rely on massive backbone models, yet conventional trajectory distillation suffers from compounding errors caused by reasoning and knowledge gaps between teacher and student. To address this, we propose SCoRe, a student-centered corrective distillation framework. Unlike standard approaches, SCoRe lets the student autonomously generate reasoning trajectories, with the teacher intervening only at the first critical error. It further applies short-horizon reinforcement learning that resumes from the verified prefix before that error, improving training stability and generalization. Key technical components include iterative reasoning, precise error localization, reward anchoring, and trajectory-guided policy refinement. Evaluated on 12 challenging benchmarks, a 7B student model distilled with SCoRe matches the agent-level performance of a 72B teacher model, substantially narrowing the capacity gap and enabling efficient capability transfer.

📝 Abstract
Large Language Model agents excel at solving complex tasks through iterative reasoning and tool use, but typically depend on ultra-large, costly backbones. Existing distillation approaches train smaller students to imitate full teacher trajectories, yet reasoning and knowledge gaps between the teacher and student often lead to compounding errors. We propose SCoRe, a student-centered framework in which the student generates trajectories and the teacher intervenes only at the first critical error, producing training data matched to the student's ability and exposing specific weaknesses. The student is first fine-tuned on corrected trajectories. Subsequently, short-horizon reinforcement learning starts from the verified prefix before the first critical error, with target rewards assigned at that step. This design encourages autonomous problem-solving beyond imitation and improves training stability. Notably, on 12 challenging benchmarks, a 7B-parameter student distilled with SCoRe matches the agentic performance of a 72B-parameter teacher.
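The collection step the abstract describes can be sketched as a simple rollout loop: the student acts autonomously, the teacher corrects only the first critical error, and the index of the verified prefix is recorded for the later RL stage. This is a minimal toy illustration, not the paper's implementation; the policies, the error check, and the scalar "state" are all invented for the sketch.

```python
def student_step(state):
    # Toy "student" policy: doubles the value, but makes a deliberate
    # (illustrative) mistake once the state reaches 8.
    return state * 2 if state < 8 else state - 1

def teacher_step(state):
    # Toy "teacher" policy: always takes the correct action.
    return state * 2

def is_critical_error(student_next, teacher_next):
    # Hypothetical error check: the student's step deviates from the teacher's.
    return student_next != teacher_next

def collect_corrected_trajectory(start, horizon=5):
    """Student rolls out on its own; the teacher intervenes only at the
    first critical error, correcting that single step. Returns the
    trajectory and the step index where the verified prefix ends, which
    later anchors the short-horizon RL stage."""
    traj, state, prefix_end = [start], start, None
    for t in range(horizon):
        s_next, t_next = student_step(state), teacher_step(state)
        if prefix_end is None and is_critical_error(s_next, t_next):
            prefix_end = t   # verified prefix ends just before this step
            state = t_next   # teacher's one-time correction
        else:
            state = s_next   # student continues autonomously
        traj.append(state)
    return traj, prefix_end
```

Note that after the single correction the student resumes control, so the data stays matched to the student's own ability rather than to full teacher trajectories.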
Problem

Research questions and friction points this paper is trying to address.

Distilling smaller agents from large costly models
Addressing compounding errors in imitation learning
Improving autonomous problem-solving beyond teacher imitation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Student-centered distillation with teacher intervention
Fine-tuning on corrected trajectories for improvement
Short-horizon reinforcement learning from verified prefixes
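The last point, taken at face value, amounts to computing learning signal only over the short window that starts at the verified prefix boundary. A minimal sketch under that assumption (the function name, discounting, and anchoring convention are illustrative, not taken from the paper):

```python
def short_horizon_returns(rewards, anchor, gamma=0.9):
    """Discounted returns computed only from the anchor step onward
    (the boundary of the verified prefix); earlier steps receive zero
    signal, keeping the update short-horizon and stable."""
    returns = [0.0] * len(rewards)
    g = 0.0
    # Walk backward from the final step down to the anchor.
    for t in range(len(rewards) - 1, anchor - 1, -1):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns
```

Zeroing out the pre-anchor returns mirrors the idea that the prefix is already verified, so only the corrected continuation should drive policy updates.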
Yuanjie Lyu
University of Science and Technology of China
Chengyu Wang
Alibaba Group
Natural Language Processing · Large Language Model · Multi-modal Learning
Jun Huang
Independent Researcher
Tong Xu
University of Science and Technology of China