From Correction to Mastery: Reinforced Distillation of Large Language Model Agents

📅 2025-09-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language model (LLM) agents rely on massive backbone models, yet conventional trajectory distillation suffers from compounding errors caused by reasoning and knowledge gaps between teacher and student. To address this, we propose SCoRe, a student-centered corrective distillation framework. Unlike standard approaches, SCoRe lets the student autonomously generate reasoning trajectories, with the teacher intervening only at the first critical error. It further applies short-horizon reinforcement learning that resumes from the verified prefix before that error, improving training stability and generalization. Key technical components include iterative reasoning, precise error localization, reward anchoring, and trajectory-guided policy refinement. Evaluated on 12 challenging benchmarks, a 7B student model distilled with SCoRe matches the agent-level performance of a 72B teacher model, substantially narrowing the capacity gap and enabling efficient capability transfer.

📝 Abstract
Large Language Model agents excel at solving complex tasks through iterative reasoning and tool use, but typically depend on ultra-large, costly backbones. Existing distillation approaches train smaller students to imitate full teacher trajectories, yet reasoning and knowledge gaps between the teacher and student often lead to compounding errors. We propose SCoRe, a student-centered framework in which the student generates trajectories and the teacher intervenes only at the first critical error, producing training data matched to the student's ability and exposing specific weaknesses. The student is first fine-tuned on corrected trajectories. Subsequently, short-horizon reinforcement learning starts from the verified prefix before the first critical error, with target rewards assigned at that step. This design encourages autonomous problem-solving beyond imitation and improves training stability. Notably, on 12 challenging benchmarks, a 7B-parameter student distilled with SCoRe matches the agentic performance of a 72B-parameter teacher.
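The collection step the abstract describes can be sketched as a simple rollout loop: the student acts autonomously, the teacher corrects only the first critical error, and the index of the verified prefix is recorded for the later RL stage. This is a minimal toy illustration, not the paper's implementation; the policies, the error check, and the scalar "state" are all invented for the sketch.

```python
def student_step(state):
    # Toy "student" policy: doubles the value, but makes a deliberate
    # (illustrative) mistake once the state reaches 8.
    return state * 2 if state < 8 else state - 1

def teacher_step(state):
    # Toy "teacher" policy: always takes the correct action.
    return state * 2

def is_critical_error(student_next, teacher_next):
    # Hypothetical error check: the student's step deviates from the teacher's.
    return student_next != teacher_next

def collect_corrected_trajectory(start, horizon=5):
    """Student rolls out on its own; the teacher intervenes only at the
    first critical error, correcting that single step. Returns the
    trajectory and the step index where the verified prefix ends, which
    later anchors the short-horizon RL stage."""
    traj, state, prefix_end = [start], start, None
    for t in range(horizon):
        s_next, t_next = student_step(state), teacher_step(state)
        if prefix_end is None and is_critical_error(s_next, t_next):
            prefix_end = t   # verified prefix ends just before this step
            state = t_next   # teacher's one-time correction
        else:
            state = s_next   # student continues autonomously
        traj.append(state)
    return traj, prefix_end
```

Note that after the single correction the student resumes control, so the data stays matched to the student's own ability rather than to full teacher trajectories.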
Problem

Research questions and friction points this paper is trying to address.

Distilling smaller agents from large costly models
Addressing compounding errors in imitation learning
Improving autonomous problem-solving beyond teacher imitation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Student-centered distillation with teacher intervention
Fine-tuning on corrected trajectories for improvement
Short-horizon reinforcement learning from verified prefixes
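The last point, taken at face value, amounts to computing learning signal only over the short window that starts at the verified prefix boundary. A minimal sketch under that assumption (the function name, discounting, and anchoring convention are illustrative, not taken from the paper):

```python
def short_horizon_returns(rewards, anchor, gamma=0.9):
    """Discounted returns computed only from the anchor step onward
    (the boundary of the verified prefix); earlier steps receive zero
    signal, keeping the update short-horizon and stable."""
    returns = [0.0] * len(rewards)
    g = 0.0
    # Walk backward from the final step down to the anchor.
    for t in range(len(rewards) - 1, anchor - 1, -1):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns
```

Zeroing out the pre-anchor returns mirrors the idea that the prefix is already verified, so only the corrected continuation should drive policy updates.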
Yuanjie Lyu
University of Science and Technology of China
Chengyu Wang
Alibaba Group
Natural Language Processing · Large Language Model · Multi-modal Learning
Jun Huang
Independent Researcher
Tong Xu
University of Science and Technology of China