SOD: Step-wise On-policy Distillation for Small Language Model Agents

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of small language models in tool-integrated reasoning: long-horizon tool interactions are unstable and model capacity is limited, and under conventional on-policy distillation, erroneous tool calls cascade into later reasoning steps. To mitigate these issues, the authors propose SOD, a step-wise on-policy distillation framework with a step-level adaptive mechanism. This mechanism modulates distillation strength based on the KL divergence between teacher and student policies at each step, attenuating potentially misleading supervision in high-divergence regions while preserving dense guidance where the two policies are well aligned. The approach curbs trajectory divergence and improves the reasoning capabilities of compact models, achieving up to a 20.86% improvement over the second-best baseline across mathematical, scientific, and coding benchmarks. Notably, a 0.6B-parameter student attains 26.13% accuracy on AIME 2025, significantly outperforming existing baselines.

📝 Abstract

Tool-integrated reasoning (TIR) is difficult to scale to small language models due to instability in long-horizon tool interactions and limited model capacity. Reinforcement learning methods such as group relative policy optimization (GRPO) provide only sparse outcome-level rewards. Recently, on-policy distillation (OPD) has gained popularity by supplying dense token-level supervision from a teacher on student-generated trajectories. However, our experiments indicate that applying OPD to TIR leads to a critical failure mode: erroneous tool calls tend to cascade across subsequent reasoning steps, progressively amplifying student-teacher divergence and rendering the teacher's token-level supervision increasingly unreliable. To address this, we propose SOD, a step-wise on-policy distillation framework for small language model agents, which adaptively reweights distillation strength at each step based on step-level divergence. SOD can thereby attenuate potentially misleading teacher signals in high-divergence regions while preserving dense guidance in well-aligned states. Experiments on challenging math, science, and code benchmarks show that SOD achieves up to 20.86% improvement over the second-best baseline. Notably, our 0.6B student achieves 26.13% on AIME 2025, demonstrating effective transfer of agentic reasoning to lightweight models. Our code is available at https://github.com/YoungZ365/SOD.
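The step-level reweighting idea can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the weighting function, so the exponential form `exp(-KL / temperature)`, the use of reverse KL (student relative to teacher), the mean-over-tokens step divergence, and all function names below are assumptions made purely for illustration.

```python
import math
from typing import List, Sequence, Tuple

Dist = Sequence[float]  # one token-level probability distribution


def kl_divergence(p: Dist, q: Dist, eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def stepwise_distillation_loss(
    student_steps: List[List[Dist]],
    teacher_steps: List[List[Dist]],
    temperature: float = 1.0,
) -> Tuple[float, List[float]]:
    """Weight each step's token-level KL by a step-level factor.

    Assumption (hypothetical form): weight = exp(-step_KL / temperature),
    so well-aligned steps keep weight near 1 and high-divergence steps
    are attenuated. In a real training loop the weight would be treated
    as a constant (no gradient flows through it).
    """
    total, n_tokens, weights = 0.0, 0, []
    for s_step, t_step in zip(student_steps, teacher_steps):
        # Reverse KL per token: student distribution relative to teacher.
        token_kls = [kl_divergence(s, t) for s, t in zip(s_step, t_step)]
        step_kl = sum(token_kls) / len(token_kls)  # step-level divergence
        weight = math.exp(-step_kl / temperature)
        weights.append(weight)
        total += weight * sum(token_kls)
        n_tokens += len(token_kls)
    return total / n_tokens, weights


if __name__ == "__main__":
    # Step 0: student matches the teacher. Step 1: strong disagreement,
    # e.g. after an erroneous tool call.
    student = [[[0.5, 0.5]], [[0.9, 0.1]]]
    teacher = [[[0.5, 0.5]], [[0.1, 0.9]]]
    loss, weights = stepwise_distillation_loss(student, teacher)
    print("step weights:", weights)
    print("weighted loss:", loss)
```

In this toy setup the aligned step keeps full weight while the divergent step is down-weighted, which mirrors the stated goal of attenuating teacher signals in high-divergence regions while preserving dense guidance elsewhere. The `temperature` parameter is a hypothetical knob for how sharply disagreement is penalized.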

Problem

Research questions and friction points this paper is trying to address.

tool-integrated reasoning

on-policy distillation

small language models

student-teacher divergence

cascading errors

Innovation

Methods, ideas, or system contributions that make the work stand out.

Step-wise On-policy Distillation

Tool-integrated Reasoning

Adaptive Reweighting