🤖 AI Summary
Small language models (SLMs) generalize poorly on tool-use tasks: they are prone to overfitting during supervised fine-tuning (SFT) and struggle under standard reinforcement learning (RL) because rewards are sparse. To address these challenges, this paper proposes MENTOR, a teacher-guided dense reward distillation framework. It leverages a large language model (LLM) as a teacher to construct a composite, trajectory-based dense reward function, jointly integrating knowledge distillation and policy optimization within an RL framework so that SLMs acquire transferable tool-calling policies. The core innovation is to explicitly encode the teacher's multi-step reasoning structure into dense, hierarchical reward signals, thereby overcoming the dual limitations of conventional imitation learning and sparse-reward RL. Experiments demonstrate that the proposed method significantly outperforms both SFT and baseline RL approaches on cross-domain tool-use tasks, achieving absolute improvements of 12.7% in policy correctness and 23.4% in generalization accuracy.
📝 Abstract
Distilling the tool-using capabilities of large language models (LLMs) into smaller, more efficient small language models (SLMs) is a key challenge for their practical application. The predominant approach, supervised fine-tuning (SFT), suffers from poor generalization because it trains models to imitate a static set of teacher trajectories rather than learn a robust methodology. While reinforcement learning (RL) offers an alternative, standard RL with sparse rewards fails to guide SLMs effectively, leaving them to explore inefficiently and adopt suboptimal strategies. To address these distinct challenges, we propose MENTOR, a framework that synergistically combines RL with teacher-guided distillation. Instead of simple imitation, MENTOR employs an RL-based process to learn a more generalizable policy through exploration. In addition, to solve the problem of reward sparsity, it uses the teacher's reference trajectory to construct a dense, composite teacher-guided reward that provides fine-grained guidance. Extensive experiments demonstrate that MENTOR significantly improves the cross-domain generalization and strategic competence of SLMs compared to both SFT and standard sparse-reward RL baselines.
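To make the idea of a dense, composite, teacher-guided reward concrete, here is a minimal sketch. Everything in it is an illustrative assumption, not the paper's actual formulation: the `Step` record, the `composite_reward` function, the step-level comparison against the teacher trajectory, and the weights `w_tool`, `w_args`, and `w_outcome` are all hypothetical. The point is only the shape of the mechanism: rather than one sparse end-of-episode signal, the student is scored per step against the teacher's reference trajectory, with separate terms for picking the right tool, supplying the right arguments, and solving the task.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One tool call in a trajectory: which tool, and with what arguments."""
    tool: str
    args: dict = field(default_factory=dict)


def composite_reward(student_traj, teacher_traj,
                     w_tool=0.4, w_args=0.3, w_outcome=0.3,
                     task_solved=False):
    """Hypothetical dense reward: compare the student's trajectory step by
    step against the teacher's reference instead of waiting for a single
    sparse end-of-episode signal. Returns a score in [0, 1]."""
    n = max(len(teacher_traj), 1)
    # Fraction of steps where the student chose the same tool as the teacher.
    tool_hits = sum(1 for s, t in zip(student_traj, teacher_traj)
                    if s.tool == t.tool)
    # Fraction of steps where both the tool and its arguments match exactly.
    arg_hits = sum(1 for s, t in zip(student_traj, teacher_traj)
                   if s.tool == t.tool and s.args == t.args)
    outcome = 1.0 if task_solved else 0.0
    return (w_tool * tool_hits / n
            + w_args * arg_hits / n
            + w_outcome * outcome)


# Usage: the student picks the right first tool with wrong arguments,
# matches the second step exactly, and solves the task overall.
teacher = [Step("search", {"query": "capital of France"}),
           Step("answer", {"text": "Paris"})]
student = [Step("search", {"query": "France capital city"}),
           Step("answer", {"text": "Paris"})]
reward = composite_reward(student, teacher, task_solved=True)
print(reward)  # 0.4*(2/2) + 0.3*(1/2) + 0.3*1.0 = 0.85
```

In a real RL loop this scalar would feed a policy-gradient update; the exact sub-rewards, their granularity, and any hierarchy over them are design choices the paper makes that this sketch does not attempt to reproduce.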