🤖 AI Summary
Small language models (SLMs) generalize poorly on tool-use tasks: they are prone to overfitting during supervised fine-tuning (SFT) and struggle under standard reinforcement learning (RL) because rewards are sparse. To address these challenges, this paper proposes MENTOR, a teacher-guided dense reward distillation framework. It leverages a large language model (LLM) as a teacher to construct a composite, trajectory-based dense reward function, jointly integrating knowledge distillation and policy optimization within an RL framework so that SLMs acquire transferable tool-calling policies. The core innovation is to explicitly encode the teacher's multi-step reasoning structure into dense, hierarchical reward signals, thereby overcoming the dual limitations of conventional imitation learning and sparse-reward RL. Experiments demonstrate that the proposed method significantly outperforms both SFT and baseline RL approaches on cross-domain tool-use tasks, achieving absolute improvements of 12.7% in policy correctness and 23.4% in generalization accuracy.
📝 Abstract
Distilling the tool-using capabilities of large language models (LLMs) into smaller, more efficient small language models (SLMs) is a key challenge for their practical application. The predominant approach, supervised fine-tuning (SFT), suffers from poor generalization because it trains models to imitate a static set of teacher trajectories rather than learn a robust methodology. While reinforcement learning (RL) offers an alternative, standard RL with sparse rewards fails to guide SLMs effectively, leaving them to explore inefficiently and adopt suboptimal strategies. To address these distinct challenges, we propose MENTOR, a framework that synergistically combines RL with teacher-guided distillation. Instead of simple imitation, MENTOR employs an RL-based process to learn a more generalizable policy through exploration. In addition, to solve the problem of reward sparsity, it uses the teacher's reference trajectory to construct a dense, composite teacher-guided reward that provides fine-grained guidance. Extensive experiments demonstrate that MENTOR significantly improves the cross-domain generalization and strategic competence of SLMs compared to both SFT and standard sparse-reward RL baselines.
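To make the idea of a dense, composite, teacher-guided reward concrete, here is a minimal sketch. Everything in it is an illustrative assumption, not the paper's actual formulation: the `Step` record, the `composite_reward` function, the step-level comparison against the teacher trajectory, and the weights `w_tool`, `w_args`, and `w_outcome` are all hypothetical. The point is only the shape of the mechanism: rather than one sparse end-of-episode signal, the student is scored per step against the teacher's reference trajectory, with separate terms for picking the right tool, supplying the right arguments, and solving the task.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One tool call in a trajectory: which tool, and with what arguments."""
    tool: str
    args: dict = field(default_factory=dict)


def composite_reward(student_traj, teacher_traj,
                     w_tool=0.4, w_args=0.3, w_outcome=0.3,
                     task_solved=False):
    """Hypothetical dense reward: compare the student's trajectory step by
    step against the teacher's reference instead of waiting for a single
    sparse end-of-episode signal. Returns a score in [0, 1]."""
    n = max(len(teacher_traj), 1)
    # Fraction of steps where the student chose the same tool as the teacher.
    tool_hits = sum(1 for s, t in zip(student_traj, teacher_traj)
                    if s.tool == t.tool)
    # Fraction of steps where both the tool and its arguments match exactly.
    arg_hits = sum(1 for s, t in zip(student_traj, teacher_traj)
                   if s.tool == t.tool and s.args == t.args)
    outcome = 1.0 if task_solved else 0.0
    return (w_tool * tool_hits / n
            + w_args * arg_hits / n
            + w_outcome * outcome)


# Usage: the student picks the right first tool with wrong arguments,
# matches the second step exactly, and solves the task overall.
teacher = [Step("search", {"query": "capital of France"}),
           Step("answer", {"text": "Paris"})]
student = [Step("search", {"query": "France capital city"}),
           Step("answer", {"text": "Paris"})]
reward = composite_reward(student, teacher, task_solved=True)
print(reward)  # 0.4*(2/2) + 0.3*(1/2) + 0.3*1.0 = 0.85
```

In a real RL loop this scalar would feed a policy-gradient update; the exact sub-rewards, their granularity, and any hierarchy over them are design choices the paper makes that this sketch does not attempt to reproduce.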