MENTOR: A Reinforcement Learning Framework for Model Enhancement via Teacher-Optimized Rewards in Small Models

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Small language models (SLMs) suffer from poor generalization in tool-use tasks, are prone to overfitting during supervised fine-tuning (SFT), and struggle with standard reinforcement learning (RL) due to sparse rewards. To address these challenges, this paper proposes a teacher-guided dense reward distillation framework. It leverages a large language model (LLM) as a teacher to construct a composite, trajectory-based dense reward function, jointly integrating knowledge distillation and policy optimization within an RL framework to guide SLMs in acquiring transferable tool-calling policies. The core innovation lies in explicitly encoding the teacher’s multi-step reasoning structure into dense, hierarchical reward signals—thereby overcoming the dual limitations of conventional imitation learning and sparse-reward RL. Experiments demonstrate that the proposed method significantly outperforms both SFT and baseline RL approaches on cross-domain tool-use tasks, achieving absolute improvements of 12.7% in policy correctness and 23.4% in generalization accuracy.

📝 Abstract
Distilling the tool-using capabilities of large language models (LLMs) into smaller, more efficient small language models (SLMs) is a key challenge for their practical application. The predominant approach, supervised fine-tuning (SFT), suffers from poor generalization because it trains models to imitate a static set of teacher trajectories rather than to learn a robust methodology. While reinforcement learning (RL) offers an alternative, standard RL with sparse rewards fails to guide SLMs effectively, leading to inefficient exploration and suboptimal strategies. To address these distinct challenges, we propose MENTOR, a framework that synergistically combines RL with teacher-guided distillation. Instead of simple imitation, MENTOR employs an RL-based process to learn a more generalizable policy through exploration. In addition, to solve the problem of reward sparsity, it uses a teacher's reference trajectory to construct a dense, composite teacher-guided reward that provides fine-grained guidance. Extensive experiments demonstrate that MENTOR significantly improves the cross-domain generalization and strategic competence of SLMs compared to both SFT and standard sparse-reward RL baselines.
Problem

Research questions and friction points this paper is trying to address.

Distilling tool-using capabilities from large to small language models
Addressing poor generalization in supervised fine-tuning methods
Solving reward sparsity in reinforcement learning for small models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines reinforcement learning with teacher-guided distillation
Uses teacher reference trajectory for dense reward signals
Learns a generalizable policy through exploration rather than imitation
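As a rough illustration of the teacher-guided dense reward idea summarized above, the sketch below combines a sparse task-success signal with step-level similarity to a teacher's reference trajectory. This is a minimal toy, not the paper's actual formulation: the function names, the reward weights, and the token-overlap (Jaccard) similarity are all hypothetical stand-ins for whatever composite reward MENTOR actually defines.

```python
def step_similarity(student_step: str, teacher_step: str) -> float:
    """Toy similarity: token-overlap (Jaccard) between two tool-call strings."""
    a, b = set(student_step.split()), set(teacher_step.split())
    return len(a & b) / len(a | b) if a | b else 1.0

def composite_reward(student_traj, teacher_traj, task_solved: bool,
                     w_sparse: float = 1.0, w_dense: float = 0.5) -> float:
    """Sparse outcome reward plus dense per-step guidance from the teacher.

    The dense term compares the student's steps to the aligned steps of the
    teacher's reference trajectory, so the policy receives a learning signal
    even when the overall task fails (the sparse term is zero).
    """
    sparse = 1.0 if task_solved else 0.0
    # Compare aligned steps only; extra unmatched steps earn no dense credit.
    n = min(len(student_traj), len(teacher_traj))
    dense = sum(step_similarity(student_traj[i], teacher_traj[i])
                for i in range(n)) / max(len(teacher_traj), 1)
    return w_sparse * sparse + w_dense * dense

# Hypothetical tool-call trajectories for a weather-lookup task.
teacher = ["search(query='weather Paris')", "extract(field='temp')"]
student = ["search(query='weather Paris')", "extract(field='humidity')"]
reward = composite_reward(student, teacher, task_solved=False)
```

Even though the student fails the task (sparse reward 0), it still earns partial dense credit for matching the teacher's first step, which is exactly the kind of fine-grained guidance a pure sparse-reward setup cannot provide.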
👥 Authors
ChangSu Choi
Seoul National University of Science and Technology (SEOULTECH)
Hoyun Song
Postdoctoral researcher, KAIST (NLP, Knowledge Integration, Domain-Specific Modeling, LLM)
Dongyeon Kim
Korea Advanced Institute of Science and Technology (KAIST)
WooHyeon Jung
Korea Advanced Institute of Science and Technology (KAIST)
Minkyung Cho
Korea Advanced Institute of Science and Technology (KAIST)
Sunjin Park
LG CNS
NohHyeob Bae
LG CNS
Seona Yu
LG CNS
KyungTae Lim
École normale supérieure (Natural Language Processing)