Selective Expert Guidance for Effective and Diverse Exploration in Reinforcement Learning of LLMs

πŸ“… 2025-10-05
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address low exploration efficiency and insufficient policy diversity in large language models (LLMs) trained via reinforcement learning with verifiable rewards (RLVR), this paper proposes MENTOR, a hybrid expert-guided navigation framework. Its core innovation is introducing selective, token-level expert guidance exclusively at critical decision points, thereby avoiding full-trajectory imitation and balancing effective exploration with policy diversity. By analyzing expert trajectories and performing token-level policy optimization, MENTOR implements a lightweight, precise intervention mechanism that significantly enhances policy generalization. Experiments demonstrate that MENTOR captures the essence of expert strategies, substantially outperforming existing RLVR and imitation-learning methods on multi-task reasoning benchmarks while improving both reasoning performance and robustness.

πŸ“ Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a widely adopted technique for enhancing the reasoning ability of Large Language Models (LLMs). However, the effectiveness of RLVR strongly depends on the capability of the base model: RLVR requires the model to perform high-quality exploration, which involves both effectiveness and diversity. Unfortunately, existing methods address this issue by imitating expert trajectories, which improves effectiveness but neglects diversity. To address this, we argue that the expert needs to provide guidance only at critical decision points rather than along the entire reasoning path. Based on this insight, we propose MENTOR: Mixed-policy Expert Navigation for Token-level Optimization of Reasoning, a framework that provides expert guidance only at critical decision points to enable effective and diverse exploration in RLVR. Extensive experiments show that MENTOR enables models to capture the essence of expert strategies rather than surface imitation, thereby performing high-quality exploration and achieving superior overall performance. Our code is available online.
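One way to picture "expert guidance only at critical decision points" is a mixed-policy decoding loop that defers to the expert only when the policy itself is uncertain. The sketch below is illustrative, not the paper's actual mechanism: the entropy criterion, the threshold, and the function names (`mixed_policy_decode`, `policy_step`, `expert_step`) are all assumptions made here for clarity.

```python
import math

def entropy(probs):
    """Shannon entropy of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mixed_policy_decode(policy_step, expert_step, threshold=1.0):
    """Generate one token under a mixed policy (illustrative sketch).

    policy_step: returns (next-token probabilities, sampled policy token).
    expert_step: returns the expert's token for the same position.
    High entropy is used here as an *assumed* proxy for a critical
    decision point; MENTOR's real criterion may differ.
    """
    probs, policy_token = policy_step()
    if entropy(probs) > threshold:
        return expert_step(), True   # critical point: expert intervenes
    return policy_token, False       # otherwise: policy explores freely
```

Under this reading, tokens where the policy is already confident are left untouched (preserving diversity), while expert tokens are injected only where the policy would otherwise flounder (preserving effectiveness).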
Problem

Research questions and friction points this paper is trying to address.

Enhancing exploration effectiveness and diversity in RLVR
Addressing base model capability limitations for quality exploration
Providing selective expert guidance at critical decision points
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective expert guidance at critical decision points
Mixed-policy navigation for token-level optimization
Enables effective diverse exploration in reinforcement learning
Authors
Zishang Jiang (School of Data Science, Fudan University)
Jinyi Han (Knowledge Works Lab)
Tingyun Li (School of Data Science, Fudan University)
Xinyi Wang (School of Data Science, Fudan University)
Sihang Jiang (Fudan University)
Jiaqing Liang (Fudan University)
Zhaoqian Dai (Ant Group)
Shuguang Ma (Ant Group)
Fei Yu (Ant Group)
Yanghua Xiao (College of Computer Science and Artificial Intelligence, Fudan University)