AI Summary
To address low exploration efficiency and insufficient policy diversity in large language models (LLMs) trained via reinforcement learning with verifiable rewards (RLVR), this paper proposes MENTOR, a hybrid expert-guided navigation framework. Its core innovation is selective, token-level expert guidance applied exclusively at critical decision points, which avoids full-trajectory imitation and balances effective exploration with policy diversity. By analyzing expert trajectories and performing token-level policy optimization, MENTOR implements a lightweight, precise intervention mechanism that significantly enhances policy generalization. Experiments demonstrate that MENTOR accurately captures the essence of expert strategies and substantially outperforms existing RLVR and imitation-learning methods on multi-task reasoning benchmarks, improving both reasoning performance and robustness.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a widely adopted technique for enhancing the reasoning ability of Large Language Models (LLMs). However, the effectiveness of RLVR strongly depends on the capability of the base model, because RLVR requires the model to perform high-quality exploration that is both effective and diverse. Existing methods address this issue by imitating expert trajectories, which improves effectiveness but neglects diversity. To address this, we argue that the expert needs to provide guidance only at critical decision points rather than along the entire reasoning path. Based on this insight, we propose MENTOR: Mixed-policy Expert Navigation for Token-level Optimization of Reasoning, a framework that provides expert guidance only at critical decision points, enabling effective and diverse exploration in RLVR. Extensive experiments show that MENTOR enables models to capture the essence of expert strategies rather than imitating them superficially, thereby performing high-quality exploration and achieving superior overall performance. Our code is available online.
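The abstract above does not specify how "critical decision points" are detected or how the mixed policy is formed. One plausible reading is sketched below, assuming (hypothetically) that critical points are tokens where the base policy's predictive entropy is high, and that the expert supplies the token only at those points while the base policy explores everywhere else. All names (`mixed_policy_decode`, the toy policies, the entropy threshold) are illustrative assumptions, not MENTOR's actual API.

```python
import random


def mixed_policy_decode(base_policy, expert_policy, entropy, steps,
                        threshold=1.0, rng=None):
    """Toy sketch of mixed-policy decoding.

    At steps where the base policy's entropy exceeds `threshold`
    (a stand-in for a "critical decision point"), the next token is
    drawn from the expert; at all other steps the base policy samples
    freely, preserving exploration diversity. This is an assumed
    mechanism for illustration only.
    """
    rng = rng or random.Random(0)
    trajectory, sources = [], []
    for t in range(steps):
        if entropy(t) > threshold:
            # Critical decision point: defer to the expert.
            token = expert_policy(t, rng)
            sources.append("expert")
        else:
            # Non-critical step: let the base policy explore.
            token = base_policy(t, rng)
            sources.append("base")
        trajectory.append(token)
    return trajectory, sources


# Toy usage: the expert intervenes only where entropy is high.
base = lambda t, rng: f"b{t}"
expert = lambda t, rng: f"e{t}"
ent = lambda t: 2.0 if t % 3 == 0 else 0.5  # high entropy every 3rd step
traj, src = mixed_policy_decode(base, expert, ent, steps=6)
```

Under this sketch, only a small fraction of tokens come from the expert, so the resulting trajectories remain mostly self-generated while the expert steers the branching choices that matter.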