AI Summary
To address low exploration efficiency and insufficient policy diversity in large language models (LLMs) trained via reinforcement learning with verifiable rewards (RLVR), this paper proposes MENTOR, a hybrid expert-guided navigation framework. Its core innovation is selective, token-level expert guidance applied exclusively at critical decision points, which avoids full-trajectory imitation and balances effective exploration with policy diversity. By analyzing expert trajectories and performing token-level policy optimization, MENTOR implements a lightweight, precise intervention mechanism that significantly enhances policy generalization. Experiments demonstrate that MENTOR accurately captures the essence of expert strategies and substantially outperforms existing RLVR and imitation-learning methods on multi-task reasoning benchmarks, improving both reasoning performance and robustness.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a widely adopted technique for enhancing the reasoning ability of Large Language Models (LLMs). However, the effectiveness of RLVR strongly depends on the capability of the base model, because RLVR requires the model to perform high-quality exploration that is both effective and diverse. Existing methods address this issue by imitating expert trajectories, which improves effectiveness but neglects diversity. To address this, we argue that the expert needs to provide guidance only at critical decision points rather than along the entire reasoning path. Based on this insight, we propose MENTOR: Mixed-policy Expert Navigation for Token-level Optimization of Reasoning, a framework that provides expert guidance only at critical decision points, enabling effective and diverse exploration in RLVR. Extensive experiments show that MENTOR enables models to capture the essence of expert strategies rather than imitating them superficially, thereby performing high-quality exploration and achieving superior overall performance. Our code is available online.
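The abstract above does not specify how "critical decision points" are detected or how the mixed policy is formed. One plausible reading is sketched below, assuming (hypothetically) that critical points are tokens where the base policy's predictive entropy is high, and that the expert supplies the token only at those points while the base policy explores everywhere else. All names (`mixed_policy_decode`, the toy policies, the entropy threshold) are illustrative assumptions, not MENTOR's actual API.

```python
import random


def mixed_policy_decode(base_policy, expert_policy, entropy, steps,
                        threshold=1.0, rng=None):
    """Toy sketch of mixed-policy decoding.

    At steps where the base policy's entropy exceeds `threshold`
    (a stand-in for a "critical decision point"), the next token is
    drawn from the expert; at all other steps the base policy samples
    freely, preserving exploration diversity. This is an assumed
    mechanism for illustration only.
    """
    rng = rng or random.Random(0)
    trajectory, sources = [], []
    for t in range(steps):
        if entropy(t) > threshold:
            # Critical decision point: defer to the expert.
            token = expert_policy(t, rng)
            sources.append("expert")
        else:
            # Non-critical step: let the base policy explore.
            token = base_policy(t, rng)
            sources.append("base")
        trajectory.append(token)
    return trajectory, sources


# Toy usage: the expert intervenes only where entropy is high.
base = lambda t, rng: f"b{t}"
expert = lambda t, rng: f"e{t}"
ent = lambda t: 2.0 if t % 3 == 0 else 0.5  # high entropy every 3rd step
traj, src = mixed_policy_decode(base, expert, ent, steps=6)
```

Under this sketch, only a small fraction of tokens come from the expert, so the resulting trajectories remain mostly self-generated while the expert steers the branching choices that matter.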