🤖 AI Summary
Existing RLVR methods rely either on self-exploration or on guidance from a single off-policy teacher to elicit LongCoT reasoning, which can entrench the model's own biases and limit exploration diversity. To address this, we propose AMPO, a framework built on an on-demand guidance mechanism and a comprehension-aware, multi-teacher selection strategy: guidance from multiple teachers is injected only when the policy model fails to produce a correct solution, preserving autonomous exploration while enriching the diversity of reasoning paths. AMPO combines multi-teacher knowledge distillation, policy-gradient optimization, a dynamic triggering mechanism, and comprehension-aware path selection into an online multi-teacher reinforcement learning paradigm. On mathematical reasoning benchmarks, AMPO outperforms the strong GRPO baseline by 4.3% in accuracy, improves out-of-distribution generalization by 12.2%, and substantially raises Pass@k performance and generation diversity, matching or exceeding approaches that rely on a single, stronger teacher.
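To make the on-demand trigger concrete, here is a minimal sketch of how such a mechanism could be wired up. All names (`sample_from_policy`, `verify`, `teacher_samplers`) are illustrative assumptions for this sketch, not AMPO's actual implementation:

```python
from typing import Callable, List

def collect_rollout_group(
    prompt: str,
    sample_from_policy: Callable[[str, int], List[str]],  # hypothetical: student sampler
    verify: Callable[[str, str], bool],                   # hypothetical: verifiable-reward check
    teacher_samplers: List[Callable[[str], str]],         # hypothetical: one sampler per teacher
    group_size: int = 8,
) -> List[str]:
    """Guidance-on-demand: query teachers only when every on-policy rollout fails."""
    rollouts = sample_from_policy(prompt, group_size)
    if any(verify(prompt, r) for r in rollouts):
        # At least one self-generated solution is correct: keep pure self-exploration.
        return rollouts
    # All rollouts failed: enrich the group with one candidate path per teacher.
    guidance = [teacher(prompt) for teacher in teacher_samplers]
    return rollouts + guidance
```

The key design choice this captures is that teacher guidance never displaces correct self-generated solutions; it only widens the search when the student is stuck.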
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) is a promising paradigm for enhancing the reasoning ability of Large Language Models (LLMs). However, prevailing methods primarily rely on self-exploration or a single off-policy teacher to elicit long chain-of-thought (LongCoT) reasoning, which may introduce intrinsic model biases and restrict exploration, ultimately limiting reasoning diversity and performance. Drawing inspiration from multi-teacher strategies in knowledge distillation, we introduce Adaptive Multi-Guidance Policy Optimization (AMPO), a novel framework that adaptively leverages guidance from multiple proficient teacher models, but only when the on-policy model fails to generate correct solutions. This "guidance-on-demand" approach expands exploration while preserving the value of self-discovery. Moreover, AMPO incorporates a comprehension-based selection mechanism, prompting the student to learn from the reasoning paths that it is most likely to comprehend, thus balancing broad exploration with effective exploitation. Extensive experiments show that AMPO substantially outperforms a strong baseline (GRPO), with a 4.3% improvement on mathematical reasoning tasks and 12.2% on out-of-distribution tasks, while significantly boosting Pass@k performance and enabling more diverse exploration. Notably, using four peer-sized teachers, our method achieves results comparable to approaches that leverage a single, more powerful teacher (e.g., DeepSeek-R1) with more data. These results demonstrate a more efficient and scalable path to superior reasoning and generalizability. Our code is available at https://github.com/SII-Enigma/AMPO.
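As one plausible reading of the comprehension-based selection described above, the student could score each teacher reasoning path by its length-normalized log-likelihood under its own policy and learn from the highest-scoring path. The sketch below assumes a hypothetical `token_logprobs` helper; the paper's actual scoring rule may differ:

```python
from typing import Callable, List

def select_comprehensible_path(
    prompt: str,
    teacher_paths: List[str],
    token_logprobs: Callable[[str, str], List[float]],  # hypothetical: student's per-token log-probs
) -> str:
    """Pick the teacher path the student is most likely to comprehend, proxied here
    by the mean token log-probability of the path under the student's own policy."""
    def score(path: str) -> float:
        lps = token_logprobs(prompt, path)
        return sum(lps) / max(len(lps), 1)  # length-normalized log-likelihood
    return max(teacher_paths, key=score)
```

Length normalization is used here so that longer teacher paths are not penalized simply for containing more tokens; it is an assumption of this sketch, not a detail confirmed by the abstract.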