MEML-GRPO: Heterogeneous Multi-Expert Mutual Learning for RLVR Advancement

📅 2025-08-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the lack of learning signal caused by sparse rewards in Reinforcement Learning with Verifiable Rewards (RLVR), this paper proposes MEML-GRPO, a Multi-Expert Mutual Learning framework. MEML-GRPO uses heterogeneous expert prompts to generate diverse responses and introduces an inter-expert mutual learning mechanism for cross-prompt knowledge transfer, which alleviates the zero-reward problem without additional human annotation. Its key contribution is the combination of expert prompting, diverse response sampling, and knowledge sharing among experts, aimed at improving generalization and training efficiency on reasoning tasks. Evaluations across multiple reasoning benchmarks show consistent improvements, with average gains of 4.89% for Qwen and 11.33% for Llama, addressing a core limitation of conventional RLVR approaches.

📝 Abstract

Recent advances demonstrate that reinforcement learning with verifiable rewards (RLVR) significantly enhances the reasoning capabilities of large language models (LLMs). However, standard RLVR faces challenges with reward sparsity, where zero rewards from consistently incorrect candidate answers provide no learning signal, particularly in challenging tasks. To address this, we propose Multi-Expert Mutual Learning GRPO (MEML-GRPO), an innovative framework that utilizes diverse expert prompts as system prompts to generate a broader range of responses, substantially increasing the likelihood of identifying correct solutions. Additionally, we introduce an inter-expert mutual learning mechanism that facilitates knowledge sharing and transfer among experts, further boosting the model's performance through RLVR. Extensive experiments across multiple reasoning benchmarks show that MEML-GRPO delivers significant improvements, achieving an average performance gain of 4.89% with Qwen and 11.33% with Llama, effectively overcoming the core limitations of traditional RLVR methods.
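The reward-sparsity problem described in the abstract can be illustrated with the group-relative advantage commonly used in GRPO, where each sampled response's reward is normalized by the mean and standard deviation of its group. The sketch below is illustrative only and is not taken from the paper: the function name `group_advantages`, the `eps` constant, and the binary 0/1 rewards are assumptions.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps).

    Hypothetical helper for illustration; `eps` only guards against
    division by zero when every reward in the group is identical.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Every candidate answer is wrong: all rewards are 0, so every
# advantage is 0 and the policy gradient carries no signal.
all_wrong = group_advantages([0.0, 0.0, 0.0, 0.0])

# One candidate is correct: the correct response gets a positive
# advantage and the incorrect ones get negative advantages.
one_right = group_advantages([0.0, 0.0, 0.0, 1.0])

print(all_wrong)
print(one_right)
```

Under these assumptions, a group with no correct response contributes nothing to the update, which is why increasing the chance that at least one sampled response is correct matters on hard problems.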

Problem

Research questions and friction points this paper is trying to address.

Addresses reward sparsity in RLVR for LLMs

Enhances reasoning via multi-expert mutual learning

Improves performance on challenging reasoning tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Heterogeneous multi-expert mutual learning framework

Diverse expert prompts for broader response generation

Inter-expert knowledge sharing boosts RLVR performance
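As a rough illustration of why diverse expert prompts can help, the sketch below computes the probability that at least one response in a sampled group is correct. It assumes independent samples and uses made-up per-sample success rates; neither the numbers nor the function `p_any_correct` come from the paper.

```python
def p_any_correct(success_probs):
    """Probability that at least one independent sample is correct.

    `success_probs` holds one hypothetical success probability per
    sampled response; independence between samples is assumed.
    """
    p_all_fail = 1.0
    for p in success_probs:
        p_all_fail *= 1.0 - p
    return 1.0 - p_all_fail


# Eight samples from a single prompt that rarely solves this problem.
single_prompt = p_any_correct([0.02] * 8)

# Same budget split across two expert prompts, where the second
# prompt happens to suit this problem better (hypothetical rates).
two_experts = p_any_correct([0.02] * 4 + [0.10] * 4)

print(f"single prompt: {single_prompt:.3f}")
print(f"two experts:   {two_experts:.3f}")
```

The comparison only favors multiple experts when some expert prompt has a higher success rate on the problem at hand. The sketch shows the arithmetic behind that intuition, not a measured effect.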

🔎 Similar Papers

Adaptive Task Allocation in Multi-Human Multi-Robot Teams under Team Heterogeneity and Dynamic Information Uncertainty