MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-Language Models

📅 2026-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of conventional deterministic top-K routing in vision-language Mixture-of-Experts (MoE) models, which often leads to homogeneous expert selection and overfitting, thereby constraining performance gains. To overcome this, the study introduces reinforcement learning into MoE routing optimization for the first time, formulating expert selection as a sequential decision-making problem. The authors propose a novel routing strategy based on Group Relative Policy Optimization (GRPO) and incorporate a modality-aware router guidance mechanism to enhance training stability and efficiency. This approach enables task-level expert specialization while promoting diverse expert utilization. Extensive experiments demonstrate that the method significantly outperforms standard top-K routing and its variants across multimodal image and video benchmarks, effectively mitigating overfitting and improving overall model performance.

📝 Abstract
Mixture-of-Experts (MoE) has emerged as an effective approach to reducing the computational overhead of Transformer architectures by sparsely activating a subset of parameters for each token while preserving high model capacity. This paradigm has recently been extended to Vision-Language Models (VLMs), enabling scalable multi-modal understanding at reduced computational cost. However, the widely adopted deterministic top-K routing mechanism may overlook more effective expert combinations and lead to expert overfitting. To address this limitation and improve the diversity of expert selection, we propose MoE-GRPO, a reinforcement learning (RL)-based framework for optimizing expert routing in MoE-based VLMs. Specifically, we formulate expert selection as a sequential decision-making problem and optimize it using Group Relative Policy Optimization (GRPO), allowing the model to learn adaptive expert routing policies through exploration and reward-based feedback. Furthermore, we introduce a modality-aware router guidance mechanism that enhances training stability and efficiency by discouraging the router from exploring experts that are infrequently activated for a given modality. Extensive experiments on multi-modal image and video benchmarks show that MoE-GRPO consistently outperforms standard top-K routing and its variants by promoting more diverse expert selection, thereby mitigating expert overfitting and enabling task-level expert specialization.
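To make the abstract's core idea concrete, the sketch below illustrates the two GRPO ingredients it names: sampling a *group* of stochastic top-K expert selections from the router (exploration, replacing deterministic top-K) and computing group-relative advantages by normalizing each sample's reward against the group's mean and standard deviation. This is a minimal illustrative sketch, not the paper's implementation; all function names, the reward signal, and the sampling scheme are assumptions.

```python
import math
import random

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each reward against its group's
    mean and standard deviation (no learned value function needed)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def sample_routing_group(router_logits, top_k, group_size, rng):
    """Sample `group_size` stochastic top-K expert selections from router
    logits, in place of a single deterministic top-K pick (illustrative)."""
    # softmax over router logits (numerically stable)
    m = max(router_logits)
    exps = [math.exp(l - m) for l in router_logits]
    z = sum(exps)
    probs = [e / z for e in exps]

    group = []
    for _ in range(group_size):
        # draw top_k *distinct* experts, weighted by router probabilities
        remaining = list(range(len(router_logits)))
        weights = probs[:]
        chosen = []
        for _ in range(top_k):
            i = rng.choices(range(len(remaining)), weights=weights)[0]
            chosen.append(remaining.pop(i))
            weights.pop(i)
        group.append(sorted(chosen))
    return group

# Example: 4 experts, top-2 routing, a group of 4 sampled selections.
rng = random.Random(0)
group = sample_routing_group([0.1, 2.0, -1.0, 0.5], top_k=2,
                             group_size=4, rng=rng)
# Hypothetical per-sample rewards (in the paper these would come from
# task-level feedback); advantages are zero-mean within the group.
advantages = group_relative_advantages([1.0, 0.5, 2.0, 0.25])
```

Each sampled routing would then be scored, and its advantage used to reinforce or suppress the router's probability of the chosen expert set; the modality-aware guidance described above would additionally down-weight experts rarely activated for the current modality before sampling.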
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
Vision-Language Models
expert routing
expert overfitting
top-K routing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
Reinforcement Learning
Vision-Language Models
Expert Routing
GRPO