DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage

📅 2026-03-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses instability in training multimodal large language models with GRPO reinforcement learning, which often stems from sparse rewards and vanishing advantage signals, particularly when tasks are either too easy or too difficult and therefore yield little optimization signal. To mitigate this, the authors propose a difficulty-adaptive variant advantage method that dynamically assesses task difficulty through a global difficulty-aware mechanism, samples difficulty-matched problem variants, and computes advantages by integrating local and global group information with difficulty-based weighting and normalization. This alleviates reward sparsity and advantage collapse, yielding significant gains over existing methods across six mainstream multimodal reasoning benchmarks while improving both training efficiency and inference performance.
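The advantage-vanishing problem the summary describes can be seen directly in the standard GRPO advantage formula, which normalizes each sampled response's reward by its group's mean and standard deviation. A minimal sketch (not the paper's code):

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Standard GRPO group-relative advantage: center each reward on
    the group mean and scale by the group standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Mixed outcomes in a group yield informative advantages (+/-1 here)...
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
# ...but if the problem is so hard (or easy) that every sampled response
# gets the same reward, all advantages collapse to zero: no gradient signal.
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))
```

The second call illustrates why overly easy or hard problems contribute nothing to optimization, which is the failure mode DIVA-GRPO targets.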

📝 Abstract
Reinforcement learning (RL) with group relative policy optimization (GRPO) has become a widely adopted approach for enhancing the reasoning capabilities of multimodal large language models (MLLMs). While GRPO enables long-chain reasoning without a critic, it often suffers from sparse rewards on difficult problems and advantage vanishing when group-level rewards are too consistent for overly easy or hard problems. Existing solutions (sample expansion, selective utilization, and indirect reward design) often fail to maintain enough variance in within-group reward distributions to yield clear optimization signals. To address this, we propose DIVA-GRPO, a difficulty-adaptive variant advantage method that adjusts variant difficulty distributions from a global perspective. DIVA-GRPO dynamically assesses problem difficulty, samples variants with appropriate difficulty levels, and calculates advantages across local and global groups using difficulty-weighted and normalized scaling. This alleviates reward sparsity and advantage vanishing while improving training stability. Extensive experiments on six reasoning benchmarks demonstrate that DIVA-GRPO outperforms existing approaches in training efficiency and reasoning performance. Code: https://github.com/Siaaaaaa1/DIVA-GRPO
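The abstract's "advantages across local and global groups using difficulty-weighted and normalized scaling" can be sketched as follows. The exact formula is not given on this page, so the blend weight `alpha`, the difficulty weight `1 + difficulty`, and the function name are illustrative assumptions, not the paper's definition:

```python
import statistics

def diva_advantages(local_rewards, global_mean, global_std,
                    difficulty, alpha=0.5, eps=1e-6):
    """Hypothetical sketch: blend within-group (local) normalization with
    normalization against global reward statistics, then scale by an
    estimated task difficulty (e.g. 1 - empirical success rate)."""
    l_mean = statistics.fmean(local_rewards)
    l_std = statistics.pstdev(local_rewards)
    out = []
    for r in local_rewards:
        local_adv = (r - l_mean) / (l_std + eps)
        global_adv = (r - global_mean) / (global_std + eps)
        # difficulty weight amplifies signal from successes on hard tasks
        w = 1.0 + difficulty
        out.append(w * (alpha * local_adv + (1 - alpha) * global_adv))
    return out
```

Note that even when a local group's rewards are uniform (so the local term is zero and plain GRPO would produce no signal), the global term can still yield a nonzero, difficulty-scaled advantage, which is the intuition behind mitigating advantage vanishing.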
Problem

Research questions and friction points this paper is trying to address.

multimodal reasoning
reinforcement learning
reward sparsity
advantage vanishing
group relative policy optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Difficulty-Adaptive Variant Advantage
Group Relative Policy Optimization
Multimodal Reasoning
Reward Sparsity
👥 Authors
Haowen Gao
State Key Laboratory of AI Safety, Institute of Computing Technology, CAS, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Zhenyu Zhang
Kuaishou Technology, Beijing, China
Liang Pang
Associate Professor, Institute of Computing Technology, Chinese Academy of Sciences
Fangda Guo
State Key Laboratory of AI Safety, Institute of Computing Technology, CAS, Beijing, China
Hongjian Dou
Alibaba
Guannan Lv
Kuaishou Technology, Beijing, China
Shaoguo Liu
Alibaba Corporation
Tingting Gao
Kuaishou Technology, Beijing, China
Huawei Shen
State Key Laboratory of AI Safety, Institute of Computing Technology, CAS, Beijing, China
Xueqi Cheng
Ph.D. student, Florida State University