VADE: Variance-Aware Dynamic Sampling via Online Sample-Level Difficulty Estimation for Multimodal RL

📅 2025-11-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
In multimodal reinforcement learning, group-based policy optimization methods (e.g., GRPO/GSPO) suffer from advantage-estimation collapse and vanishing gradients when group-wise rewards are highly consistent. Existing mitigation strategies, whether filtering-based or sampling-based, exhibit critical limitations: excessive computational overhead or poor online adaptability, respectively. To address these issues, the paper proposes VADE, a Variance-Aware Dynamic sampling framework via online sample-level difficulty Estimation, the first to integrate *online sample difficulty estimation* and *variance-aware selection* into group-based policy optimization. VADE models sample difficulty via a Beta distribution, employs Thompson sampling to prioritize high-information samples, and introduces a two-scale prior decay mechanism to dynamically adapt to policy evolution. Crucially, VADE requires no additional rollouts and is plug-and-play. Evaluated on multimodal reasoning benchmarks, VADE significantly outperforms strong baselines, improving both model performance and sample efficiency while reducing computational cost by over 30%.

📝 Abstract
Group-based policy optimization methods like GRPO and GSPO have become standard for training multimodal models, leveraging group-wise rollouts and relative advantage estimation. However, they suffer from a critical *gradient vanishing* problem when all responses within a group receive identical rewards, causing advantage estimates to collapse and training signals to diminish. Existing attempts to mitigate this issue fall into two paradigms: filtering-based and sampling-based methods. Filtering-based methods first generate rollouts broadly and then retroactively filter out uninformative groups, leading to substantial computational overhead. Sampling-based methods proactively select effective samples before rollout but rely on static criteria or prior dataset knowledge, lacking real-time adaptability. To address these issues, we propose **VADE**, a **V**ariance-**A**ware **D**ynamic sampling framework via online sample-level difficulty **E**stimation. Our framework integrates three key components: online sample-level difficulty estimation using Beta distributions, a Thompson sampler that maximizes information gain through the estimated correctness probability, and a two-scale prior decay mechanism that maintains robust estimation under policy evolution. This three-component design enables VADE to dynamically select the most informative samples, amplifying training signals while eliminating extra rollout costs. Extensive experiments on multimodal reasoning benchmarks show that VADE consistently outperforms strong baselines in both performance and sample efficiency, while achieving a dramatic reduction in computational overhead. Moreover, our framework serves as a plug-and-play component that can be seamlessly integrated into existing group-based RL algorithms. Code and models are available at https://VADE-RL.github.io.
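The abstract's first two components (Beta-distribution difficulty estimates plus Thompson sampling) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class and function names are invented here, and the scoring rule, preferring samples whose drawn correctness probability is near 0.5, where group-reward variance p(1−p) peaks, is an assumption about how "variance-aware" selection could work.

```python
import random

class BetaDifficulty:
    """Per-sample Beta posterior over correctness probability (illustrative)."""
    def __init__(self, alpha: float = 1.0, beta: float = 1.0):
        self.alpha, self.beta = alpha, beta

    def update(self, num_correct: int, num_total: int) -> None:
        # Conjugate Bernoulli/Beta update from one group of rollouts.
        self.alpha += num_correct
        self.beta += num_total - num_correct

    def thompson_draw(self) -> float:
        # Sample a plausible correctness probability from the posterior.
        return random.betavariate(self.alpha, self.beta)

def select_batch(trackers: dict, batch_size: int) -> list:
    """Thompson-sampling selection: keep samples whose drawn correctness
    probability is closest to 0.5 (maximal reward variance p * (1 - p))."""
    draws = {idx: t.thompson_draw() for idx, t in trackers.items()}
    ranked = sorted(draws, key=lambda idx: abs(draws[idx] - 0.5))
    return ranked[:batch_size]
```

Because selection happens before rollout generation, no extra rollouts are spent on groups whose rewards would be uniformly correct or uniformly wrong.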
Problem

Research questions and friction points this paper is trying to address.

Addresses gradient vanishing in group-based RL when all responses receive identical rewards
Solves computational inefficiency of filtering-based and static sampling-based methods
Enables dynamic sample selection for multimodal RL with real-time adaptability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online sample-level difficulty estimation using Beta distributions
Thompson sampler maximizing information gain via correctness probability
Two-scale prior decay mechanism maintaining robust estimation under policy evolution
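The two-scale prior decay can be illustrated with a minimal sketch: shrink accumulated Beta evidence toward the uniform prior so that estimates gathered under older policies lose weight. The decay rates and the two-scale schedule (a mild per-step decay plus a stronger periodic one) are illustrative assumptions, since the exact mechanism is not specified in this summary.

```python
def decay_posterior(alpha: float, beta: float, rate: float,
                    prior: float = 1.0) -> tuple:
    """Shrink Beta evidence toward the uniform Beta(1, 1) prior (sketch).
    rate=1.0 keeps all evidence; rate=0.0 resets to the prior."""
    return prior + rate * (alpha - prior), prior + rate * (beta - prior)

# Two time scales (illustrative values, not from the paper):
alpha, beta = 5.0, 3.0
alpha, beta = decay_posterior(alpha, beta, rate=0.99)  # slow scale, every step
alpha, beta = decay_posterior(alpha, beta, rate=0.80)  # fast scale, periodic
```

The design intuition is that a single decay rate either forgets too slowly to track policy shifts or too quickly to accumulate reliable evidence; combining two scales trades off both.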