🤖 AI Summary
Existing multimodal post-training paradigms suffer from two key limitations: (1) the lack of quantifiable metrics for sample difficulty, and (2) insufficient joint optimization of perceptual and reasoning capabilities. To address these, we propose a difficulty-aware hierarchical training framework. First, we introduce two unsupervised difficulty estimation strategies, Progressive Image Semantic Masking (PISM) and Cross-Modality Attention Balance (CMAB), to enable quantitative, difficulty-based sample selection. Second, building on these metrics, we compare a GRPO-only paradigm (Group Relative Policy Optimization) against a hybrid SFT+GRPO paradigm. Extensive experiments across six mainstream multimodal benchmarks demonstrate that GRPO applied to difficulty-stratified samples significantly improves reasoning accuracy over conventional SFT+GRPO pipelines while removing the supervised fine-tuning stage entirely, achieving both superior effectiveness and training efficiency.
📝 Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have spurred significant progress in Chain-of-Thought (CoT) reasoning. Building on the success of Deepseek-R1, researchers have extended multimodal reasoning to post-training paradigms based on reinforcement learning (RL), focusing predominantly on mathematical datasets. However, existing post-training paradigms tend to neglect two critical aspects: (1) the lack of quantifiable difficulty metrics capable of strategically screening samples for post-training optimization, and (2) suboptimal post-training paradigms that fail to jointly optimize perception and reasoning capabilities. To address these gaps, we propose two novel difficulty-aware sampling strategies: Progressive Image Semantic Masking (PISM) quantifies sample hardness through systematic image degradation, while Cross-Modality Attention Balance (CMAB) assesses cross-modal interaction complexity via attention distribution analysis. Leveraging these metrics, we design a hierarchical training framework that incorporates both GRPO-only and SFT+GRPO hybrid training paradigms, and evaluate them across six benchmark datasets. Experiments demonstrate consistent superiority of GRPO applied to difficulty-stratified samples over conventional SFT+GRPO pipelines, indicating that strategic data sampling can obviate the need for supervised fine-tuning while improving model accuracy. Our code will be released at https://github.com/qijianyu277/DifficultySampling.
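The two metrics can be illustrated with a minimal sketch. This is not the paper's implementation: the real PISM and CMAB operate on an MLLM's predictions and attention maps, so the `answer_fn` callback and the toy attention matrix below are illustrative stand-ins, and the exact masking schedule and scoring rules are assumptions.

```python
# Hedged sketch of PISM- and CMAB-style difficulty scoring.
# `answer_fn` stands in for a real MLLM forward pass; the grayscale image
# and hand-written attention matrix are toy examples, not model outputs.
import numpy as np

def pism_difficulty(image, answer_fn, mask_ratios=(0.25, 0.5, 0.75), seed=0):
    """PISM-style score in [0, 1]: the fraction of masking levels at which
    the model's answer deviates from the unmasked baseline. A sample whose
    answer flips under mild masking is treated as harder (more image-reliant)."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    baseline = answer_fn(image)
    flips = 0
    for ratio in mask_ratios:
        masked = image.copy()
        # Zero out a progressively larger random subset of pixels.
        idx = rng.choice(h * w, size=int(ratio * h * w), replace=False)
        masked.flat[idx] = 0
        if answer_fn(masked) != baseline:
            flips += 1
    return flips / len(mask_ratios)

def cmab_balance(attn, n_image_tokens):
    """CMAB-style score: share of total attention mass falling on image tokens
    (assumed to occupy the first `n_image_tokens` key positions). Values far
    from a balanced split suggest one modality dominates the interaction."""
    return float(attn[:, :n_image_tokens].sum() / attn.sum())

# Toy model: answers from mean brightness, so heavy masking flips its answer.
toy_model = lambda img: "bright" if img.mean() > 100 else "dark"
img = np.full((8, 8), 200, dtype=np.uint8)
print(pism_difficulty(img, toy_model))   # higher = harder sample

attn = np.array([[0.2, 0.3, 0.5],        # rows: text queries
                 [0.4, 0.1, 0.5]])       # first two columns: image tokens
print(cmab_balance(attn, n_image_tokens=2))  # → 0.5
```

Under this sketch, difficulty-stratified sampling would simply bucket training samples by these two scores before running GRPO on each bucket.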