MMRPT: MultiModal Reinforcement Pre-Training via Masked Vision-Dependent Reasoning

πŸ“… 2025-12-08
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Multimodal pretraining is vulnerable to descriptive bias in image-caption pairs, causing models to over-rely on superficial textual cues while undermining visual reasoning capabilities. To address this, we propose a reinforcement learning (RL)-driven visual grounding pretraining framework. First, cross-modal attention is leveraged to estimate visual dependency strength, enabling dynamic masking of highly dependent visual patches. Subsequently, visual grounding is formulated as an RL task, with a joint semantic-visual consistency reward function designed to guide policy optimization. This work represents the first direct integration of RL into the pretraining stage of multimodal large language models. Experiments demonstrate substantial improvements across multiple zero-shot transfer benchmarks; fine-tuned models further exhibit enhanced robustness. These results validate the framework’s effectiveness in strengthening deep visual understanding and out-of-distribution generalization.
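The dynamic masking step described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the attention matrix, sentence spans, and the mean-attention-mass proxy for visual dependency are all assumptions made for the example.

```python
import numpy as np

def visual_dependency_scores(attn, sentence_spans):
    """Estimate per-sentence visual dependency from cross-modal attention.

    attn: (num_text_tokens, num_visual_tokens) attention weights.
    sentence_spans: list of (start, end) token index pairs, one per sentence.

    Returns the mean attention mass each sentence places on visual tokens,
    used here as a hypothetical proxy for visual dependency strength.
    """
    return np.array([attn[s:e].sum(axis=1).mean() for s, e in sentence_spans])

def mask_vision_dependent(sentences, scores, ratio=0.5):
    """Replace the most vision-dependent sentences with a [MASK] placeholder."""
    k = max(1, int(len(sentences) * ratio))
    masked_idx = set(np.argsort(scores)[-k:].tolist())  # top-k highest dependency
    return ["[MASK]" if i in masked_idx else s for i, s in enumerate(sentences)]

# Toy example: 6 text tokens attending over 4 visual tokens, two 3-token sentences.
attn = np.array([
    [0.05, 0.05, 0.05, 0.05],   # sentence 0: weak visual attention
    [0.02, 0.03, 0.02, 0.03],
    [0.04, 0.01, 0.02, 0.03],
    [0.20, 0.25, 0.15, 0.20],   # sentence 1: strong visual attention
    [0.30, 0.10, 0.20, 0.15],
    [0.25, 0.20, 0.10, 0.25],
])
scores = visual_dependency_scores(attn, [(0, 3), (3, 6)])
masked = mask_vision_dependent(["a red car", "parked on grass"], scores, ratio=0.5)
# The strongly vision-dependent second sentence is masked for reconstruction.
```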

πŸ“ Abstract
Multimodal pre-training remains constrained by the descriptive bias of image-caption pairs, leading models to favor surface linguistic cues over grounded visual understanding. We introduce MMRPT, a masked multimodal reinforcement pre-training framework that strengthens visual reasoning in MLLMs. We are the first to incorporate reinforcement learning directly into the pre-training of large vision-language models, enabling learning signals that reward visual grounding rather than caption imitation. MMRPT constructs masked multimodal data by estimating sentence-level visual dependency via attention over visual tokens and masking highly vision-dependent segments; the model reconstructs these spans through vision-grounded reasoning guided by a semantic-visual reward. Experiments show consistent zero-shot gains across diverse benchmarks and substantially improved robustness under supervised fine-tuning, demonstrating that reinforcement-driven masked reasoning provides a more reliable and generalizable pre-training objective for multimodal models.
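A policy-gradient update over the masked-span reconstructions could take a standard REINFORCE-style form. This is a sketch under stated assumptions: the per-sample log-probabilities, reward values, and the mean-reward baseline are placeholders, and the paper may use a different optimizer or variance-reduction scheme.

```python
import numpy as np

def reinforce_loss(logprobs, rewards, baseline=None):
    """REINFORCE-style surrogate loss for span reconstruction.

    logprobs: per-sample summed token log-probabilities of the generated spans.
    rewards:  per-sample semantic-visual rewards.

    A mean-reward baseline (an assumption for this sketch) reduces the
    variance of the gradient estimate. Minimizing the returned value
    ascends the expected reward.
    """
    logprobs = np.asarray(logprobs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    if baseline is None:
        baseline = rewards.mean()
    advantages = rewards - baseline
    return float(-(advantages * logprobs).mean())
```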
Problem

Research questions and friction points this paper is trying to address.

Addresses descriptive bias in multimodal pre-training models
Enhances visual reasoning over surface linguistic cues
Introduces reinforcement learning for vision-grounded pre-training objectives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked multimodal data construction via visual dependency estimation
Reinforcement learning integration for vision-grounded reward signals
Semantic-visual reward-guided reconstruction of vision-dependent spans
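The joint semantic-visual consistency reward named above could be sketched as a weighted combination of embedding similarities. The embeddings, the cosine metric, and the mixing weight `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def semantic_visual_reward(pred_emb, ref_emb, img_emb, alpha=0.5):
    """Hypothetical joint reward: alpha-weighted semantic consistency with
    the reference span plus (1 - alpha)-weighted consistency with the
    image embedding. All embeddings stand in for real encoder outputs."""
    return alpha * cosine(pred_emb, ref_emb) + (1 - alpha) * cosine(pred_emb, img_emb)

# Toy example: a reconstruction matching the reference but orthogonal to the image.
r = semantic_visual_reward(np.array([1.0, 0.0]),
                           np.array([1.0, 0.0]),
                           np.array([0.0, 1.0]))
```

With `alpha=0.5`, a span that matches the reference caption but ignores the image earns only half the maximum reward, which is the intended pressure toward visual grounding rather than caption imitation.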
πŸ”Ž Similar Papers
No similar papers found.
Xuhui Zheng
SenseTime, Nanjing University
Kang An
SenseTime, Shenzhen University
Ziliang Wang
SenseTime
Yuhang Wang
SenseTime
Faqiang Qian
SenseTime
Yichao Wu
SenseTime Group Limited
AGI · LLM · Computer Vision · Face recognition