🤖 AI Summary
Multimodal pretraining is vulnerable to descriptive bias in image-caption pairs, causing models to over-rely on superficial textual cues while undermining visual reasoning capabilities. To address this, we propose a reinforcement learning (RL)-driven visual grounding pretraining framework. First, cross-modal attention is leveraged to estimate visual dependency strength, enabling dynamic masking of highly vision-dependent caption segments. Subsequently, visual grounding is formulated as an RL task, with a joint semantic-visual consistency reward function designed to guide policy optimization. This work represents the first direct integration of RL into the pretraining stage of multimodal large language models. Experiments demonstrate substantial improvements across multiple zero-shot transfer benchmarks; fine-tuned models further exhibit enhanced robustness. These results validate the framework's effectiveness in strengthening deep visual understanding and out-of-distribution generalization.
📄 Abstract
Multimodal pre-training remains constrained by the descriptive bias of image-caption pairs, leading models to favor surface linguistic cues over grounded visual understanding. We introduce MMRPT, a masked multimodal reinforcement pre-training framework that strengthens visual reasoning in MLLMs. We are the first to incorporate reinforcement learning directly into the pre-training of large vision-language models, enabling learning signals that reward visual grounding rather than caption imitation. MMRPT constructs masked multimodal data by estimating sentence-level visual dependency via attention over visual tokens and masking highly vision-dependent segments; the model reconstructs these spans through vision-grounded reasoning guided by a semantic-visual reward. Experiments show consistent zero-shot gains across diverse benchmarks and substantially improved robustness under supervised fine-tuning, demonstrating that reinforcement-driven masked reasoning provides a more reliable and generalizable pre-training objective for multimodal models.
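The masking step described above can be illustrated with a minimal sketch. The code below is a toy illustration, not the paper's implementation: the attention matrix is random, the sentence spans and the max-attention scoring rule are assumptions standing in for MMRPT's actual sentence-level visual-dependency estimate over the MLLM's attention to visual tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy cross-modal attention: rows = caption tokens, cols = visual patches.
# In MMRPT this would come from the MLLM's attention over visual tokens;
# here it is random data purely for illustration.
attn = rng.random((8, 16))
attn /= attn.sum(axis=1, keepdims=True)  # normalize per caption token

# Hypothetical sentence spans over the 8 caption tokens (start, end).
spans = {"sentence_0": (0, 4), "sentence_1": (4, 8)}

def dependency_scores(attn, spans):
    """Score each sentence by the mean peak attention its tokens place
    on visual patches -- a simple proxy for visual dependency (assumed
    scoring rule, not the paper's exact estimator)."""
    return {k: float(attn[a:b].max(axis=1).mean()) for k, (a, b) in spans.items()}

scores = dependency_scores(attn, spans)
# Mask the most vision-dependent sentence; the model must then
# reconstruct it through vision-grounded reasoning.
masked_sentence = max(scores, key=scores.get)
```

Under this sketch, the masked span becomes the reconstruction target, and the semantic-visual reward scores the model's reconstruction during RL optimization.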