🤖 AI Summary
Multimodal pretraining is vulnerable to descriptive bias in image-caption pairs, causing models to over-rely on superficial textual cues while undermining visual reasoning capabilities. To address this, we propose a reinforcement learning (RL)-driven visual grounding pretraining framework. First, cross-modal attention is leveraged to estimate visual dependency strength, enabling dynamic masking of highly vision-dependent caption segments. Subsequently, visual grounding is formulated as an RL task, with a joint semantic-visual consistency reward function designed to guide policy optimization. This work represents the first direct integration of RL into the pretraining stage of multimodal large language models. Experiments demonstrate substantial improvements across multiple zero-shot transfer benchmarks; fine-tuned models further exhibit enhanced robustness. These results validate the framework's effectiveness in strengthening deep visual understanding and out-of-distribution generalization.
📄 Abstract
Multimodal pre-training remains constrained by the descriptive bias of image-caption pairs, leading models to favor surface linguistic cues over grounded visual understanding. We introduce MMRPT, a masked multimodal reinforcement pre-training framework that strengthens visual reasoning in MLLMs. We are the first to incorporate reinforcement learning directly into the pre-training of large vision-language models, enabling learning signals that reward visual grounding rather than caption imitation. MMRPT constructs masked multimodal data by estimating sentence-level visual dependency via attention over visual tokens and masking highly vision-dependent segments; the model reconstructs these spans through vision-grounded reasoning guided by a semantic-visual reward. Experiments show consistent zero-shot gains across diverse benchmarks and substantially improved robustness under supervised fine-tuning, demonstrating that reinforcement-driven masked reasoning provides a more reliable and generalizable pre-training objective for multimodal models.
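The masking step described above can be illustrated with a minimal sketch. The code below is a toy illustration, not the paper's implementation: the attention matrix is random, the sentence spans and the max-attention scoring rule are assumptions standing in for MMRPT's actual sentence-level visual-dependency estimate over the MLLM's attention to visual tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy cross-modal attention: rows = caption tokens, cols = visual patches.
# In MMRPT this would come from the MLLM's attention over visual tokens;
# here it is random data purely for illustration.
attn = rng.random((8, 16))
attn /= attn.sum(axis=1, keepdims=True)  # normalize per caption token

# Hypothetical sentence spans over the 8 caption tokens (start, end).
spans = {"sentence_0": (0, 4), "sentence_1": (4, 8)}

def dependency_scores(attn, spans):
    """Score each sentence by the mean peak attention its tokens place
    on visual patches -- a simple proxy for visual dependency (assumed
    scoring rule, not the paper's exact estimator)."""
    return {k: float(attn[a:b].max(axis=1).mean()) for k, (a, b) in spans.items()}

scores = dependency_scores(attn, spans)
# Mask the most vision-dependent sentence; the model must then
# reconstruct it through vision-grounded reasoning.
masked_sentence = max(scores, key=scores.get)
```

Under this sketch, the masked span becomes the reconstruction target, and the semantic-visual reward scores the model's reconstruction during RL optimization.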