Activating Visual Context and Commonsense Reasoning through Masked Prediction in VLMs

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models exhibit limited generalization in multimodal reasoning, primarily due to overreliance on text-summarized image representations and insufficient joint modeling of visual context and commonsense knowledge. To address this, we propose the "Masked Prediction via Context and Commonsense" (MPCC) training paradigm: by masking image regions and requiring the model to reconstruct them through explicit integration of local visual context and external commonsense knowledge, MPCC strengthens cross-modal reasoning capabilities. We further introduce the MPCC-Eval benchmark for systematic evaluation and a prior-guided reinforcement fine-tuning strategy to improve out-of-distribution robustness and cross-task generalization. Experiments demonstrate that MPCC significantly outperforms strong baselines across diverse vision-language reasoning tasks, achieving up to a 12.7% improvement under distribution-shift and task-transfer settings. These results validate that commonsense-informed visual reconstruction effectively promotes general-purpose multimodal reasoning.
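The masked-prediction objective summarized above can be sketched in a few lines: occlude random image regions, then score a model's reconstruction only on the hidden regions. This is a minimal toy sketch, not the paper's implementation; the patch size, mask ratio, and the use of plain MSE over pixels are illustrative assumptions (the actual method reconstructs semantically meaningful content with a VLM).

```python
import numpy as np

def mask_patches(image, patch=4, ratio=0.3, rng=None):
    """Zero out a random subset of non-overlapping patches.

    Returns the occluded image and a boolean grid marking which
    patches were hidden (the reconstruction targets).
    """
    rng = rng or np.random.default_rng(0)
    h, w = image.shape[:2]
    gh, gw = h // patch, w // patch
    mask = rng.random((gh, gw)) < ratio  # True = occluded patch
    occluded = image.copy()
    for i in range(gh):
        for j in range(gw):
            if mask[i, j]:
                occluded[i * patch:(i + 1) * patch,
                         j * patch:(j + 1) * patch] = 0.0
    return occluded, mask

def reconstruction_loss(pred, target, mask, patch=4):
    """Mean-squared error computed only over the occluded patches."""
    errs = []
    for i, j in zip(*np.nonzero(mask)):
        p = pred[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
        t = target[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
        errs.append(np.mean((p - t) ** 2))
    return float(np.mean(errs)) if errs else 0.0

# Toy example: a 16x16 grayscale "image" with a smooth gradient.
img = np.linspace(0.0, 1.0, 16 * 16).reshape(16, 16)
occluded, mask = mask_patches(img)
# A trivial "predictor" that returns the occluded input unchanged
# incurs a nonzero loss on every hidden patch:
loss = reconstruction_loss(occluded, img, mask)
```

In the paper's setting the predictor is the VLM itself, and the supervision targets semantic content rather than raw pixels, but the train/score split over occluded regions follows the same shape.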

📝 Abstract
Recent breakthroughs in reasoning models have markedly advanced the reasoning capabilities of large language models, particularly via training on tasks with verifiable rewards. Yet a significant gap persists in their adaptation to real-world multimodal scenarios, most notably vision-language tasks, due to a heavy focus on single-modal language settings. While efforts to transplant reinforcement learning techniques from NLP to VLMs have emerged, these approaches often remain confined to perception-centric tasks or reduce images to textual summaries, failing to fully exploit visual context and commonsense knowledge and ultimately constraining the generalization of reasoning capabilities across diverse multimodal environments. To address this limitation, we introduce a novel fine-tuning task, Masked Prediction via Context and Commonsense, which forces models to integrate visual context and commonsense reasoning by reconstructing semantically meaningful content from occluded images, thereby laying the foundation for generalized reasoning. To systematically evaluate model performance on generalized reasoning, we developed a specialized evaluation benchmark, MPCC-Eval, and employed various fine-tuning strategies to guide reasoning. Among these, we introduced an innovative training method, Reinforcement Fine-tuning with Prior Sampling, which not only enhances model performance but also improves generalized reasoning in OOD and cross-task scenarios.
Problem

Research questions and friction points this paper is trying to address.

Addressing multimodal reasoning gaps in vision-language models
Enhancing visual context integration and commonsense reasoning capabilities
Improving generalization of reasoning across diverse multimodal environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked prediction task integrates visual context and commonsense
Reinforcement fine-tuning with prior sampling enhances generalization
Specialized MPCC Eval benchmark systematically assesses reasoning capabilities
👥 Authors
Jiaao Yu, Shenwei Li, Mingjie Han, Yifei Yin, Wenzheng Song, Chenghao Jia, Man Lan
School of Computer Science and Technology, East China Normal University