Testing the limits of fine-tuning to improve reasoning in vision language models

📅 2025-02-21

📈 Citations: 0

✨ Influential: 0

career value

208K/year

🤖 AI Summary

Pretrained vision-language models (VLMs) exhibit limited robustness and poor generalization across tasks and visual features in human-level visual cognition—particularly intuitive physics and causal reasoning. To address this, we construct the first standardized visual stimulus set spanning multiple cognitive domains, accompanied by human behavioral annotations, and perform supervised fine-tuning on representative VLMs (e.g., CLIP, Flamingo). We systematically evaluate behavioral alignment and cross-domain generalization along dimensions including physical intuition and causal inference. Results show that fine-tuning significantly improves model performance on target cognitive tasks and alignment with human responses; however, gains fail to transfer to unseen visual features or heterogeneous cognitive tasks—revealing a fundamental domain specificity limitation in current VLMs. This work establishes the first cross-cognitive-domain benchmark for visual cognition evaluation, providing a novel paradigm and empirical foundation for advancing cognitive interpretability and human-like reasoning in VLMs.

Technology Category

Application Category

📝 Abstract

Pre-trained vision language models still fall short of human visual cognition. In an effort to improve visual cognition and align models with human behavior, we introduce visual stimuli and human judgments on visual cognition tasks, allowing us to systematically evaluate performance across cognitive domains under a consistent environment. We fine-tune models on ground truth data for intuitive physics and causal reasoning and find that this improves model performance in the respective fine-tuning domain. Furthermore, it can improve model alignment with human behavior. However, we find that fine-tuning does not contribute to robust human-like generalization to data with other visual characteristics or to tasks in other cognitive domains.

Problem

Research questions and friction points this paper is trying to address.

Enhancing visual cognition in models

Aligning models with human behavior

Limiting fine-tuning for robust generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning vision language models

Incorporates human visual cognition tasks

Enhances model alignment with humans

🔎 Similar Papers

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling