🤖 AI Summary
Pretrained vision-language models (VLMs) exhibit limited robustness and poor generalization across tasks and visual features in human-level visual cognition—particularly intuitive physics and causal reasoning. To address this, we construct the first standardized visual stimulus set spanning multiple cognitive domains, accompanied by human behavioral annotations, and perform supervised fine-tuning on representative VLMs (e.g., CLIP, Flamingo). We systematically evaluate behavioral alignment and cross-domain generalization along dimensions including physical intuition and causal inference. Results show that fine-tuning significantly improves model performance on target cognitive tasks and alignment with human responses; however, gains fail to transfer to unseen visual features or heterogeneous cognitive tasks—revealing a fundamental domain specificity limitation in current VLMs. This work establishes the first cross-cognitive-domain benchmark for visual cognition evaluation, providing a novel paradigm and empirical foundation for advancing cognitive interpretability and human-like reasoning in VLMs.
📝 Abstract
Pre-trained vision language models still fall short of human visual cognition. In an effort to improve visual cognition and align models with human behavior, we introduce visual stimuli and human judgments on visual cognition tasks, allowing us to systematically evaluate performance across cognitive domains under a consistent environment. We fine-tune models on ground truth data for intuitive physics and causal reasoning and find that this improves model performance in the respective fine-tuning domain. Furthermore, it can improve model alignment with human behavior. However, we find that fine-tuning does not contribute to robust human-like generalization to data with other visual characteristics or to tasks in other cognitive domains.