🤖 AI Summary
This work investigates whether multimodal large language models (MLLMs) possess human-like visual perception capabilities, revealing catastrophic failures on synthetic image tasks that humans solve intuitively. To address this, the authors introduce the Turing Eye Test (TET), a perception-oriented benchmark comprising four diagnostic tasks that require no linguistic reasoning and specifically probe low-level visual intuition. Diverging from dominant reasoning-centric evaluation paradigms, TET shifts focus to foundational visual generalization. Empirical analysis demonstrates that the performance bottleneck stems primarily from the vision encoder, not the language model, and that conventional full-model fine-tuning is ineffective; only vision-tower adaptation yields substantial gains. These findings indicate that current MLLMs exhibit severe deficits in visual generality, underscoring a critical gap in visual representation learning.
📝 Abstract
Achieving human-like perception and reasoning in Multimodal Large Language Models (MLLMs) remains a central challenge in artificial intelligence. While recent research has primarily focused on enhancing reasoning capabilities in MLLMs, a fundamental question persists: can MLLMs truly perceive the world as humans do? This paper shifts focus from reasoning to perception. Rather than constructing benchmarks tailored to reasoning, we introduce the Turing Eye Test (TET), a challenging perception-oriented benchmark comprising four diagnostic tasks that evaluate MLLMs' performance on synthetic images that humans process intuitively. Our findings reveal that state-of-the-art MLLMs exhibit catastrophic failures on our perceptual tasks, which are trivial for humans. Both in-context learning and training the language backbone (effective on previous benchmarks) fail to improve performance on our tasks, while fine-tuning the vision tower enables rapid adaptation. This suggests that our benchmark challenges the generalization of the vision tower rather than the knowledge and reasoning capabilities of the language backbone, exposing a key gap between current MLLMs and human perception. We release a representative subset of TET tasks in this version, and will introduce more diverse tasks and methods to enhance visual generalization in future work.
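The abstract reports that fine-tuning only the vision tower, while keeping the language backbone frozen, enables rapid adaptation. A minimal PyTorch sketch of that freezing strategy is shown below; the module names (`vision_tower`, `language_model`) and the tiny linear stand-ins are illustrative assumptions, not TET's actual code or model architecture.

```python
import torch
import torch.nn as nn


class ToyMLLM(nn.Module):
    """Toy stand-in for an MLLM: a vision encoder feeding a language backbone."""

    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Linear(16, 8)   # stand-in for a ViT encoder
        self.language_model = nn.Linear(8, 4)  # stand-in for the LLM backbone

    def forward(self, x):
        return self.language_model(self.vision_tower(x))


model = ToyMLLM()

# Freeze every parameter except those of the vision tower.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("vision_tower")

# Only the still-trainable (vision-tower) parameters go to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

trainable = sorted(n for n, p in model.named_parameters() if p.requires_grad)
print(trainable)  # only vision_tower.* parameters remain trainable
```

A training loop over this setup would then update only the vision encoder's weights, which is the adaptation the paper found effective, in contrast to tuning the language backbone alone.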