Why We Look Where We Look: Emergent Human-like Fixations of a Foveated Visual Language Model Maximizing Scene Understanding

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates the origins of human eye movement patterns during free viewing and proposes that they emerge as a byproduct of optimizing scene understanding under the constraints of foveated vision. To test this hypothesis, the authors developed a computational agent with a simulated foveated visual system and trained it, via reinforcement learning or self-supervised strategies, to perform scene understanding tasks without any exposure to human gaze data. The agent spontaneously developed fixation behaviors consistent with those of humans and predicted human fixation patterns more accurately than control versions trained for search or classification, or equipped with peripheral vision better or worse than human vision. The work provides computational modeling evidence linking gaze patterns to perceptual goals, suggesting that human-like fixations arise not from task-specific tuning but from general principles of efficient visual processing under biological constraints.

📝 Abstract

When humans view scenes without a specific task (free-viewing), they initially direct their eye movements toward the scene center and then fixate on people, text, objects being gazed at or grasped, and semantically meaningful regions. What these signature fixation patterns reflect and whether they optimize an underlying perceptual task remain unknown. We show that a computational agent with simulated foveation, trained to optimize scene comprehension, exhibits emergent human fixation signature patterns. In contrast, versions of the agent trained to search or classify scenes, or equipped with peripheral vision that was better or worse than human vision, predicted human fixation patterns less accurately. Thus, human free-viewing fixation patterns may emerge as a functional byproduct of optimizing scene comprehension under the biological constraints of foveated vision.
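
As a rough illustration of what "simulated foveation" can mean, the sketch below blurs a grayscale image more strongly the farther a pixel lies from the current fixation point. It is not the authors' implementation: the function names `box_blur` and `foveate`, the parameters `fovea_radius` and `max_k`, and the choice of a box filter with stepwise blur levels are all assumptions made for this example.

```python
import numpy as np


def box_blur(img, k):
    """Blur a 2D array with a (2k+1) x (2k+1) box filter, using edge padding."""
    img = img.astype(float)
    if k == 0:
        return img
    h, w = img.shape
    padded = np.pad(img, k, mode="edge")
    size = 2 * k + 1
    out = np.zeros((h, w), dtype=float)
    # Sum all shifted copies of the padded image, then normalize.
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + h, dx:dx + w]
    return out / size**2


def foveate(img, fixation, fovea_radius=8.0, max_k=4):
    """Return a copy of img whose blur grows with distance from fixation.

    fixation is a (row, col) pair. fovea_radius and max_k are hypothetical
    knobs: a larger fovea_radius or smaller max_k mimics better peripheral
    vision, and the reverse mimics worse peripheral vision.
    """
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    eccentricity = np.hypot(ys - fixation[0], xs - fixation[1])
    # Blur level 0 inside the fovea, increasing by one per fovea_radius pixels.
    level = np.clip(np.floor(eccentricity / fovea_radius), 0, max_k).astype(int)
    # Precompute one blurred image per level, then pick per pixel.
    stack = np.stack([box_blur(img, k) for k in range(max_k + 1)])
    return np.take_along_axis(stack, level[None], axis=0)[0]


rng = np.random.default_rng(0)
image = rng.random((64, 64))
foveated = foveate(image, fixation=(32, 32))
```

Here the region around the fixation point is left untouched while the periphery loses fine detail, so an agent built on such input has to move its fixation to resolve distant content. Varying `fovea_radius` or `max_k` is one simple way to imitate the better-than-human or worse-than-human peripheral vision conditions mentioned in the abstract.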

Problem

Research questions and friction points this paper is trying to address.

human fixation patterns

scene understanding

foveated vision

free-viewing

perceptual task

Innovation

Methods, ideas, or system contributions that make the work stand out.

foveated vision

scene understanding

human-like fixation

emergent behavior

computational modeling

🔎 Similar Papers

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

2024-02-09 · European Conference on Computer Vision · Citations: 29