Predicting Reaction Time to Comprehend Scenes with Foveated Scene Understanding Maps

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of computationally predicting reaction time in scene understanding. The authors propose the Foveated Scene Understanding Map (F-SUM), the first image-computable framework integrating foveated vision modeling with vision-language models (VLMs). F-SUM simulates oculomotor constraints to generate a spatial map of scene-understanding difficulty over the image, then aggregates it into a global score. Its core innovation is the explicit coupling of the task-relevant information distribution with the foveal resolution decay profile, overcoming key limitations of conventional image-complexity metrics. Experiments show that F-SUM scores correlate significantly with human behavioral measures: mean reaction time (r = 0.47), fixation count (r = 0.51), and time-limited description accuracy (r = −0.56). F-SUM consistently outperforms existing baselines, establishing a new paradigm for cognitive modeling of scene understanding and for evaluating human–AI interaction.
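The coupling of an information distribution with a foveal resolution fall-off can be illustrated with a toy computation. This is a hypothetical sketch, not the authors' implementation: the real model scores VLM-generated scene descriptions under foveated rendering, whereas here a weight map stands in for the task-relevant information distribution, and `foveal_gain` is an assumed, illustrative decay profile (all function names and parameters are invented for this example).

```python
import numpy as np

def foveal_gain(ecc_deg, sigma0=0.5, slope=0.35):
    """Toy fraction of information transmitted at a given eccentricity (degrees).

    Assumed decay profile: gain shrinks as eccentricity-driven blur grows.
    """
    return 1.0 / (1.0 + sigma0 + slope * ecc_deg)

def f_sum_map(info_map):
    """For each candidate fixation cell (1 cell = 1 deg of visual angle),
    sum the task-relevant information that survives the foveal fall-off."""
    h, w = info_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    out = np.empty((h, w))
    for fy in range(h):
        for fx in range(w):
            ecc = np.hypot(ys - fy, xs - fx)  # eccentricity from this fixation
            out[fy, fx] = float(np.sum(info_map * foveal_gain(ecc)))
    return out

def f_sum_score(fmap):
    """Aggregate the map into one scalar (here: the best single fixation)."""
    return float(fmap.max())
```

With this sketch, a scene whose relevant information is concentrated in one spot yields a map that peaks when fixation lands on that spot, while scattered information depresses the score everywhere, which is the intuition behind tying comprehension difficulty to the interaction of both factors.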

📝 Abstract
Although models exist that predict human response times (RTs) in tasks such as target search and visual discrimination, the development of image-computable predictors for scene understanding time remains an open challenge. Recent advances in vision-language models (VLMs), which can generate scene descriptions for arbitrary images, combined with the availability of quantitative metrics for comparing linguistic descriptions, offer a new opportunity to model human scene understanding. We hypothesize that the primary bottleneck in human scene understanding and the driving source of variability in response times across scenes is the interaction between the foveated nature of the human visual system and the spatial distribution of task-relevant visual information within an image. Based on this assumption, we propose a novel image-computable model that integrates foveated vision with VLMs to produce a spatially resolved map of scene understanding as a function of fixation location (Foveated Scene Understanding Map, or F-SUM), along with an aggregate F-SUM score. This metric correlates with average (N=17) human RTs (r=0.47) and number of saccades (r=0.51) required to comprehend a scene (across 277 scenes). The F-SUM score also correlates with average (N=16) human description accuracy (r=-0.56) in time-limited presentations. These correlations significantly exceed those of standard image-based metrics such as clutter, visual complexity, and scene ambiguity based on language entropy. Together, our work introduces a new image-computable metric for predicting human response times in scene understanding and demonstrates the importance of foveated visual processing in shaping comprehension difficulty.
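The reported relationships (r = 0.47 with mean RT, r = 0.51 with saccade count, r = −0.56 with description accuracy) are Pearson correlations between per-scene model scores and averaged behavioral measures. A minimal sketch of that evaluation step, using toy data rather than the paper's measurements:

```python
import numpy as np

def pearson_r(scores, behavior):
    """Pearson correlation between per-scene model scores and a behavioral measure."""
    scores = np.asarray(scores, dtype=float)
    behavior = np.asarray(behavior, dtype=float)
    return float(np.corrcoef(scores, behavior)[0, 1])

# Toy illustration: a measure that decreases linearly with the score gives
# r = -1, mirroring the negative sign of the description-accuracy correlation.
scores = np.array([1.0, 2.0, 3.0, 4.0])
accuracy = -2.0 * scores + 10.0
```

In the paper this computation would run over 277 scenes, with RTs and saccade counts averaged over 17 observers and description accuracy over 16.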
Problem

Research questions and friction points this paper is trying to address.

- Predict human scene understanding time using foveated vision and VLMs
- Model the interaction between foveated vision and the spatial distribution of task-relevant visual information
- Develop an image-computable metric that correlates with human response times and accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Integrates foveated vision modeling with vision-language models (VLMs)
- Produces a spatially resolved map of scene understanding as a function of fixation location
- Yields an aggregate score that correlates with human response times and description accuracy
Ziqi Wen
UC Santa Barbara
Vision

Jonathan Skaza
University of California, Santa Barbara
NeuroAI, Machine Perception, Computational Neuroscience, Artificial Intelligence, Computer Vision

Shravan Murlidaran
University of California, Santa Barbara
Deep Learning, Human Vision, Machine Vision, Computational Cognitive Science, Cognitive Psychology

William Y. Wang
Department of Computer Science, University of California, Santa Barbara

Miguel P. Eckstein
Department of Computer Science, University of California, Santa Barbara; Graduate Program in Dynamical Neuroscience, University of California, Santa Barbara; Department of Psychological and Brain Sciences, University of California, Santa Barbara