Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models

📅 2026-04-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes an autoregressive vision-language framework that unifies foveated attention and reasoning within a single decoding process. High-resolution images improve vision-language model performance, but strict visual-token budgets limit how much of that resolution can be processed efficiently at inference time. During generation, the proposed method decides whether and where to attend to high-resolution regions and immediately injects the acquired fine-grained evidence into the current reasoning trajectory. A state-aware action mechanism enables on-demand foveation and avoids costly global high-resolution processing. The model is trained in two stages: supervised cold-start training first bootstraps the foveation behavior, and reinforcement learning then jointly optimizes evidence acquisition and task accuracy. Experiments show that, under stringent token constraints, the model learns efficient foveation policies and improves reasoning accuracy across multiple vision-language benchmarks.
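The decoding process described above can be illustrated with a small sketch. It is not the paper's implementation: the `Foveate` and `Emit` action types, the `policy` callable, the one-token-per-pixel cost model, and the `<budget-exceeded>` marker are all hypothetical names and assumptions introduced here, and a nested list stands in for an image.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union

Box = Tuple[int, int, int, int]  # (top, left, bottom, right) in full-resolution pixels
Image = List[List[int]]          # toy grayscale image as nested lists


@dataclass
class Foveate:
    """Hypothetical action: request a high-resolution crop of `box`."""
    box: Box


@dataclass
class Emit:
    """Hypothetical action: emit one ordinary text token."""
    token: str


Action = Union[Foveate, Emit]
Context = List[Tuple[str, object]]  # ("image", view) or ("text", token) entries


def downsample(image: Image, stride: int) -> Image:
    """Coarse view: keep every `stride`-th pixel in each dimension."""
    return [row[::stride] for row in image[::stride]]


def crop(image: Image, box: Box) -> Image:
    """High-acuity evidence: the full-resolution pixels inside `box`."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def count_tokens(view: Image) -> int:
    """Toy cost model (an assumption): one visual token per pixel."""
    return sum(len(row) for row in view)


def foveated_decode(
    image: Image,
    policy: Callable[[Context], Action],
    stride: int = 4,
    token_budget: int = 64,
    max_steps: int = 32,
) -> Tuple[Context, int]:
    """Run one decoding trajectory that interleaves text and foveation."""
    coarse = downsample(image, stride)
    context: Context = [("image", coarse)]  # start from the low-resolution view
    used = count_tokens(coarse)
    for _ in range(max_steps):
        action = policy(context)  # the policy sees everything decoded so far
        if isinstance(action, Foveate):
            patch = crop(image, action.box)
            cost = count_tokens(patch)
            if used + cost > token_budget:
                # Assumed behavior: refuse the crop and tell the policy why.
                context.append(("text", "<budget-exceeded>"))
                continue
            context.append(("image", patch))  # inject evidence into the same trajectory
            used += cost
        else:
            context.append(("text", action.token))
            if action.token == "<eos>":
                break
    return context, used
```

The point of the sketch is that foveation is just another action in the same loop: the crop is appended to the context the policy already reads, so later tokens can condition on it without a separate high-resolution pass over the whole image.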

📝 Abstract

Vision-language models benefit from high-resolution images, but the resulting increase in visual-token count incurs high compute overhead. Humans resolve this tension via foveation: a coarse view guides "where to look", while selectively acquired high-acuity evidence refines "what to think". We introduce Foveated Reasoner, an autoregressive vision-language framework that unifies foveation and reasoning within a single decoding trajectory. Starting from a low-resolution view, the model triggers foveation only when needed, retrieves high-resolution evidence from selected regions, and injects it back into the same decoding trajectory. We train the method with a two-stage pipeline: cold-start supervision to bootstrap foveation behavior, followed by reinforcement learning to jointly improve evidence acquisition and task accuracy while discouraging trivial "see-everything" solutions. Experiments show that the method learns effective foveation policies and achieves stronger accuracy under tight visual-token budgets across multiple vision-language benchmarks.
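The abstract does not give the reinforcement-learning reward, so the following is only one plausible shape for an objective that rewards task accuracy while discouraging "see-everything" behavior. The function name `foveation_reward`, the linear cost term, and the `cost_weight` parameter are assumptions made for illustration, not the paper's formulation.

```python
def foveation_reward(
    correct: bool,
    fovea_tokens: int,
    full_res_tokens: int,
    cost_weight: float = 0.5,
) -> float:
    """Toy reward: task accuracy minus a penalty on high-resolution usage.

    `fovea_tokens` is the number of visual tokens spent on foveated crops,
    and `full_res_tokens` is what encoding the whole image at full
    resolution would cost. Both the linear penalty and `cost_weight`
    are assumptions for this sketch.
    """
    if full_res_tokens <= 0:
        raise ValueError("full_res_tokens must be positive")
    # Fraction of the full-resolution cost actually spent, capped at 1.
    fraction = min(fovea_tokens / full_res_tokens, 1.0)
    accuracy_term = 1.0 if correct else 0.0
    return accuracy_term - cost_weight * fraction
```

Under this shaping, a correct answer reached with a quarter of the full-resolution tokens scores higher than the same answer reached by cropping the entire image, which is the qualitative pressure the abstract describes.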

Problem

Research questions and friction points this paper is trying to address.

foveation

vision-language models

visual-token budget

high-resolution images

compute overhead

Innovation

Methods, ideas, or system contributions that make the work stand out.

foveated reasoning

vision-language models

visual foveation