🤖 AI Summary
This work addresses a perceptual bandwidth bottleneck in vision-language models: under a wide field of view, they lose the fine-grained details needed for complex reasoning. The paper frames visual perception as a sequential decision-making process, drawing on the classical ideas of active vision and information foraging, and formalizes it as a Sequential Bayesian Optimal Experimental Design (S-BOED) problem. Because exact Bayesian inference is intractable in continuous gigapixel spaces, it derives tractable approximations that balance spatial coverage against resolution, and instantiates them as a training-free inference strategy for agents equipped with multiple vision tools. The strategy accommodates different optimization algorithms, from greedy sampling to lookahead planning. On gigapixel-level benchmarks, the approach further improves state-of-the-art models, significantly outperforms standard baselines, and narrows the gap to human-annotated oracles.
📝 Abstract
Visual perception in modern Vision-Language Models (VLMs) is constrained by a fundamental perceptual bandwidth bottleneck: a broad field of view inevitably sacrifices the fine-grained details necessary for complex reasoning. Inspired by the classical paradigms of active vision and information foraging, we frame overcoming this limitation as a sequential decision-making process. We formalise this process through the lens of the sequential Bayesian optimal experimental design (S-BOED) problem. While exact Bayesian inference is intractable in continuous gigapixel spaces, we derive principled yet tractable approximations that balance spatial coverage against resolution. To validate this framework, we present a training-free inference strategy as a practical instantiation of the S-BOED objective for agents equipped with multiple vision tools. Designed as a flexible template, this strategy accommodates arbitrary optimisation algorithms, ranging from efficient greedy sampling to look-ahead planning, to approximate the optimal design. Empirical evaluations on gigapixel-level benchmarks demonstrate that our approach further boosts the performance of state-of-the-art models, significantly outperforming standard baselines and effectively narrowing the gap towards human-annotated oracles.
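To make the greedy end of that spectrum concrete, here is a minimal toy sketch of one greedy experimental-design step. It is not the paper's implementation. It assumes a discretised image of a few cells, a target hidden in exactly one cell, and a binary "found / not found" observation per crop. The names `reliability`, `posterior`, `expected_information_gain` and `greedy_design`, and the noise model in which larger crops give less reliable observations, are all hypothetical choices made for illustration.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)


def reliability(region):
    """Hypothetical noise model (an assumption, not from the paper):
    a crop covering more cells is seen at lower resolution, so a
    detection inside it is less reliable. Regions must be non-empty."""
    return 0.5 + 0.45 / len(region)


def posterior(prior, region, hit):
    """Bayes update of the belief over the target cell after observing
    `hit` (True/False) for a crop covering the cell indices in `region`.
    Returns (posterior, marginal probability of the observation)."""
    r = reliability(region)
    unnorm = []
    for cell, p in enumerate(prior):
        p_hit = r if cell in region else 1.0 - r
        unnorm.append(p * (p_hit if hit else 1.0 - p_hit))
    z = sum(unnorm)
    return [u / z for u in unnorm], z


def expected_information_gain(prior, region):
    """Expected entropy reduction from inspecting `region`, averaged
    over the two possible observations."""
    h_prior = entropy(prior)
    gain = 0.0
    for hit in (True, False):
        post, p_obs = posterior(prior, region, hit)
        gain += p_obs * (h_prior - entropy(post))
    return gain


def greedy_design(prior, candidates):
    """One greedy step: pick the candidate crop with the highest
    expected information gain under the current belief."""
    return max(candidates, key=lambda reg: expected_information_gain(prior, reg))


if __name__ == "__main__":
    prior = [1 / 8] * 8  # uniform belief over 8 cells
    candidates = [
        frozenset({0}),           # narrow, high-resolution crop
        frozenset({0, 1}),
        frozenset({0, 1, 2, 3}),
        frozenset(range(8)),      # full view: wide but uninformative here
    ]
    for region in candidates:
        print(sorted(region), expected_information_gain(prior, region))
    print("greedy choice:", sorted(greedy_design(prior, candidates)))
```

In this toy model the full-view crop yields zero expected gain, because its observation is equally likely under every hypothesis, while narrower crops trade coverage for reliability. A lookahead planner would replace the single `max` with a search over sequences of crops.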