Active Reasoning Vision-Language Models via Sequential Experimental Design

📅 2026-05-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a perceptual bandwidth bottleneck in vision-language models: a wide field of view sacrifices the fine-grained details needed for complex reasoning. Drawing on the classical paradigms of active vision and information foraging, the paper frames visual perception as a sequential decision-making process and formalises it as a sequential Bayesian optimal experimental design (S-BOED) problem. Because exact Bayesian inference is intractable in continuous gigapixel spaces, it derives tractable approximations that balance spatial coverage against resolution, and instantiates them as a training-free inference strategy for agents equipped with multiple vision tools. The strategy accommodates different optimisation algorithms, from greedy sampling to look-ahead planning. On gigapixel-level benchmarks, the approach further improves state-of-the-art models, significantly outperforms standard baselines, and narrows the gap to human-annotated oracles.
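For orientation, the textbook form of the sequential BOED objective can be written as below. This is the standard formulation from the experimental design literature, not notation taken from the paper: the symbols $\theta$ (the latent quantity of interest, such as the answer), $d_t$ (the design at step $t$, such as which region to view and at what resolution), $y_t$ (the resulting observation), $h_{t-1}$ (the history of earlier designs and observations), and $\pi$ (the design policy) are all assumptions made here for illustration.

```latex
\mathrm{EIG}(d_t \mid h_{t-1})
  = \mathbb{E}_{p(y_t \mid d_t, h_{t-1})}
    \Big[ \mathrm{H}\big[p(\theta \mid h_{t-1})\big]
        - \mathrm{H}\big[p(\theta \mid h_{t-1}, d_t, y_t)\big] \Big],
\qquad
\pi^{\star} = \arg\max_{\pi} \;
  \mathbb{E}_{\pi}\Big[ \sum_{t=1}^{T} \mathrm{EIG}(d_t \mid h_{t-1}) \Big],
\quad d_t = \pi(h_{t-1}).
```

A greedy strategy maximises the one-step expected information gain at each step, while look-ahead planning approximates the sum over several future steps.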

📝 Abstract

Visual perception in modern Vision-Language Models (VLMs) is constrained by a fundamental perceptual bandwidth bottleneck: a broad field of view inevitably sacrifices the fine-grained details necessary for complex reasoning. Inspired by the classical paradigms of active vision and information foraging, we frame overcoming this limitation as a sequential decision-making process. We formalise this process through the lens of the sequential Bayesian optimal experimental design (S-BOED) problem. While exact Bayesian inference is intractable in continuous gigapixel spaces, we derive principled yet tractable approximations that balance spatial coverage against resolution. To validate this framework, we present a training-free inference strategy as a practical instantiation of the S-BOED objective for agents equipped with multiple vision tools. Designed as a flexible template, this strategy accommodates arbitrary optimisation algorithms, ranging from efficient greedy sampling to look-ahead planning, to approximate the optimal design. Empirical evaluations on gigapixel-level benchmarks demonstrate that our approach further boosts the performance of state-of-the-art models, significantly outperforming standard baselines and effectively narrowing the gap towards human-annotated oracles.
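The coverage-versus-resolution trade-off and the greedy strategy mentioned above can be illustrated with a toy sketch. This is not the paper's method: it assumes a small discrete grid where the latent variable is the index of the one cell containing the target, exact Bayesian updates are tractable, and each glimpse returns a binary "target seen" signal. All names (`make_designs`, `run_greedy`, the accuracy values 0.9 and 0.99) are hypothetical choices made for illustration.

```python
import math
from typing import Callable, FrozenSet, List, Tuple

# A design is (cells covered by the glimpse, detection accuracy).
# Wide glimpses cover many cells but are noisier; zooms cover one cell reliably.
Design = Tuple[FrozenSet[int], float]


def entropy(p: List[float]) -> float:
    """Shannon entropy in bits."""
    return -sum(q * math.log2(q) for q in p if q > 0)


def likelihood(y: int, theta: int, design: Design) -> float:
    """P(y | theta, design), where y = 1 means 'target seen in this glimpse'."""
    cells, acc = design
    p_seen = acc if theta in cells else 1.0 - acc
    return p_seen if y == 1 else 1.0 - p_seen


def posterior(prior: List[float], design: Design, y: int) -> List[float]:
    """Exact Bayes update of the belief over the target cell."""
    unnorm = [p * likelihood(y, th, design) for th, p in enumerate(prior)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def expected_information_gain(prior: List[float], design: Design) -> float:
    """Expected entropy reduction over the two possible outcomes."""
    eig = 0.0
    for y in (0, 1):
        p_y = sum(p * likelihood(y, th, design) for th, p in enumerate(prior))
        if p_y > 0:
            eig += p_y * (entropy(prior) - entropy(posterior(prior, design, y)))
    return eig


def make_designs(n_cells: int) -> List[Design]:
    """Two wide half-image views (accuracy 0.9) plus one zoom per cell (0.99)."""
    half = n_cells // 2
    wide = [(frozenset(range(half)), 0.9), (frozenset(range(half, n_cells)), 0.9)]
    zoom = [(frozenset([c]), 0.99) for c in range(n_cells)]
    return wide + zoom


def run_greedy(prior: List[float], designs: List[Design],
               observe: Callable[[Design], int], steps: int) -> List[float]:
    """Greedy loop: pick the max-EIG design, observe, update, repeat."""
    belief = list(prior)
    for _ in range(steps):
        best = max(designs, key=lambda d: expected_information_gain(belief, d))
        belief = posterior(belief, best, observe(best))
    return belief
```

As a usage example, with eight cells, a uniform prior, and a noise-free observer such as `lambda d: 1 if 5 in d[0] else 0`, calling `run_greedy(prior, make_designs(8), observer, steps=6)` concentrates the belief on cell 5. Under the uniform prior the first greedy choice is a wide view, and zooms become preferable once the belief has narrowed, which mirrors the balance between spatial coverage and resolution described in the abstract.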

Problem

Research questions and friction points this paper is trying to address.

perceptual bandwidth bottleneck

Vision-Language Models

fine-grained details

visual perception

complex reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequential Bayesian Optimal Experimental Design

Active Vision

Vision-Language Models