Starve to Perceive: Taming Lazy Perception in VLMs with Constrained Visual Bandwidth

📅 2026-05-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses “lazy perception” in vision-language models operating on high-resolution inputs: because a coarse global view combined with linguistic priors is often enough for moderate accuracy, the model underuses active perceptual operations such as zooming or cropping. To counter this, the authors propose Starve to Perceive, a training paradigm that restricts the number of visual tokens available for each observation during standard post-training, so that the model must operate under a tight visual bandwidth in which no single view suffices. According to the authors, this constraint alone, without additional loss terms, reward shaping, or architectural modifications, induces the emergence of multi-step active perception strategies. The resulting models learn functional visual search behaviors, achieve an average relative improvement of 5% across multiple benchmarks, and show markedly improved capabilities on tasks requiring deliberate visual exploration.
📝 Abstract
Vision-Language Models (VLMs) deployed as situated agents in high-resolution visual environments require active perception -- the ability to dynamically decide where to look through operations like zooming, cropping, and panning. However, current training paradigms produce models that mimic the surface form of such operations without functionally depending on their outputs, a phenomenon we term lazy perception. We trace this to a fundamental learning asymmetry: when coarse global views combined with language priors suffice for moderate accuracy, the model has no incentive to learn harder multi-step visual search. If a model can succeed without actively looking, it will never learn to look. This motivates Starve to Perceive, a training paradigm that constrains visual bandwidth -- restricting each observation to a tight token budget so that no single view suffices for task completion, making active perception the only viable strategy. Despite requiring no auxiliary losses, reward shaping, or architectural changes -- serving as a minimal, plug-in modification to standard post-training pipelines -- models trained under perceptual starvation achieve a 5% average relative improvement across diverse benchmarks.
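To make the idea of a per-observation token budget concrete, here is a minimal sketch of how a view could be downscaled until it fits a fixed number of visual tokens. It is an illustration only, not the authors' implementation: the patch size of 14, the budget of 256 tokens, and the helper names `tokens_for` and `fit_to_budget` are all assumptions introduced here, and the abstract does not say how the budget is enforced.

```python
import math


def tokens_for(width: int, height: int, patch: int = 14) -> int:
    """Patch tokens a ViT-style encoder would emit for an image of this size.

    Assumes one token per patch x patch tile (hypothetical encoder layout).
    """
    return math.ceil(width / patch) * math.ceil(height / patch)


def fit_to_budget(width: int, height: int, budget: int, patch: int = 14):
    """Shrink (width, height), roughly keeping aspect ratio, until the view
    costs at most `budget` tokens. Views already within budget are unchanged."""
    if budget < 1:
        raise ValueError("budget must be at least 1 token")
    current = tokens_for(width, height, patch)
    if current <= budget:
        return width, height
    # Token count scales roughly with area, so start from the sqrt ratio.
    scale = math.sqrt(budget / current)
    while True:
        w = max(1, int(width * scale))
        h = max(1, int(height * scale))
        if tokens_for(w, h, patch) <= budget:
            return w, h
        scale *= 0.95  # rounding pushed us over budget; shrink a bit more


# Hypothetical budget: 256 visual tokens per observation.
BUDGET = 256

# A full 4096 x 4096 scene must be heavily downscaled to fit the budget...
global_view = fit_to_budget(4096, 4096, BUDGET)

# ...while a 224 x 224 crop of the same scene fits at native resolution.
crop_view = fit_to_budget(224, 224, BUDGET)
```

Under these assumptions the global view loses almost all fine detail, while a small crop keeps every source pixel. That is the asymmetry the abstract relies on: with a tight budget, fine-grained evidence is only reachable by choosing where to look.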
Problem

Research questions and friction points this paper is trying to address.

active perception
vision-language models
lazy perception
visual bandwidth
multi-step visual search
Innovation

Methods, ideas, or system contributions that make the work stand out.

active perception
visual bandwidth constraint
vision-language models
perceptual starvation
lazy perception