🤖 AI Summary
This work addresses the “lazy perception” that vision-language models exhibit in high-resolution scenarios: because global visual context and linguistic priors are often enough for moderate accuracy, the models underuse active perceptual operations such as zooming or cropping. To counter this, the authors propose “Starve to Perceive,” a perceptual-starvation training paradigm that dynamically restricts the number of visual tokens available at each step during standard post-training, forcing the model to operate under extremely limited visual bandwidth. This constraint alone, without additional loss terms, reward shaping, or architectural modifications, induces multi-step active perception strategies. The resulting models learn functional visual search behaviors, achieving an average relative performance gain of 5% across multiple benchmarks and markedly improved capabilities in tasks requiring deliberate visual exploration.
📝 Abstract
Vision-Language Models (VLMs) deployed as situated agents in high-resolution visual environments require active perception -- the ability to dynamically decide where to look through operations like zooming, cropping, and panning. However, current training paradigms produce models that mimic the surface form of such operations without functionally depending on their outputs, a phenomenon we term lazy perception. We trace this to a fundamental learning asymmetry: when coarse global views combined with language priors suffice for moderate accuracy, the model has no incentive to learn harder multi-step visual search. If a model can succeed without actively looking, it will never learn to look. This motivates Starve to Perceive, a training paradigm that constrains visual bandwidth -- restricting each observation to a tight token budget so that no single view suffices for task completion, making active perception the only viable strategy. Despite requiring no auxiliary losses, reward shaping, or architectural changes -- serving as a minimal, plug-in modification to standard post-training pipelines -- models trained under perceptual starvation achieve a 5% average relative improvement across diverse benchmarks.
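To make the token-budget idea concrete, here is a minimal sketch of how a per-observation budget might be enforced. It is not the authors' implementation: the function name `fit_to_budget`, the 14-pixel patch size, and the 256-token budget are all illustrative assumptions. The sketch assumes a ViT-style encoder that emits one visual token per square patch, and it shrinks the patch grid of any view that would exceed the budget.

```python
import math


def fit_to_budget(width, height, max_tokens, patch=14):
    """Return a (cols, rows) patch grid with cols * rows <= max_tokens.

    Hypothetical helper: assumes one visual token per `patch` x `patch`
    pixel tile. Views already within budget keep their native grid;
    larger views are downscaled, roughly preserving aspect ratio.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")

    # Native patch grid of the view at full resolution.
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    if cols * rows <= max_tokens:
        return cols, rows

    # Uniform scale factor that brings the token count down to the budget.
    scale = math.sqrt(max_tokens / (cols * rows))
    new_cols = max(1, math.floor(cols * scale))
    new_rows = max(1, math.floor(rows * scale))

    # Extreme aspect ratios can still overshoot after clamping to 1,
    # so trim the longer side until the grid fits.
    if new_cols * new_rows > max_tokens:
        if new_cols >= new_rows:
            new_cols = max_tokens // new_rows
        else:
            new_rows = max_tokens // new_cols
    return new_cols, new_rows


if __name__ == "__main__":
    budget = 256  # assumed budget, for illustration only
    # Full high-resolution frame: heavily downscaled under the budget.
    print(fit_to_budget(4032, 3024, budget))
    # Medium crop: still downscaled, but far less aggressively.
    print(fit_to_budget(448, 448, budget))
    # Small crop: fits the budget at native resolution.
    print(fit_to_budget(196, 140, budget))
```

Under these assumptions, the full frame is represented at a small fraction of its native patch grid, while a sufficiently small crop keeps every patch. Fine detail is therefore only reachable by zooming in, which is the pressure toward active perception that the abstract describes.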