Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs

📅 2026-03-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the trade-off between accuracy and computational efficiency that vision-language models face when processing high-resolution images. The authors propose AwaRes, a framework built on spatial on-demand processing: starting from a low-resolution global view, the model decides whether to call an external tool to retrieve high-resolution crops of the regions it needs, so fine-grained detail is acquired only when necessary. Training combines cold-start supervised fine-tuning (SFT) with multi-turn GRPO, guided by a composite reward that balances semantic answer correctness against crop cost. Supervision is generated automatically using a judge, which labels whether cropping is needed, and an oracle grounding model, which localizes the evidence. The method aims to cut the computational cost of high-resolution processing while preserving answer accuracy.
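To make the composite reward concrete, here is a minimal Python sketch of how an answer-correctness score might be traded off against crop cost, and how rewards could then be normalized within a group of rollouts in the GRPO style. The names `Rollout`, `composite_reward`, `group_advantages`, and the values `crop_penalty=0.1` and `max_crops=4` are illustrative assumptions; the summary does not give the paper's exact formula or coefficients.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    """One sampled multi-turn trajectory (hypothetical container)."""
    answer_score: float  # semantic correctness in [0, 1], e.g. from a judge
    num_crops: int       # number of high-resolution crops requested


def composite_reward(rollout, crop_penalty=0.1, max_crops=4):
    """Correctness minus a linear crop-cost penalty (assumed form)."""
    crops = min(rollout.num_crops, max_crops)  # cap the penalty
    return rollout.answer_score - crop_penalty * crops


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three rollouts for the same query: correct without cropping,
# correct with two crops, and wrong without cropping.
group = [Rollout(1.0, 0), Rollout(1.0, 2), Rollout(0.0, 0)]
rewards = [composite_reward(r) for r in group]
advantages = group_advantages(rewards)
```

Under these assumptions, a correct answer that needed no crop is rewarded more than an equally correct answer that requested crops, which is the behavior the crop-cost term is meant to encourage.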

📝 Abstract
Vision-language models (VLMs) typically process images at native high resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational cost, while low-resolution inputs are efficient but may miss critical visual information such as small text. We present AwaRes, a spatial-on-demand framework that resolves this accuracy-efficiency trade-off by operating on a low-resolution global view and using tool-calling to retrieve only the high-resolution segments needed for a given query. We construct supervised data automatically: a judge compares low- vs. high-resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi-turn tool-use trajectories. We train our framework with cold-start SFT followed by multi-turn GRPO with a composite reward that combines semantic answer correctness with explicit crop-cost penalties. Project page: https://nimrodshabtay.github.io/AwaRes
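
The data-construction step described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the labeling rule in `needs_crop` (crop only when the low-resolution answer fails and the high-resolution answer succeeds), the 2x2 grid used as the discrete crop set, and the function names are all hypothetical. Boxes are assumed to be normalized `(x0, y0, x1, y1)` coordinates from the grounding model.

```python
def needs_crop(low_res_correct: bool, high_res_correct: bool) -> bool:
    """Assumed labeling rule: cropping helps only if low-res fails and high-res succeeds."""
    return (not low_res_correct) and high_res_correct


def box_to_cells(box, rows=2, cols=2):
    """Map a normalized evidence box to the grid cells it overlaps.

    Cells are numbered row-major, so a 2x2 grid yields indices 0..3.
    """
    x0, y0, x1, y1 = box
    cells = []
    for r in range(rows):
        for c in range(cols):
            # Bounds of the current grid cell in normalized coordinates.
            cx0, cy0 = c / cols, r / rows
            cx1, cy1 = (c + 1) / cols, (r + 1) / rows
            # Standard axis-aligned rectangle overlap test.
            if x0 < cx1 and x1 > cx0 and y0 < cy1 and y1 > cy0:
                cells.append(r * cols + c)
    return cells


def build_example(low_res_correct, high_res_correct, evidence_box):
    """Return the crop set for a tool-use trajectory (empty means answer directly)."""
    if not needs_crop(low_res_correct, high_res_correct):
        return []
    return box_to_cells(evidence_box)
```

For example, `build_example(False, True, (0.4, 0.1, 0.6, 0.3))` selects the two top cells that the box straddles, while a query already answered correctly at low resolution yields an empty crop set and therefore a single-turn trajectory.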
Problem

Research questions and friction points this paper is trying to address.

vision-language models
accuracy-efficiency trade-off
high-resolution crops
computational efficiency
visual information loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial-on-demand
high-resolution crops retrieval
vision-language models
tool-calling
multi-turn GRPO