🤖 AI Summary
Large Vision-Language Models (VLMs) incur substantial computational overhead and latency due to high-resolution image inputs, where visual tokens constitute 97–99% of total tokens—despite most downstream tasks requiring far lower resolution. To address this inefficiency, we propose a context-aware dynamic resolution selection framework: (1) a lightweight preprocessor predicts the minimal effective resolution required per input; (2) convergence-aware criteria unify discrete training with continuous inference; and (3) classification is learned via multi-resolution response consistency, enabling continuous interpolation at inference time. Evaluated on five mainstream multimodal benchmarks across multiple compact VLM architectures, our method reduces visual computation by up to 80% with no performance degradation. Our core contribution is the first formulation of resolution selection as a learnable, generalizable, context-aware decision process—enabling adaptive, input-dependent resolution control without architectural modification or fine-tuning.
📝 Abstract
Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This inflates visual tokens, often to 97–99% of total tokens, resulting in high compute and latency, even when low-resolution images would suffice. We introduce *CARES*, a **C**ontext-**A**ware **R**esolution **S**elector: a lightweight preprocessing module that, given an image-query pair, predicts the *minimal* sufficient input resolution. CARES uses a compact VLM (350M) to extract features and predict when a target pretrained VLM's response converges to its peak ability to answer correctly. Though trained as a discrete classifier over a set of candidate resolutions, CARES interpolates continuous resolutions at inference for fine-grained control. Across five multimodal benchmarks spanning documents and natural images, as well as diverse target VLMs, CARES preserves task performance while reducing compute by up to 80%.
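The "discrete training, continuous inference" idea can be illustrated with a minimal sketch: a classifier trained over a fixed set of candidate resolutions outputs a probability distribution, and inference recovers a continuous resolution as the probability-weighted average of the candidates. The specific resolution bins, function names, and interpolation rule below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical candidate resolution bins (assumed, not from the paper).
CANDIDATE_RES = np.array([224, 448, 672, 896, 1120], dtype=float)

def softmax(logits):
    """Numerically stable softmax over classifier logits."""
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def interpolate_resolution(logits):
    """Map discrete-classifier logits to a continuous resolution.

    Trained as a classifier over CANDIDATE_RES, but at inference the
    expected value over the softmax distribution yields a continuous
    resolution, giving fine-grained control between the trained bins.
    """
    probs = softmax(np.asarray(logits, dtype=float))
    return float(probs @ CANDIDATE_RES)
```

A confident prediction for one bin returns (approximately) that bin's resolution, while an uncertain prediction between two bins lands in between; this is one simple way a discrete classifier can drive continuous resolution control.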