🤖 AI Summary
Large Vision-Language Models (VLMs) incur substantial computational overhead and latency due to high-resolution image inputs, where visual tokens constitute 97–99% of total tokens—despite most downstream tasks requiring far lower resolution. To address this inefficiency, we propose a context-aware dynamic resolution selection framework: (1) a lightweight preprocessor predicts the minimal effective resolution required per input; (2) convergence-aware criteria unify discrete training with continuous inference; and (3) classification is learned via multi-resolution response consistency, enabling continuous interpolation at inference time. Evaluated on five mainstream multimodal benchmarks across multiple compact VLM architectures, our method reduces visual computation by up to 80% with no performance degradation. Our core contribution is the first formulation of resolution selection as a learnable, generalizable, context-aware decision process—enabling adaptive, input-dependent resolution control without architectural modification or fine-tuning.
📝 Abstract
Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This inflates visual tokens, often to 97–99% of total tokens, resulting in high compute and latency, even when low-resolution images would suffice. We introduce *CARES*, a **C**ontext-**A**ware **R**esolution **S**elector: a lightweight preprocessing module that, given an image-query pair, predicts the *minimal* sufficient input resolution. CARES uses a compact VLM (350M) to extract features and predict when a target pretrained VLM's response converges to its peak ability to answer correctly. Though trained as a discrete classifier over a set of candidate resolutions, CARES interpolates continuous resolutions at inference for fine-grained control. Across five multimodal benchmarks spanning documents and natural images, as well as diverse target VLMs, CARES preserves task performance while reducing compute by up to 80%.
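The "discrete training, continuous inference" idea can be illustrated with a minimal sketch: a classifier trained over a fixed set of candidate resolutions outputs a probability distribution, and inference recovers a continuous resolution as the probability-weighted average of the candidates. The specific resolution bins, function names, and interpolation rule below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical candidate resolution bins (assumed, not from the paper).
CANDIDATE_RES = np.array([224, 448, 672, 896, 1120], dtype=float)

def softmax(logits):
    """Numerically stable softmax over classifier logits."""
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def interpolate_resolution(logits):
    """Map discrete-classifier logits to a continuous resolution.

    Trained as a classifier over CANDIDATE_RES, but at inference the
    expected value over the softmax distribution yields a continuous
    resolution, giving fine-grained control between the trained bins.
    """
    probs = softmax(np.asarray(logits, dtype=float))
    return float(probs @ CANDIDATE_RES)
```

A confident prediction for one bin returns (approximately) that bin's resolution, while an uncertain prediction between two bins lands in between; this is one simple way a discrete classifier can drive continuous resolution control.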