🤖 AI Summary
Multimodal large language models (MLLMs) exhibit limited performance on fine-grained localization and reasoning over high-resolution images, primarily due to a resolution mismatch between training and inference: fixed-resolution fine-tuning harms generalization, while direct downsampling discards critical visual details. To address this, the paper proposes Extract Candidate then Predict (ECP), a training-free, task-agnostic two-stage enhancement framework. First, coarse-grained predictions are generated from downsampled images, implicitly identifying candidate regions of interest. Second, these regions are refined via localized reasoning on the original high-resolution image. By leveraging the implicit spatial cues in coarse predictions to adaptively integrate high-resolution details, the method generalizes across tasks and resolutions. Evaluated on 4K GUI grounding and 4K/8K multimodal perception benchmarks, it achieves absolute improvements of +21.3%, +5.8%, and +5.2% over the baseline, respectively.
📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision-language understanding, reasoning, and generation. However, they struggle with tasks requiring fine-grained localization and reasoning in high-resolution images. This constraint stems from the fact that MLLMs are fine-tuned at a fixed image resolution to align with the pre-trained image encoder used in the MLLM. Consequently, feeding high-resolution images directly into MLLMs leads to poor generalization due to a train-test resolution discrepancy, while downsampling these images, although ensuring consistency, compromises fine-grained visual details and ultimately degrades performance. To address this challenge, we propose Extract Candidate then Predict (ECP), a novel training-free, task-agnostic two-stage framework designed to enhance MLLM performance on high-resolution images. The key intuition behind ECP is that while MLLMs struggle with high-resolution images, their predictions on downsampled images still contain implicit localization cues. By first identifying a candidate region from the coarse prediction and then predicting the final output based on that candidate region, ECP effectively preserves fine-grained details while mitigating the challenges posed by high-resolution data. We validate our framework on 4K GUI grounding and 4K and 8K MLLM perception benchmarks, achieving +21.3%, +5.8%, and +5.2% absolute improvements over the baseline, respectively, demonstrating its effectiveness. Code is available at https://github.com/yenncye/ECP.
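The two-stage idea in the abstract can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: `downsample`, `crop`, and the margin-based candidate box are assumed stand-ins, and the two predictor callables model the MLLM's coarse and fine passes.

```python
# Hedged sketch of Extract Candidate then Predict (ECP).
# An "image" here is a 2D list of pixel values; the predictors are
# hypothetical callables standing in for MLLM inference calls.

def downsample(image, factor):
    """Nearest-neighbor downsampling: keep every `factor`-th row/column."""
    return [row[::factor] for row in image[::factor]]

def crop(image, box):
    """Extract the (left, top, right, bottom) region of a 2D pixel grid."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]

def ecp(image, coarse_predict, fine_predict, factor=4, margin=8):
    """Stage 1: predict coarsely on a downsampled view to locate a
    candidate region. Stage 2: predict finally on the original-resolution
    crop of that region, preserving fine-grained detail."""
    h, w = len(image), len(image[0])
    # Stage 1: coarse prediction on the low-resolution view.
    small = downsample(image, factor)
    x, y = coarse_predict(small)          # coarse (x, y) in small coords
    # Map the coarse cue back to full resolution and expand a candidate box.
    cx, cy = x * factor, y * factor
    box = (max(0, cx - margin), max(0, cy - margin),
           min(w, cx + margin), min(h, cy + margin))
    # Stage 2: fine-grained prediction inside the high-resolution crop.
    return fine_predict(crop(image, box)), box
```

In a real pipeline both predictors would be the same MLLM queried with the task prompt; here the box offset returned alongside the fine prediction lets the caller map the local answer back to global image coordinates.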