A Training-Free, Task-Agnostic Framework for Enhancing MLLM Performance on High-Resolution Images

📅 2025-07-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) exhibit limited performance on fine-grained localization and reasoning over high-resolution images, primarily due to resolution mismatch between training and inference: fixed-resolution fine-tuning harms generalization, while direct downsampling discards critical visual details. To address this, we propose a training-free, task-agnostic two-stage enhancement framework. First, coarse-grained predictions are generated from downsampled images, implicitly identifying candidate regions of interest. Second, these regions are refined via localized reasoning on the original high-resolution image. Crucially, our method is the first to leverage implicit spatial cues from coarse predictions to adaptively integrate high-resolution details—enabling robust cross-task and cross-resolution generalization. Evaluated on 4K GUI localization and 4K/8K multimodal perception benchmarks, it achieves absolute improvements of +21.3%, +5.8%, and +5.2% over strong baselines, consistently outperforming prior approaches.

📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision-language understanding, reasoning, and generation. However, they struggle with tasks requiring fine-grained localization and reasoning in high-resolution images. This constraint stems from the fact that MLLMs are fine-tuned at a fixed image resolution to align with the pre-trained image encoder used in the MLLM. Consequently, feeding high-resolution images directly into MLLMs leads to poor generalization due to a train-test resolution discrepancy, while downsampling these images, although ensuring consistency, compromises fine-grained visual details and ultimately degrades performance. To address this challenge, we propose Extract Candidate then Predict (ECP), a novel training-free, task-agnostic two-stage framework designed to enhance MLLM performance on high-resolution images. The key intuition behind ECP is that while MLLMs struggle with high-resolution images, their predictions on downsampled images still contain implicit localization cues. By first identifying a candidate region using the coarse prediction and then predicting the final output based on that region, ECP effectively preserves fine-grained details while mitigating the challenges posed by high-resolution data. We validate our framework on 4K GUI grounding and 4K and 8K MLLM perception, achieving +21.3%, +5.8%, and +5.2% absolute improvements over the baseline, respectively, demonstrating its effectiveness. Code is available at https://github.com/yenncye/ECP.
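The two-stage idea above reduces to coordinate bookkeeping around two MLLM calls: a coarse prediction on the downsampled image selects a candidate crop in the original image, and a second prediction inside that crop is mapped back to full-image coordinates. The sketch below illustrates only that bookkeeping; the function names, the fixed-size candidate window, and the clamping policy are illustrative assumptions, not the paper's exact implementation.

```python
def candidate_box(coarse_xy, down_size, full_size, crop_size):
    """Stage 1 (sketch): map a coarse prediction made on the downsampled
    image to a candidate crop box on the original high-resolution image."""
    # Rescale the coarse point to full-resolution coordinates.
    cx = coarse_xy[0] * full_size[0] / down_size[0]
    cy = coarse_xy[1] * full_size[1] / down_size[1]
    # Center a fixed-size candidate window on it, clamped to image bounds.
    x0 = min(max(cx - crop_size[0] / 2, 0), full_size[0] - crop_size[0])
    y0 = min(max(cy - crop_size[1] / 2, 0), full_size[1] - crop_size[1])
    return (x0, y0, x0 + crop_size[0], y0 + crop_size[1])


def to_full_coords(local_xy, box):
    """Stage 2 (sketch): translate a refined prediction made inside the
    crop back into full-image coordinates."""
    return (box[0] + local_xy[0], box[1] + local_xy[1])
```

For example, a coarse point at (112, 112) on a 224x224 downsampled view of a 4K (3840x2160) screenshot yields a 1024x1024 candidate window centered at (1920, 1080), and a refinement inside that crop is offset by the window origin to recover full-image coordinates.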
Problem

Research questions and friction points this paper is trying to address.

Enhancing MLLM performance on high-resolution images
Addressing fine-grained localization in high-resolution images
Mitigating train-test resolution discrepancy in MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free, task-agnostic framework for MLLMs
Two-stage Extract Candidate then Predict (ECP) method
Preserves fine-grained details in high-resolution images
👥 Authors
Jaeseong Lee (KAIST) · Deep Learning, Computer Vision, Computer Graphics
Yeeun Choi (Yonsei University)
Heechan Choi (Yonsei University)
Hanjung Kim (Yonsei University)
Seonjoo Kim (Yonsei University)