🤖 AI Summary
Multimodal large language models (MLLMs) exhibit limited performance on fine-grained localization and reasoning over high-resolution images, primarily due to a resolution mismatch between training and inference: fixed-resolution fine-tuning harms generalization, while direct downsampling discards critical visual details. To address this, the paper proposes Extract Candidate then Predict (ECP), a training-free, task-agnostic two-stage enhancement framework. First, coarse-grained predictions are generated from downsampled images, implicitly identifying candidate regions of interest. Second, these regions are refined via localized reasoning on the original high-resolution image. By leveraging the implicit spatial cues in coarse predictions to adaptively integrate high-resolution details, the method generalizes across tasks and resolutions. Evaluated on 4K GUI grounding and 4K/8K multimodal perception benchmarks, it achieves absolute improvements of +21.3%, +5.8%, and +5.2% over the baseline, respectively.
📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision-language understanding, reasoning, and generation. However, they struggle with tasks requiring fine-grained localization and reasoning in high-resolution images. This constraint stems from the fact that MLLMs are fine-tuned at a fixed image resolution to align with the pre-trained image encoder used in the MLLM. Consequently, feeding high-resolution images directly into MLLMs leads to poor generalization due to a train-test resolution discrepancy, while downsampling these images, although ensuring consistency, compromises fine-grained visual details and ultimately degrades performance. To address this challenge, we propose Extract Candidate then Predict (ECP), a novel training-free, task-agnostic two-stage framework designed to enhance MLLM performance on high-resolution images. The key intuition behind ECP is that while MLLMs struggle with high-resolution images, their predictions on downsampled images still contain implicit localization cues. By first identifying a candidate region from the coarse prediction and then predicting the final output based on that candidate region, ECP effectively preserves fine-grained details while mitigating the challenges posed by high-resolution data. We validate our framework on 4K GUI grounding and 4K and 8K MLLM perception benchmarks, achieving +21.3%, +5.8%, and +5.2% absolute improvements over the baseline, respectively, demonstrating its effectiveness. Code is available at https://github.com/yenncye/ECP.
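The two-stage idea in the abstract can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: `downsample`, `crop`, and the margin-based candidate box are assumed stand-ins, and the two predictor callables model the MLLM's coarse and fine passes.

```python
# Hedged sketch of Extract Candidate then Predict (ECP).
# An "image" here is a 2D list of pixel values; the predictors are
# hypothetical callables standing in for MLLM inference calls.

def downsample(image, factor):
    """Nearest-neighbor downsampling: keep every `factor`-th row/column."""
    return [row[::factor] for row in image[::factor]]

def crop(image, box):
    """Extract the (left, top, right, bottom) region of a 2D pixel grid."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]

def ecp(image, coarse_predict, fine_predict, factor=4, margin=8):
    """Stage 1: predict coarsely on a downsampled view to locate a
    candidate region. Stage 2: predict finally on the original-resolution
    crop of that region, preserving fine-grained detail."""
    h, w = len(image), len(image[0])
    # Stage 1: coarse prediction on the low-resolution view.
    small = downsample(image, factor)
    x, y = coarse_predict(small)          # coarse (x, y) in small coords
    # Map the coarse cue back to full resolution and expand a candidate box.
    cx, cy = x * factor, y * factor
    box = (max(0, cx - margin), max(0, cy - margin),
           min(w, cx + margin), min(h, cy + margin))
    # Stage 2: fine-grained prediction inside the high-resolution crop.
    return fine_predict(crop(image, box)), box
```

In a real pipeline both predictors would be the same MLLM queried with the task prompt; here the box offset returned alongside the fine prediction lets the caller map the local answer back to global image coordinates.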