HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) exhibit degraded performance on high-resolution images—not primarily due to difficulty in recognizing small objects, but rather because complex backgrounds disrupt visual attention allocation. This work identifies background interference as the fundamental bottleneck and proposes HiDe, a training-free hierarchical decoupling framework. HiDe introduces Token-wise Attention Decoupling (TAD) to precisely localize task-critical visual tokens, and Layout-Preserving Decoupling (LPD) to separate foreground objects from background while strictly preserving spatial layout—enabling efficient visual representation reconstruction. Evaluated on V*Bench and HRBench4K/8K, HiDe achieves state-of-the-art results: 92.1% accuracy with Qwen2.5-VL 7B and 91.6% with InternVL3 8B, while reducing memory overhead by 75%. This is the first study to explicitly diagnose background interference as the core limitation and to address it via structure-aware, token-level visual decoupling without fine-tuning.
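The core of TAD, as described above, is to use the attention weights of key question tokens to localize task-critical visual tokens. The paper's actual implementation is not shown here; the following is a minimal numpy sketch of that idea under assumptions (the function name `select_key_visual_tokens`, the mean-over-question-tokens aggregation, and the top-k selection are illustrative, not the authors' exact procedure):

```python
import numpy as np

def select_key_visual_tokens(attn, grid_hw, top_k=8):
    """Pick the visual tokens that receive the most attention from
    question tokens, and map them back to patch-grid coordinates.

    attn:    (num_question_tokens, num_visual_tokens) attention weights
    grid_hw: (H, W) patch grid shape, with H * W == num_visual_tokens
    """
    h, w = grid_hw
    # Aggregate attention over question tokens, one score per visual token.
    scores = attn.mean(axis=0)               # shape (H*W,)
    top = np.argsort(scores)[::-1][:top_k]   # indices of the top-k tokens
    coords = [(i // w, i % w) for i in top]  # (row, col) on the patch grid
    return top, coords

# Toy example: 4 question tokens attending over a 6x6 patch grid.
rng = np.random.default_rng(0)
attn = rng.random((4, 36))
attn[:, 14] += 5.0  # make the token at grid cell (2, 2) clearly dominant
idx, coords = select_key_visual_tokens(attn, (6, 6), top_k=3)
print(coords[0])  # → (2, 2)
```

The returned coordinates would then indicate which image regions to keep and which background regions to decouple away.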

📝 Abstract
Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding tasks. However, their performance on high-resolution images remains suboptimal. While existing approaches often attribute this limitation to perceptual constraints and argue that MLLMs struggle to recognize small objects, leading them to use "zoom in" strategies for better detail, our analysis reveals a different cause: the main issue is not object size but complex background interference. We systematically analyze this "zoom in" operation through a series of decoupling experiments and propose the Hierarchical Decoupling Framework (HiDe), a training-free framework that uses Token-wise Attention Decoupling (TAD) to decouple the question tokens and identify the key information tokens, then leverages their attention weights to achieve precise alignment with the target visual regions. Subsequently, it employs Layout-Preserving Decoupling (LPD) to decouple these regions from the background and reconstructs a compact representation that preserves essential spatial layouts while eliminating background interference. HiDe sets a new SOTA on V*Bench, HRBench4K, and HRBench8K, boosting Qwen2.5-VL 7B and InternVL3 8B to SOTA (92.1% and 91.6% on V*Bench), even surpassing RL methods. After optimization, HiDe uses 75% less memory than the previous training-free approach. Code is available at https://github.com/Tennine2077/HiDe.
Problem

Research questions and friction points this paper is trying to address.

Addresses background interference in high-resolution MLLM visual understanding
Proposes hierarchical decoupling to isolate key visual regions from backgrounds
Enables training-free precise alignment while reducing memory consumption
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Decoupling Framework eliminates background interference
Token-wise Attention Decoupling aligns tokens with target regions
Layout-Preserving Decoupling reconstructs compact spatial representations
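The LPD idea in the last bullet, separating the selected regions from the background while keeping their relative positions, can be sketched as a bounding-box crop over the kept patches. This is a minimal illustration, not the paper's implementation; `layout_preserving_crop` and the 16-pixel patch size are assumptions for the example:

```python
import numpy as np

def layout_preserving_crop(image, keep_mask, patch=16):
    """Crop an image to the bounding box of the kept patches,
    discarding surrounding background while preserving the relative
    spatial layout of the kept regions.

    image:     (H, W, C) array
    keep_mask: (H // patch, W // patch) boolean patch-level mask
    """
    rows, cols = np.nonzero(keep_mask)
    r0, r1 = rows.min(), rows.max() + 1  # patch-row extent (exclusive end)
    c0, c1 = cols.min(), cols.max() + 1  # patch-col extent (exclusive end)
    return image[r0 * patch : r1 * patch, c0 * patch : c1 * patch]

# Toy example: a 128x128 image with two foreground patches kept.
img = np.zeros((128, 128, 3), dtype=np.uint8)
mask = np.zeros((8, 8), dtype=bool)
mask[2, 2] = mask[5, 6] = True  # foreground at patches (2,2) and (5,6)
crop = layout_preserving_crop(img, mask, patch=16)
print(crop.shape)  # → (64, 80, 3): rows 2..5 and cols 2..6 of the grid
```

A tighter reconstruction could drop interior background patches as well, but even this simple crop shows how removing surrounding background shrinks the representation without distorting the spatial relationship between the kept regions.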