Constructive Distortion: Improving MLLMs with Attention-Guided Image Warping

📅 2025-10-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) frequently make fine-grained perceptual grounding errors in complex scenes because they under-attend to small-scale details and spatial relationships. To address this, the paper proposes attention-guided image warping: at inference time, a parameter-free, axis-aligned (rectilinear) remapping of the input image, driven by the model's own cross-modal attention heatmaps, allocates more resolution to salient regions while preserving global context and the image's full information content. Crucially, the approach changes only the visual input distribution, without altering the model architecture or introducing trainable parameters. Evaluated across five benchmarks and four state-of-the-art MLLMs, it consistently improves accuracy and compositional reasoning, outperforms four categories of test-time image preprocessing baselines, and mitigates hallucination.

📝 Abstract
Multimodal large language models (MLLMs) often miss small details and spatial relations in cluttered scenes, leading to errors in fine-grained perceptual grounding. We introduce AttWarp, a lightweight method that allocates more resolution to query-relevant content while compressing less informative areas, all while preserving global context. At test time, the approach uses an MLLM's cross-modal attention to perform rectilinear warping of the input image, reallocating spatial resolution toward regions the model deems important, without changing model weights or architecture. This attention-guided warping preserves all original image information but redistributes it non-uniformly, so small objects and subtle relationships become easier for the same model to read while the global layout remains intact. Across five benchmarks (TextVQA, GQA, DocVQA, POPE, MMMU) and four MLLMs (LLaVA, Qwen-VL, InternVL, and InstructBLIP), AttWarp consistently improves accuracy, strengthens compositional reasoning, and reduces hallucinations, outperforming four competitive baselines that manipulate raw images at test time. Together, these results show that attention-guided warping prioritizes information relevant to the query while preserving context, and that the same MLLMs perform better when given such warped inputs.
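The abstract does not give AttWarp's exact formulation, but the core idea of an attention-guided rectilinear warp can be sketched as follows: collapse the attention map into per-row and per-column marginals, blend each with a uniform floor so no region is discarded, and use the inverse CDF of each marginal to decide which source rows/columns each output row/column samples from. High-attention rows and columns then occupy more of the output grid. The function name, the `floor` parameter, and the nearest-neighbor resampling are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def attention_warp(image, attn, floor=0.3):
    """Axis-aligned (rectilinear) warp sketch: rows/columns with high
    attention mass get more output resolution; low-attention ones are
    compressed. `floor` blends in a uniform density so that no region
    collapses entirely, preserving global context. (Illustrative, not
    the paper's exact method.)"""
    H, W = attn.shape
    # Per-axis attention marginals, mixed with a uniform floor.
    row = attn.sum(axis=1)
    col = attn.sum(axis=0)
    row = (1 - floor) * row / row.sum() + floor / H
    col = (1 - floor) * col / col.sum() + floor / W
    # CDFs map source coordinates to warped coordinates; we invert
    # them (via interpolation) to find, for each uniformly spaced
    # output row/column, the source row/column it should sample.
    row_cdf = np.concatenate([[0.0], np.cumsum(row)])
    col_cdf = np.concatenate([[0.0], np.cumsum(col)])
    ys = np.interp(np.linspace(0.0, 1.0, H), row_cdf, np.arange(H + 1))
    xs = np.interp(np.linspace(0.0, 1.0, W), col_cdf, np.arange(W + 1))
    yi = np.clip(ys.astype(int), 0, H - 1)
    xi = np.clip(xs.astype(int), 0, W - 1)
    # Nearest-neighbor resample along both axes (channels pass through).
    return image[np.ix_(yi, xi)]
```

With attention concentrated on a small region, the corresponding rows and columns are repeated in the output while the rest are thinned, which matches the abstract's description of redistributing resolution non-uniformly while keeping the overall layout intact. A production version would use smooth, differentiable resampling (e.g. bilinear grid sampling) rather than integer indexing.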
Problem

Research questions and friction points this paper is trying to address.

MLLMs miss small details and spatial relations in cluttered visual scenes
Fine-grained perceptual grounding errors lead to weak spatial reasoning and hallucinations
How to redistribute image resolution toward query-relevant regions without modifying the model architecture or weights
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention-guided rectilinear warping redistributes image resolution toward query-relevant regions
Warp preserves global context and all original image information while enhancing small details
Uses the model's own cross-modal attention at test time, without changing model weights or architecture