🤖 AI Summary
Large vision-language models (LVLMs) suffer significant performance degradation in multi-image understanding due to cross-image information leakage—unintended visual cue contamination across images in a batch. This paper provides the first systematic characterization of this issue and proposes FOCUS, a training-free, architecture-agnostic decoding strategy. FOCUS mitigates visual cue mixing during inference by (i) applying random noise masking to each image individually, (ii) performing independent per-image reasoning, (iii) aggregating logits, and (iv) refining predictions via noise-reference contrastive calibration. Crucially, FOCUS modifies only the decoding process—leaving model architecture and parameters unchanged. Evaluated across four major multi-image benchmarks—including MMBench-MultiImage and Multi-Image MMLU—and diverse LVLMs (e.g., Qwen-VL, LLaVA-OneVision, InternVL), FOCUS consistently improves accuracy by 4.2–8.7 percentage points, demonstrating its effectiveness, broad applicability, and strong generalization across models and tasks.
📝 Abstract
Large Vision-Language Models (LVLMs) demonstrate strong performance on single-image tasks. However, we observe that their performance degrades significantly when handling multi-image inputs. This occurs because visual cues from different images become entangled in the model's output. We refer to this phenomenon as cross-image information leakage. To address this issue, we propose FOCUS, a training-free and architecture-agnostic decoding strategy that mitigates cross-image information leakage during inference. FOCUS sequentially masks all but one image with random noise, guiding the model to focus on the single clean image. We repeat this process across all target images to obtain logits under partially masked contexts. These logits are aggregated and then contrastively refined using a noise-only reference input, which suppresses the leakage and yields more accurate outputs. FOCUS consistently improves performance across four multi-image benchmarks and diverse LVLM families. This demonstrates that FOCUS offers a general and practical solution for enhancing multi-image reasoning without additional training or architectural modifications.
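The four-step decoding procedure described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `logits_fn` is a hypothetical stand-in for one LVLM decoding step, `alpha` is an assumed contrast-strength hyperparameter, and the mean aggregation and `(1 + alpha) * agg - alpha * ref` calibration form are common contrastive-decoding choices assumed here, since the paper's exact formulas are not reproduced in this summary.

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_like(img):
    # Replace an image with random noise of the same shape
    # (assumption: uniform noise stands in for the paper's noise masking).
    return rng.uniform(0.0, 1.0, size=img.shape)

def focus_decode(images, logits_fn, alpha=1.0):
    """Sketch of FOCUS decoding.

    images    : list of arrays, the multi-image input
    logits_fn : callable(list_of_images) -> logits over the vocabulary
                (hypothetical API for one LVLM forward pass)
    alpha     : contrast strength (assumed hyperparameter)
    """
    per_image_logits = []
    for i in range(len(images)):
        # (i) + (ii): mask every image except the i-th with noise,
        # then reason over the single clean image
        masked = [img if j == i else noise_like(img)
                  for j, img in enumerate(images)]
        per_image_logits.append(logits_fn(masked))
    # (iii): aggregate logits across the per-image passes (mean here)
    agg = np.mean(per_image_logits, axis=0)
    # (iv): contrast against an all-noise reference input to
    # suppress leaked visual cues
    ref = logits_fn([noise_like(img) for img in images])
    return (1.0 + alpha) * agg - alpha * ref

# Toy stand-in for an LVLM decoding step: logits from pooled pixel stats.
def toy_logits_fn(imgs, vocab=5):
    feats = np.concatenate([img.mean(axis=(0, 1)) for img in imgs])
    w = np.linspace(-1.0, 1.0, vocab)[:, None]
    return (w * feats.mean()).ravel()

images = [rng.uniform(size=(8, 8, 3)) for _ in range(3)]
calibrated = focus_decode(images, toy_logits_fn)
```

Because FOCUS touches only the decoding step, a real integration would swap `toy_logits_fn` for the model's next-token logit computation and leave weights and architecture untouched.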