🤖 AI Summary
Large vision-language models (LVLMs) suffer significant performance degradation in multi-image understanding due to cross-image information leakage—unintended visual cue contamination across images in a batch. This paper provides the first systematic characterization of this issue and proposes FOCUS, a training-free, architecture-agnostic decoding strategy. FOCUS mitigates visual cue mixing during inference by (i) applying random noise masking to each image individually, (ii) performing independent per-image reasoning, (iii) aggregating logits, and (iv) refining predictions via noise-reference contrastive calibration. Crucially, FOCUS modifies only the decoding process—leaving model architecture and parameters unchanged. Evaluated across four major multi-image benchmarks—including MMBench-MultiImage and Multi-Image MMLU—and diverse LVLMs (e.g., Qwen-VL, LLaVA-OneVision, InternVL), FOCUS consistently improves accuracy by 4.2–8.7 percentage points, demonstrating its effectiveness, broad applicability, and strong generalization across models and tasks.
📝 Abstract
Large Vision-Language Models (LVLMs) demonstrate strong performance on single-image tasks. However, we observe that their performance degrades significantly when handling multi-image inputs. This occurs because visual cues from different images become entangled in the model's output. We refer to this phenomenon as cross-image information leakage. To address this issue, we propose FOCUS, a training-free and architecture-agnostic decoding strategy that mitigates cross-image information leakage during inference. FOCUS sequentially masks all but one image with random noise, guiding the model to focus on the single clean image. We repeat this process across all target images to obtain logits under partially masked contexts. These logits are aggregated and then contrastively refined using a noise-only reference input, which suppresses the leakage and yields more accurate outputs. FOCUS consistently improves performance across four multi-image benchmarks and diverse LVLM families. This demonstrates that FOCUS offers a general and practical solution for enhancing multi-image reasoning without additional training or architectural modifications.
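The four-step decoding procedure described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `logits_fn` is a hypothetical stand-in for one LVLM decoding step, `alpha` is an assumed contrast-strength hyperparameter, and the mean aggregation and `(1 + alpha) * agg - alpha * ref` calibration form are common contrastive-decoding choices assumed here, since the paper's exact formulas are not reproduced in this summary.

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_like(img):
    # Replace an image with random noise of the same shape
    # (assumption: uniform noise stands in for the paper's noise masking).
    return rng.uniform(0.0, 1.0, size=img.shape)

def focus_decode(images, logits_fn, alpha=1.0):
    """Sketch of FOCUS decoding.

    images    : list of arrays, the multi-image input
    logits_fn : callable(list_of_images) -> logits over the vocabulary
                (hypothetical API for one LVLM forward pass)
    alpha     : contrast strength (assumed hyperparameter)
    """
    per_image_logits = []
    for i in range(len(images)):
        # (i) + (ii): mask every image except the i-th with noise,
        # then reason over the single clean image
        masked = [img if j == i else noise_like(img)
                  for j, img in enumerate(images)]
        per_image_logits.append(logits_fn(masked))
    # (iii): aggregate logits across the per-image passes (mean here)
    agg = np.mean(per_image_logits, axis=0)
    # (iv): contrast against an all-noise reference input to
    # suppress leaked visual cues
    ref = logits_fn([noise_like(img) for img in images])
    return (1.0 + alpha) * agg - alpha * ref

# Toy stand-in for an LVLM decoding step: logits from pooled pixel stats.
def toy_logits_fn(imgs, vocab=5):
    feats = np.concatenate([img.mean(axis=(0, 1)) for img in imgs])
    w = np.linspace(-1.0, 1.0, vocab)[:, None]
    return (w * feats.mean()).ravel()

images = [rng.uniform(size=(8, 8, 3)) for _ in range(3)]
calibrated = focus_decode(images, toy_logits_fn)
```

Because FOCUS touches only the decoding step, a real integration would swap `toy_logits_fn` for the model's next-token logit computation and leave weights and architecture untouched.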