Looking Back and Forth: Cross-Image Attention Calibration and Attentive Preference Learning for Multi-Image Hallucination Mitigation

📅 2026-03-07
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the prevalent hallucination issues that current large vision-language models exhibit on multi-image tasks, which stem primarily from the locality of attention mechanisms and insufficient cross-image modeling capability. To mitigate this, the authors propose a co-optimization framework that combines architectural and training-level innovations. At the architecture level, a selectable image-token interaction attention mechanism enables fine-grained cross-image alignment. At the training level, a cross-image contrastive preference learning strategy reinforces the model's reliance on authentic visual evidence. The approach is presented as the first effort to jointly suppress multi-image hallucinations through coordinated structural and objective design, achieving significant gains across diverse multi-image tasks while maintaining or slightly improving single-image performance, thereby demonstrating strong generalization capability.
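The summary's architectural piece, a selectable image-token interaction attention mechanism, can be pictured as a switchable attention mask over the image-token spans: when interaction is enabled, tokens of one image may attend to tokens of the others; when disabled, each image is visible only to itself. The sketch below is purely illustrative (the function name, span representation, and mask construction are assumptions, not the paper's implementation):

```python
import numpy as np

def image_token_mask(image_spans, cross_image=True):
    """Boolean attention mask over image tokens (illustrative sketch).

    image_spans: list of (start, end) token-index ranges, one per image.
    cross_image=True  -> tokens may attend across images (full interaction);
    cross_image=False -> block-diagonal mask (images mutually invisible).
    """
    n = max(end for _, end in image_spans)
    mask = np.zeros((n, n), dtype=bool)
    for s, e in image_spans:
        mask[s:e, s:e] = True  # within-image attention is always allowed
    if cross_image:
        for s1, e1 in image_spans:
            for s2, e2 in image_spans:
                mask[s1:e1, s2:e2] = True  # enable cross-image information flow
    return mask

# Two images of 3 tokens each
full = image_token_mask([(0, 3), (3, 6)], cross_image=True)
blocked = image_token_mask([(0, 3), (3, 6)], cross_image=False)
```

In a transformer, such a mask would typically be passed to the attention operator so that disallowed query-key pairs are assigned negative-infinity logits before the softmax; the "selectable" aspect amounts to toggling `cross_image` per layer or per forward pass.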

πŸ“ Abstract
Although large vision-language models (LVLMs) have demonstrated remarkable capabilities, they are prone to hallucinations in multi-image tasks. We attribute this issue to limitations in existing attention mechanisms and insufficient cross-image modeling. Inspired by this, we propose a structured hallucination mitigation framework involving Cross-Image Attention calibration and Preference Learning (CAPL). CAPL explicitly enhances inter-image interactions at the architectural level while reinforcing reliance on genuine cross-image evidence during training, thereby improving the model's perception and modeling of cross-image associations. Specifically, we (i) introduce a selectable image token interaction attention mechanism to establish fine-grained cross-image entity alignment and information flow; (ii) design a cross-image modeling-based preference optimization strategy that contrasts reasoning outcomes under full inter-image interaction and those obtained when images are mutually invisible, encouraging the model to ground its predictions in authentic visual evidence and mitigating erroneous inferences driven by textual priors. Experimental results demonstrate that CAPL consistently improves performance across multiple model architectures, achieving stable gains on both multi-image hallucination and general benchmarks. Notably, performance on single-image visual tasks remains stable or slightly improves, indicating strong generalization capability.
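The abstract's second component contrasts reasoning outcomes under full inter-image interaction with those obtained when images are mutually invisible, and optimizes a preference between them. A minimal sketch of one plausible form of such an objective, a DPO-style logistic loss over the two conditions' answer log-probabilities, is shown below (the function, its arguments, and the exact loss form are assumptions for illustration, not the paper's stated objective):

```python
import math

def cross_image_preference_loss(logp_full, logp_blocked, beta=0.1):
    """DPO-style preference loss sketch (assumed form).

    Prefers the answer produced with full inter-image interaction over the
    answer produced when images are mutually invisible, pushing the model
    to ground predictions in genuine cross-image evidence.

    logp_full:    sequence log-probability of the answer with cross-image attention on.
    logp_blocked: sequence log-probability of the answer with images mutually invisible.
    """
    margin = beta * (logp_full - logp_blocked)
    # -log sigmoid(margin): small when the full-interaction answer is favored
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

loss_good = cross_image_preference_loss(-5.0, -9.0)  # full-context answer favored
loss_bad = cross_image_preference_loss(-9.0, -5.0)   # reversed preference, higher loss
```

Intuitively, when the model already assigns higher likelihood to the evidence-grounded (full-interaction) answer, the margin is positive and the loss is small; when textual priors dominate and the blocked-context answer is more likely, the loss grows and the gradient restores reliance on cross-image evidence.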
Problem

Research questions and friction points this paper is trying to address.

multi-image hallucination
vision-language models
cross-image modeling
attention mechanism
hallucination mitigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-image attention
hallucination mitigation
preference learning
vision-language models
attention calibration