🤖 AI Summary
To address image alignment bias and the resulting hallucinations in large vision-language models (LVLMs) caused by misaligned multimodal features, this paper proposes a method that jointly models the one-dimensional sequence order and two-dimensional spatial position of image tokens. First, it identifies an imbalance in spatial perception induced by the long-term decay of Rotary Position Encoding (RoPE). Second, it introduces a two-dimensional, multi-directional spatial decay mechanism, integrating Manhattan-distance-based causal attention to achieve more precise spatial-semantic alignment between instruction and image tokens. The approach requires no additional parameters. Evaluated on multiple hallucination and general multimodal benchmarks, it significantly reduces hallucination rates while improving the robustness and generality of cross-modal alignment. By explicitly encoding spatial structure through an interpretable, parameter-free mechanism, the method offers a scalable approach to positional modeling in LVLMs.
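To make the perception imbalance concrete, here is a minimal sketch (not the paper's code; the grid size and token layout are illustrative assumptions) of why row-major flattening makes the bottom-right image tokens the 1D-closest to the instruction tokens, and hence the least attenuated under RoPE's long-term decay:

```python
# Illustrative sketch: 1D sequence distances from the first instruction token
# to each image token, for a toy 4x4 image grid flattened row-major and
# followed immediately by the instruction tokens.
import numpy as np

H, W = 4, 4                                 # toy image token grid
n_img = H * W
instr_pos = n_img                           # instruction token sits right after the image
grid_pos = np.arange(n_img).reshape(H, W)   # row-major 1D positions of image tokens

dist_1d = instr_pos - grid_pos              # 1D sequence distance to the instruction token
print(dist_1d)
# [[16 15 14 13]
#  [12 11 10  9]
#  [ 8  7  6  5]
#  [ 4  3  2  1]]
# RoPE's long-term decay attenuates attention with this distance, so the
# bottom-right region (distances 1..4) is perceived far more strongly than
# the top-left region (distances 13..16).
```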
📝 Abstract
Hallucinations pose a significant challenge in Large Vision-Language Models (LVLMs), and misalignment between multimodal features has been identified as a key contributing factor. This paper reveals the negative impact that the long-term decay of Rotary Position Encoding (RoPE), used for positional modeling in LVLMs, has on multimodal alignment. Concretely, under long-term decay, instruction tokens perceive image tokens unevenly across the two-dimensional space, prioritizing image tokens from the bottom-right region because those tokens lie closest to the instruction tokens in the one-dimensional sequence. This biased perception leads to insufficient image-instruction interaction and suboptimal multimodal alignment; we refer to this phenomenon as image alignment bias. To give instruction tokens a more even perception of image tokens at different spatial locations, we propose MCA-LLaVA, which uses Manhattan distance to extend the long-term decay into a two-dimensional, multi-directional spatial decay. MCA-LLaVA integrates the one-dimensional sequence order and two-dimensional spatial position of image tokens for positional modeling, mitigating hallucinations by alleviating image alignment bias. Experimental results across various hallucination and general benchmarks demonstrate the effectiveness and generality of MCA-LLaVA. The code is available at https://github.com/ErikZ719/MCA-LLaVA.
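The following is a minimal sketch of the core idea under stated assumptions: the exponential decay form, the toy grid size, and the placement of the instruction token's 2D anchor are all illustrative choices, not the authors' implementation, whose actual formulation couples the decay with RoPE inside causal attention:

```python
# Minimal sketch of Manhattan-distance-based spatial decay (illustrative
# assumptions throughout; see the paper and repo for the actual formulation).
import numpy as np

H, W = 4, 4                                   # toy image token grid
rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")

# Assumption: give the instruction token a 2D anchor adjacent to the grid,
# mirroring its 1D position right after the flattened image tokens.
anchor_r, anchor_c = H - 1, W

# 2D Manhattan distance from the instruction anchor to every image token.
d_manhattan = np.abs(rows - anchor_r) + np.abs(cols - anchor_c)

# 1D sequence distance under row-major flattening, for comparison.
d_sequence = H * W - (rows * W + cols)

# Hypothetical exponential decay applied to attention scores.
gamma = 0.9
print(np.round(gamma ** d_manhattan, 3))  # decays in both rows and columns
print(np.round(gamma ** d_sequence, 3))   # decays only along the 1D order
```

In this toy grid the Manhattan variant compresses the spread of perceived distances (the top-left token sits at distance 7 instead of 16), so the decay spreads multi-directionally across the image rather than overwhelmingly favoring the bottom-right region.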