MCA-LLaVA: Manhattan Causal Attention for Reducing Hallucination in Large Vision-Language Models

📅 2025-07-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address image alignment bias and hallucination in large vision-language models (LVLMs) caused by misaligned multimodal features, this paper proposes a novel method jointly modeling sequential order and 2D spatial position. First, it identifies an imbalance in spatial perception induced by rotary position embeddings under long-range decay. Second, it introduces a 2D multi-directional spatial decay mechanism and integrates Manhattan-distance-based causal attention to achieve more precise spatial-semantic alignment between instruction and image tokens. The approach requires no additional parameters. Evaluated on multiple hallucination detection and general multimodal benchmarks, it significantly reduces hallucination rates while enhancing cross-modal alignment robustness and generalization. By explicitly encoding spatial structure through interpretable, parameter-efficient mechanisms, the method establishes a new, scalable paradigm for positional modeling in LVLMs.

📝 Abstract
Hallucinations pose a significant challenge in Large Vision Language Models (LVLMs), with misalignment between multimodal features identified as a key contributing factor. This paper reveals the negative impact of the long-term decay in Rotary Position Encoding (RoPE), used for positional modeling in LVLMs, on multimodal alignment. Concretely, under long-term decay, instruction tokens exhibit uneven perception of image tokens located at different positions within the two-dimensional space: they prioritize image tokens from the bottom-right region, since in the one-dimensional sequence these tokens are positionally closer to the instruction tokens. This biased perception leads to insufficient image-instruction interaction and suboptimal multimodal alignment. We refer to this phenomenon as image alignment bias. To enhance the instruction tokens' perception of image tokens at different spatial locations, we propose MCA-LLaVA, based on Manhattan distance, which extends the long-term decay to a two-dimensional, multi-directional spatial decay. MCA-LLaVA integrates the one-dimensional sequence order and two-dimensional spatial position of image tokens for positional modeling, mitigating hallucinations by alleviating image alignment bias. Experimental results of MCA-LLaVA across various hallucination and general benchmarks demonstrate its effectiveness and generality. The code is available at https://github.com/ErikZ719/MCA-LLaVA.
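The raster-order effect the abstract describes can be seen with a tiny numeric sketch (a hypothetical 3×3 token grid; `distance_to_instruction` is an illustrative helper, not code from the paper): when image tokens are flattened row-major and instruction tokens are appended after them, the first instruction token is 1D-closest to the bottom-right image token and farthest from the top-left one, which is the asymmetry RoPE's long-term decay then amplifies.

```python
def distance_to_instruction(grid_h, grid_w):
    """1D sequence distance from each image token to the first
    instruction token, assuming row-major flattening with instruction
    tokens appended at index grid_h * grid_w (illustrative sketch)."""
    n = grid_h * grid_w
    return [[n - (i * grid_w + j) for j in range(grid_w)]
            for i in range(grid_h)]

d = distance_to_instruction(3, 3)
# d[0][0] (top-left) is the largest distance; d[2][2] (bottom-right)
# is the smallest, so decay over 1D distance favors bottom-right tokens.
```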
Problem

Research questions and friction points this paper is trying to address.

Reducing hallucinations in Large Vision-Language Models
Addressing multimodal feature misalignment in LVLMs
Mitigating image alignment bias in positional modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Manhattan distance for spatial decay
Extends RoPE to 2D multi-directional decay
Integrates 1D sequence with 2D positions
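The 2D multi-directional decay listed above can be sketched as an additive attention bias that penalizes token pairs by their Manhattan distance on the image grid. This is a simplified illustration, not the paper's exact mechanism: MCA-LLaVA integrates the decay into RoPE-based causal attention without extra parameters, whereas the `slope` factor and the standalone-bias formulation here are assumptions made for clarity.

```python
import numpy as np

def manhattan_decay_bias(grid_h, grid_w, slope=0.1):
    """Additive attention bias over Manhattan distances between image
    tokens on a grid_h x grid_w grid (illustrative simplification)."""
    # Grid coordinates (row, col) of each image token, row-major order.
    ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    pos = np.stack([ys.ravel(), xs.ravel()], axis=1)          # (N, 2)
    # Manhattan distance |dy| + |dx| between every pair of tokens.
    dist = np.abs(pos[:, None, :] - pos[None, :, :]).sum(-1)  # (N, N)
    # Farther pairs get a larger negative bias: multi-directional decay,
    # symmetric in all four grid directions rather than 1D-sequence order.
    return -slope * dist

bias = manhattan_decay_bias(3, 3)
```

Under this sketch, a token attends equally to its four grid neighbors regardless of where they fall in the flattened 1D sequence, which is the alignment-bias fix the Manhattan formulation is after.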