🤖 AI Summary
This work identifies a pervasive positional bias in large vision-language models (LVLMs) under multi-image scenarios: a model's reasoning accuracy depends heavily on where an image sits in the input sequence. The bias appears in both open-source and proprietary LVLMs, but with distinct patterns: open-source models reason best over images placed late in the sequence, while proprietary models handle images at the beginning and end well but degrade on middle positions. The authors formally define this phenomenon and introduce Position-wise Question Answering (PQA), a benchmark designed to quantitatively measure positional bias at each image position. To mitigate it, they propose SoFt Attention (SoFA), a training-free, plug-and-play mechanism that linearly interpolates between inter-image causal attention and bidirectional attention, enabling position-aware inference. Applied at inference time, SoFA consistently reduces positional bias across diverse LVLMs, notably improving accuracy on middle-position images, without any parameter updates or training overhead.
📝 Abstract
The evolution of Large Vision-Language Models (LVLMs) has progressed from single-image to multi-image reasoning. Despite this advancement, our findings indicate that LVLMs struggle to robustly utilize information across multiple images, with predictions significantly affected by changes in image position. To explore this issue further, we introduce Position-wise Question Answering (PQA), a meticulously designed task that quantifies reasoning capability at each position. Our analysis reveals a pronounced position bias in LVLMs: open-source models excel at reasoning over images positioned later in the sequence but underperform on those in the middle or at the beginning, while proprietary models comprehend images at the beginning and end well but struggle with those in the middle. Motivated by this, we propose SoFt Attention (SoFA), a simple, training-free approach that mitigates this bias via linear interpolation between inter-image causal attention and its bidirectional counterpart. Experimental results demonstrate that SoFA reduces position bias and enhances the reasoning performance of existing LVLMs.
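The core idea, interpolating between an inter-image causal mask and a bidirectional one, can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the interpolation weight `lam`, the span-based bookkeeping of image token ranges, and the single-head setup are all assumptions made for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sofa_attention(q, k, v, image_spans, lam=0.5):
    """Sketch of SoFt Attention (SoFA) for one head.

    q, k, v: (T, d) arrays of query/key/value vectors.
    image_spans: list of (start, end) token ranges, one per image
                 (an assumed bookkeeping scheme, not from the paper).
    lam: interpolation weight; 0 recovers pure causal attention,
         1 makes attention between different images fully bidirectional.
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    causal = np.tril(np.ones((T, T)))   # standard causal mask
    bidir = np.ones((T, T))             # bidirectional mask

    # Relax causality only between tokens belonging to *different* images,
    # matching the "inter-image" scope described in the abstract.
    inter_image = np.zeros((T, T))
    for s1, e1 in image_spans:
        for s2, e2 in image_spans:
            if (s1, e1) != (s2, e2):
                inter_image[s1:e1, s2:e2] = 1.0

    # Linear interpolation: causal everywhere, shifted toward
    # bidirectional on inter-image positions.
    mask = causal + lam * inter_image * (bidir - causal)

    # Soft mask in [0, 1] applied as an additive log-mask on the scores.
    scores = scores + np.log(np.clip(mask, 1e-30, 1.0))
    return softmax(scores) @ v
```

With `lam=0` this reduces to ordinary causal attention; increasing `lam` lets earlier images attend (softly) to later ones, which is the mechanism the paper credits with lifting middle-position performance.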