🤖 AI Summary
This work identifies a pervasive positional bias in large vision-language models (LVLMs) under multi-image scenarios: a model's reasoning accuracy depends heavily on where an image sits in the input sequence. The bias appears in both open-source and proprietary LVLMs, but with distinct patterns: open-source models reason best over images placed late in the sequence, while proprietary models handle images at the beginning and end well but degrade on middle positions. The authors formally define this phenomenon and introduce Position-wise Question Answering (PQA), a benchmark designed to quantitatively measure positional bias at each image position. To mitigate it, they propose SoFt Attention (SoFA), a training-free, plug-and-play mechanism that linearly interpolates between inter-image causal attention and bidirectional attention, enabling position-aware inference. Applied at inference time, SoFA consistently reduces positional bias across diverse LVLMs, notably improving accuracy on middle-position images, without any parameter updates or training overhead.
📝 Abstract
The evolution of Large Vision-Language Models (LVLMs) has progressed from single-image to multi-image reasoning. Despite this advancement, our findings indicate that LVLMs struggle to robustly utilize information across multiple images, with predictions significantly affected by changes in image position. To explore this issue further, we introduce Position-wise Question Answering (PQA), a meticulously designed task that quantifies reasoning capability at each position. Our analysis reveals a pronounced position bias in LVLMs: open-source models excel at reasoning over images positioned later in the sequence but underperform on those in the middle or at the beginning, while proprietary models comprehend images at the beginning and end well but struggle with those in the middle. Motivated by this, we propose SoFt Attention (SoFA), a simple, training-free approach that mitigates this bias via linear interpolation between inter-image causal attention and its bidirectional counterpart. Experimental results demonstrate that SoFA reduces position bias and enhances the reasoning performance of existing LVLMs.
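The core idea, interpolating between an inter-image causal mask and a bidirectional one, can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the interpolation weight `lam`, the span-based bookkeeping of image token ranges, and the single-head setup are all assumptions made for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sofa_attention(q, k, v, image_spans, lam=0.5):
    """Sketch of SoFt Attention (SoFA) for one head.

    q, k, v: (T, d) arrays of query/key/value vectors.
    image_spans: list of (start, end) token ranges, one per image
                 (an assumed bookkeeping scheme, not from the paper).
    lam: interpolation weight; 0 recovers pure causal attention,
         1 makes attention between different images fully bidirectional.
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    causal = np.tril(np.ones((T, T)))   # standard causal mask
    bidir = np.ones((T, T))             # bidirectional mask

    # Relax causality only between tokens belonging to *different* images,
    # matching the "inter-image" scope described in the abstract.
    inter_image = np.zeros((T, T))
    for s1, e1 in image_spans:
        for s2, e2 in image_spans:
            if (s1, e1) != (s2, e2):
                inter_image[s1:e1, s2:e2] = 1.0

    # Linear interpolation: causal everywhere, shifted toward
    # bidirectional on inter-image positions.
    mask = causal + lam * inter_image * (bidir - causal)

    # Soft mask in [0, 1] applied as an additive log-mask on the scores.
    scores = scores + np.log(np.clip(mask, 1e-30, 1.0))
    return softmax(scores) @ v
```

With `lam=0` this reduces to ordinary causal attention; increasing `lam` lets earlier images attend (softly) to later ones, which is the mechanism the paper credits with lifting middle-position performance.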