When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses hallucination in vision-language models deployed in high-stakes scenarios, where the models describe content that is absent from the input image. The study identifies a geometric root cause: visual representations become overly aligned with the text manifold, so linguistic bias overshadows fine-grained visual evidence. The authors provide the first quantitative characterization of this over-alignment, showing that linguistic bias concentrates in the top principal components of a universal, dataset-agnostic text subspace. They propose two debiasing strategies that project this subspace out of the visual representations: a training-free inference-time projection and a bias-aware fine-tuning approach. Both methods significantly reduce hallucinations on the POPE, CHAIR, and AMBER benchmarks and improve CLAIR scores on long-form image captioning. The training-free variant adds no computational overhead over the base model.

📝 Abstract

Vision-Language Models (VLMs) increasingly power high-stakes applications, from medical imaging to autonomous systems, yet they routinely hallucinate, confidently describing content not present in the input. We investigate the root causes of these failure modes with a mechanistic analysis focusing on decoder-based VLMs. We trace these failure modes to a geometric over-alignment: to bridge the modality gap required by attention mechanisms, decoder-based VLMs over-align visual embeddings with the text manifold, injecting a statistical linguistic bias that systematically overshadows fine-grained visual evidence. While prior work either aggressively closes this gap or suppresses hallucinations through expensive black-box decoding strategies, none addresses the underlying geometric cause. We provide the first quantitative characterization of this over-alignment, demonstrating that linguistic bias concentrates in the top principal components of a universal, dataset-agnostic text subspace. Building on this insight, we propose two complementary remedies: a training-free inference strategy and a bias-aware fine-tuning paradigm, both of which explicitly project out this subspace from visual representations. Our methods significantly reduce hallucinations across POPE, CHAIR, and AMBER benchmarks, and improve CLAIR scores on long-form captioning tasks, with the training-free variant adding no computational overhead over the base model.
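
The idea of projecting a text subspace out of visual representations can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `top_text_components` and `project_out`, the mean-centering step, the choice of `k`, and the random toy embeddings are all assumptions made for illustration. The abstract does not state how the subspace is estimated or where in the model the projection is applied.

```python
import numpy as np


def top_text_components(text_emb: np.ndarray, k: int) -> np.ndarray:
    """Return the top-k principal directions of text embeddings.

    text_emb has shape (n, d); the result has shape (k, d) with orthonormal rows.
    Mean-centering before the SVD is an assumption of this sketch.
    """
    centered = text_emb - text_emb.mean(axis=0, keepdims=True)
    # Rows of vt are right singular vectors, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]


def project_out(visual_emb: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Remove the span of `components` from each row of visual_emb.

    Computes v - (v U^T) U for orthonormal rows U, i.e. the projection of
    each visual embedding onto the orthogonal complement of the subspace.
    """
    coeffs = visual_emb @ components.T  # (n, k) coordinates in the subspace
    return visual_emb - coeffs @ components


# Toy data standing in for real embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(200, 16))
text_emb[:, 0] *= 10.0  # give the text embeddings one dominant direction
visual_emb = rng.normal(size=(5, 16))

U = top_text_components(text_emb, k=2)
debiased = project_out(visual_emb, U)
```

After the projection, each row of `debiased` has no component along the selected text directions, and applying `project_out` a second time leaves it unchanged. How many components to remove is a tuning choice in this sketch and is not specified in the abstract.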

Problem

Research questions and friction points this paper is trying to address.

hallucination

over-alignment

vision-language models

linguistic bias

geometric debiasing

Innovation

Methods, ideas, or system contributions that make the work stand out.

geometric over-alignment

vision-language models

linguistic bias