Same Answer, Different Representations: Hidden instability in VLMs

📅 2026-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Robustness evaluation of vision-language models (VLMs) has predominantly focused on output consistency while neglecting the stability of internal representations, which can make assessments misleading. To close this gap, the authors propose an evaluation framework that integrates representation-aware and frequency-domain analyses, introducing three metrics: embedding drift, spectral sensitivity, and spatial consistency of visual tokens. Systematic experiments on SEED-Bench, MMMU, and POPE yield three key findings: internal representations can drift substantially even when outputs remain unchanged; increasing model scale does not necessarily improve robustness; and perturbations have opposing effects on reasoning versus hallucination tasks. These findings challenge prevailing evaluation paradigms and offer a new lens for assessing VLM robustness through internal dynamics.
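
The page does not reproduce the authors' formulas, but the drift metric can be illustrated with a minimal sketch: compare the vision-encoder embedding of a clean image with that of its perturbed counterpart, and normalize by the typical distance between unrelated images (the inter-image variability baseline mentioned in the abstract). The `encode_image` and `perturb` callables below are hypothetical stand-ins, not the paper's implementation.

```python
# Minimal sketch of an embedding-drift probe (hypothetical, not the paper's code).
# `encode_image` stands in for any VLM vision encoder returning a pooled
# embedding of shape (d,) for a single image tensor; `perturb` is any
# image -> image corruption (e.g., a text overlay).
import torch
import torch.nn.functional as F

def cosine_distance(a: torch.Tensor, b: torch.Tensor) -> float:
    """1 - cosine similarity between two embedding vectors."""
    return 1.0 - F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item()

def embedding_drift(encode_image, clean_imgs, perturb):
    """Mean drift under a perturbation, normalized by inter-image variability.

    clean_imgs: list of >= 2 image tensors.
    Returns (raw_drift, normalized_drift).
    """
    embs = [encode_image(img) for img in clean_imgs]
    drifts = [cosine_distance(e, encode_image(perturb(img)))
              for img, e in zip(clean_imgs, embs)]
    # Baseline: average distance between embeddings of *different* clean images.
    baseline = torch.tensor([cosine_distance(embs[i], embs[j])
                             for i in range(len(embs))
                             for j in range(i + 1, len(embs))]).mean().item()
    raw = sum(drifts) / len(drifts)
    return raw, raw / max(baseline, 1e-8)
```

A normalized drift near 1.0 would mean the perturbed embedding has moved roughly as far as an unrelated image, matching the text-overlay behavior described in the abstract.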

📝 Abstract
The robustness of Vision Language Models (VLMs) is commonly assessed through output-level invariance, implicitly assuming that stable predictions reflect stable multimodal processing. In this work, we argue that this assumption is insufficient. We introduce a representation-aware and frequency-aware evaluation framework that measures internal embedding drift, spectral sensitivity, and structural smoothness (spatial consistency of vision tokens), alongside standard label-based metrics. Applying this framework to modern VLMs across the SEED-Bench, MMMU, and POPE datasets reveals three distinct failure modes. First, models frequently preserve predicted answers while undergoing substantial internal representation drift; for perturbations such as text overlays, this drift approaches the magnitude of inter-image variability, indicating that representations move to regions typically occupied by unrelated inputs despite unchanged outputs. Second, robustness does not improve with scale; larger models achieve higher accuracy but exhibit equal or greater sensitivity, consistent with sharper yet more fragile decision boundaries. Third, we find that perturbations affect tasks differently: they harm reasoning when they disrupt how models combine coarse and fine visual cues, but on the hallucination benchmarks, they can reduce false positives by making models generate more conservative answers.
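
The abstract does not spell out the frequency-domain procedure; one plausible reading, sketched below purely as an assumption, is to filter an image into its low- or high-frequency content with an FFT mask and measure how far each filtered version moves the vision-encoder embedding. `encode_image` is again a hypothetical stand-in for a VLM's vision encoder.

```python
# Sketch of a spectral-sensitivity probe (an assumed formulation, not the authors').
import torch
import torch.nn.functional as F

def cosine_distance(a: torch.Tensor, b: torch.Tensor) -> float:
    return 1.0 - F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item()

def band_filter(img: torch.Tensor, keep_low: bool, cutoff: float = 0.1) -> torch.Tensor:
    """Keep only low (or high) spatial frequencies of a (C, H, W) image tensor."""
    _, H, W = img.shape
    spectrum = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.linspace(-0.5, 0.5, H),
                            torch.linspace(-0.5, 0.5, W), indexing="ij")
    radius = (yy ** 2 + xx ** 2).sqrt()
    mask = (radius <= cutoff) if keep_low else (radius > cutoff)
    filtered = spectrum * mask.to(img.dtype)  # zero out the other frequency band
    return torch.fft.ifft2(torch.fft.ifftshift(filtered, dim=(-2, -1))).real

def spectral_sensitivity(encode_image, img: torch.Tensor) -> dict:
    """Embedding shift when high- vs. low-frequency content is removed."""
    ref = encode_image(img)
    return {
        "low_pass_only": cosine_distance(ref, encode_image(band_filter(img, keep_low=True))),
        "high_pass_only": cosine_distance(ref, encode_image(band_filter(img, keep_low=False))),
    }
```

Under this reading, a model whose reasoning depends on combining coarse and fine cues should show large shifts when either band is removed, which is one way to connect the frequency analysis to the task-dependent effects reported above.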
Problem

Research questions and friction points this paper is trying to address.

Vision Language Models
robustness
representation drift
multimodal processing
internal instability
Innovation

Methods, ideas, or system contributions that make the work stand out. A sketch of the structural-smoothness metric follows the tag list below.

representation drift
spectral sensitivity
structural smoothness
vision-language models
robustness evaluation
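
The structural-smoothness contribution listed above is not defined precisely on this page; the sketch below assumes it measures the spatial consistency of vision tokens as the average cosine similarity between neighboring tokens on the encoder's 2-D token grid, and reports how much a perturbation lowers it. `vision_tokens`, which should return an (H, W, d) token grid, is a hypothetical helper, not the authors' code.

```python
# Sketch of a spatial-consistency ("structural smoothness") measure over vision
# tokens; an assumed total-variation-style formulation, not the paper's exact metric.
import torch
import torch.nn.functional as F

def token_smoothness(tokens: torch.Tensor) -> float:
    """Mean cosine similarity between spatially adjacent vision tokens.

    tokens: (H, W, d) grid of token embeddings from the vision encoder.
    Higher values indicate a spatially smoother (more consistent) token map.
    """
    t = F.normalize(tokens, dim=-1)
    horiz = (t[:, :-1] * t[:, 1:]).sum(-1)   # similarity to right neighbor
    vert = (t[:-1, :] * t[1:, :]).sum(-1)    # similarity to lower neighbor
    return torch.cat([horiz.flatten(), vert.flatten()]).mean().item()

def smoothness_drop(vision_tokens, img, perturb) -> float:
    """How much a perturbation reduces spatial consistency of the token grid."""
    return token_smoothness(vision_tokens(img)) - token_smoothness(vision_tokens(perturb(img)))
```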