🤖 AI Summary
Existing studies struggle to disentangle the dynamic reliance of multimodal large language models (MLLMs) on visual inputs versus world knowledge priors in visual question answering (VQA).
Method: We propose a visual counterfactual evaluation paradigm and introduce Visual CounterFact—the first benchmark dataset for visual counterfactual reasoning. We design Pixels Versus Priors (PvP) steering vectors, activation-level interventions applied at intermediate transformer layers to steer model outputs toward either pixel-level evidence or prior knowledge. Using layer-wise activation analysis and interpretability probes, we characterize how visual evidence progressively overrides priors in mid-to-late layers.
Contribution/Results: PvP successfully shifts 92.5% of color predictions and 74.6% of size predictions from prior-driven answers to answers grounded in the counterfactual visual input, significantly enhancing MLLMs’ responsiveness to actual visual evidence. Our framework enables fine-grained causal analysis of modality-specific contributions in MLLMs, offering new insights into their internal reasoning dynamics.
📝 Abstract
Multimodal Large Language Models (MLLMs) perform well on tasks such as visual question answering, but it remains unclear whether their reasoning relies more on memorized world knowledge or on the visual information present in the input image. To investigate this, we introduce Visual CounterFact, a new dataset of visually realistic counterfactuals that put world knowledge priors (e.g., red strawberry) into direct conflict with visual input (e.g., blue strawberry). Using Visual CounterFact, we show that model predictions initially reflect memorized priors, but shift toward visual evidence in mid-to-late layers. This dynamic reveals a competition between the two modalities, with visual input ultimately overriding priors during evaluation. To control this behavior, we propose Pixels Versus Priors (PvP) steering vectors, a mechanism for steering model outputs toward either world knowledge or visual input through activation-level interventions. On average, PvP successfully shifts 92.5% of color and 74.6% of size predictions from priors to counterfactuals. Together, these findings offer new tools for interpreting and controlling factual behavior in multimodal models.
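The steering-vector mechanism described in the abstract can be sketched roughly as follows. This is a minimal illustration of the common difference-of-means steering recipe, not the paper's exact procedure: the layer choice, the α values, and the synthetic "hidden states" are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Hypothetical cached hidden states at one intermediate layer, collected from
# runs where the model answers from priors vs. from the counterfactual image.
h_counterfactual = rng.normal(size=(32, d_model))  # e.g. "blue strawberry" runs
h_prior = rng.normal(size=(32, d_model))           # e.g. "red strawberry" runs

# Steering direction: difference of means, unit-normalized.
v = h_counterfactual.mean(axis=0) - h_prior.mean(axis=0)
v = v / np.linalg.norm(v)

def steer(hidden: np.ndarray, alpha: float) -> np.ndarray:
    """Shift hidden states toward pixel evidence (alpha > 0) or priors (alpha < 0)."""
    return hidden + alpha * v

h = rng.normal(size=(1, 4, d_model))  # (batch, tokens, hidden) at the layer
h_pixels = steer(h, alpha=+4.0)
h_priors = steer(h, alpha=-4.0)

# Sanity check: the intervention moves every token's activation along +v / -v.
print(np.allclose(h_pixels - h, 4.0 * v))
print(np.allclose(h_priors - h, -4.0 * v))
```

In a real MLLM this addition would happen inside the forward pass (for instance via a forward hook on the chosen transformer layer), with the sign of α selecting whether the output is pushed toward visual input or world knowledge.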