Causal Probing for Internal Visual Representations in Multimodal Large Language Models

📅 2026-05-06

🤖 AI Summary
How multimodal large language models encode diverse visual concepts remains unclear. This work proposes a causal probing framework based on activation steering to systematically intervene in internal model representations and uncover encoding mechanisms across four categories of visual concepts. The study reveals that concrete entity concepts are stored in localized neurons, whereas abstract concepts rely on global distributed representations, with model depth playing a critical role for the latter. It further identifies, for the first time, a compensatory mechanism between perception and generation, as well as a dissociation between perception and reasoning. The research also validates the impact of model scale on concept encoding and demonstrates that although geometric relationships are recognizable, they often fail to trigger procedural reasoning, leading to breakdowns in abstract reasoning tasks.
📝 Abstract
Despite the remarkable success of Multimodal Large Language Models (MLLMs) across diverse tasks, the internal mechanisms governing how they encode and ground distinct visual concepts remain poorly understood. To bridge this gap, we propose a causal framework based on activation steering to actively probe and manipulate internal visual representations. Through systematic intervention across four visual concept categories, our results reveal a divergence in concept encoding: entities exhibit distinct localized memorization, whereas abstract concepts are globally distributed across the network. Critically, this divergence uncovers a mechanistic driver of scaling laws: increasing model depth is indispensable for encoding distributed and complex abstract concepts, whereas entity localization remains remarkably invariant to scale. Furthermore, reverse steering reveals that blocking explicit output triggers a surge in latent activations, exposing a compensatory mechanism between perception and generation. Finally, extending our analysis to visual reasoning, we expose a disconnect between perception and reasoning: although MLLMs successfully recognize geometric relations, they treat them merely as static visual features, failing to trigger the procedural execution necessary for abstract problem-solving.
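As a rough illustration of what activation steering means in general, the sketch below estimates a concept direction as the difference of mean activations and then adds or subtracts it from a hidden state. This is an assumption for illustration only: the abstract does not say how the paper derives its steering vectors or where it intervenes, and the function names, the scaling factor `alpha`, and the toy numbers are all hypothetical. Plain Python lists stand in for real model activations to keep the example dependency-free.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def steering_vector(with_concept, without_concept):
    """Difference of mean activations (one common way to estimate a
    concept direction; not necessarily the method used in the paper)."""
    mu_pos = mean_vector(with_concept)
    mu_neg = mean_vector(without_concept)
    return [p - n for p, n in zip(mu_pos, mu_neg)]


def steer(hidden, direction, alpha):
    """Add alpha * direction to a hidden state.
    A negative alpha corresponds to 'reverse' steering (suppression)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def projection(hidden, direction):
    """Scalar projection of a hidden state onto the concept direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(h * d for h, d in zip(hidden, direction)) / norm


# Toy 3-dimensional activations (hypothetical values).
with_concept = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
without_concept = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]

direction = steering_vector(with_concept, without_concept)
hidden = [0.5, 1.0, -1.0]

boosted = steer(hidden, direction, alpha=2.0)      # push toward the concept
suppressed = steer(hidden, direction, alpha=-2.0)  # push away from it

print(projection(suppressed, direction),
      projection(hidden, direction),
      projection(boosted, direction))
```

In a real MLLM the same addition would typically be applied to a layer's residual-stream activations during the forward pass, and the causal effect would be read off from changes in the model's output rather than from a projection.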
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
visual representations
causal probing
concept encoding
perception-reasoning disconnect
Innovation

Methods, ideas, or system contributions that make the work stand out.

causal probing
activation steering
visual representation
scaling laws
multimodal reasoning