Line of Sight: On Linear Representations in VLLMs

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether image concepts are linearly decodable, and how representations are shared across modalities, in vision-language large models (VLLMs), using the open-source LLaVA-Next as a testbed. The authors combine linear probing, causal intervention experiments, and a multimodal sparse autoencoder (SAE) framework, identifying and validating over one hundred causally relevant, linearly decodable features in the residual stream, each corresponding to an ImageNet class. Three contributions stand out: (1) image and text representations are largely disjoint in early layers but become progressively shared in deeper layers; (2) a multimodal SAE trained jointly on image and text activations yields an interpretable cross-modal feature dictionary; (3) targeted edits along the probed directions causally change model outputs, showing that LLaVA-Next's high-level representations support linear, causal, and interpretable decoding of image concepts.
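
A minimal sketch of the linear-probing step described above: a logistic-regression probe is fit on activation vectors to decode a class label. The synthetic clusters below are illustrative stand-ins for residual-stream activations and ImageNet classes; nothing here comes from LLaVA-Next itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, n_per_class = 64, 200

# Stand-in for residual-stream activations: each "ImageNet class" is a
# Gaussian cluster around its own random direction in activation space.
directions = rng.normal(size=(2, d_model))
X = np.vstack([rng.normal(size=(n_per_class, d_model)) + 3 * directions[c]
               for c in range(2)])
y = np.repeat([0, 1], n_per_class)

# A linear probe: if the class is linearly decodable, a simple logistic
# regression on raw activations reaches high accuracy.
probe = LogisticRegression(max_iter=1000).fit(X, y)
acc = probe.score(X, y)
print(f"probe accuracy: {acc:.2f}")
```

In the paper's setting, `X` would instead hold residual-stream vectors collected from image tokens at a fixed layer, and the learned weight vector gives a candidate linear feature direction for the class.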

📝 Abstract
Language models can be equipped with multimodal capabilities by fine-tuning on embeddings of visual inputs. But how do such multimodal models represent images in their hidden activations? We explore representations of image concepts within LLaVA-Next, a popular open-source VLLM. We find a diverse set of ImageNet classes represented via linearly decodable features in the residual stream. We show that the features are causal by performing targeted edits on the model output. In order to increase the diversity of the studied linear features, we train multimodal Sparse Autoencoders (SAEs), creating a highly interpretable dictionary of text and image features. We find that although model representations across modalities are quite disjoint, they become increasingly shared in deeper layers.
Problem

Research questions and friction points this paper is trying to address.

How multimodal models represent images in their hidden activations
Whether image concepts in VLLMs such as LLaVA-Next are linearly decodable
Whether representations become shared across modalities in deeper layers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Linearly decodable ImageNet-class features in the residual stream
Causal validation of features via targeted edits to model output
Multimodal Sparse Autoencoders yielding an interpretable cross-modal feature dictionary
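
The SAE in the last point can be sketched as a one-hidden-layer autoencoder with an overcomplete ReLU bottleneck, trained to minimize reconstruction error plus an L1 sparsity penalty. The weights below are random stand-ins for parameters that would actually be learned on image and text activations; dimensions and the penalty coefficient are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_dict = 32, 128  # dictionary is overcomplete (d_dict > d_model)

# Random stand-in parameters; in practice these are learned jointly on
# residual-stream activations from both image and text tokens.
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_dec = np.zeros(d_model)

def sae(x):
    """Encode activations into sparse features, then reconstruct."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU gives nonnegative, sparse codes
    x_hat = f @ W_dec + b_dec               # linear decode back to activation space
    return f, x_hat

x = rng.normal(size=(4, d_model))  # a small batch of activations
f, x_hat = sae(x)
recon = ((x - x_hat) ** 2).mean()
l1 = np.abs(f).mean()
loss = recon + 0.01 * l1  # the usual SAE objective: reconstruction + sparsity
print(f"loss: {loss:.4f}")
```

Each column of `W_dec` is one dictionary feature; in a multimodal SAE the same dictionary serves both modalities, so a feature firing on both image and text tokens is evidence of a shared cross-modal representation.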