🤖 AI Summary
This study investigates how vision-language large models (VLLMs) represent image concepts and share representations across modalities, using the open-source LLaVA-Next as a testbed. To address these questions, we employ linear probing, causal intervention experiments, and a multimodal sparse autoencoder (SAE) framework, identifying and validating a diverse set of causally significant, linearly decodable features—each corresponding to an ImageNet class—within the residual stream. We make three key contributions: (1) We show that image and text representations, though largely disjoint across modalities, become increasingly shared in deeper layers, indicating progressive cross-modal fusion; (2) We train multimodal SAEs designed for joint image-text modeling, yielding an interpretable cross-modal feature dictionary; (3) We demonstrate that LLaVA-Next achieves strong cross-modal semantic alignment, with high-level representations supporting linear, causal, and interpretable image concept decoding.
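The linear-probing step can be illustrated with a minimal sketch. This is not the paper's code: the synthetic Gaussian activations with a planted linear "concept" direction stand in for hidden states that would, in the actual study, be extracted from LLaVA-Next's residual stream, and all names and dimensions here are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of linear probing for a concept in a residual stream.
# Synthetic activations with a planted linear direction stand in for real
# hidden states extracted from the model (all names are illustrative).
rng = np.random.default_rng(0)
d_model, n_train, n_test = 64, 300, 100

concept_dir = rng.normal(size=d_model)            # planted concept direction
X = rng.normal(size=(n_train + n_test, d_model))  # stand-in activations
y = (X @ concept_dir > 0).astype(int)             # 1 = concept present (e.g. an ImageNet class)

X_tr, y_tr = X[:n_train], y[:n_train]
X_te, y_te = X[n_train:], y[n_train:]

# Least-squares linear probe: regress +/-1 labels onto activations,
# then classify held-out activations by the sign of the projection.
w, *_ = np.linalg.lstsq(X_tr, 2.0 * y_tr - 1.0, rcond=None)
acc = ((X_te @ w > 0).astype(int) == y_te).mean()
print(f"probe test accuracy: {acc:.2f}")
```

High held-out accuracy is evidence that the concept is linearly decodable from the activations; in the paper's setting, the probe would be fit per ImageNet class on real extracted activations rather than synthetic data.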
📝 Abstract
Language models can be equipped with multimodal capabilities by fine-tuning on embeddings of visual inputs. But how do such multimodal models represent images in their hidden activations? We explore representations of image concepts within LLaVA-Next, a popular open-source VLLM. We find a diverse set of ImageNet classes represented via linearly decodable features in the residual stream. We show that these features are causal by performing targeted edits on them and measuring the effect on model output. To increase the diversity of the studied linear features, we train multimodal Sparse Autoencoders (SAEs), creating a highly interpretable dictionary of text and image features. We find that although model representations across modalities are quite disjoint, they become increasingly shared in deeper layers.
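The SAE component can be sketched in miniature. This is a generic sparse autoencoder, not the paper's multimodal architecture or training setup: an encoder maps activations to a wider sparse code via ReLU, a decoder reconstructs the input, and training minimizes reconstruction error plus an L1 sparsity penalty. All dimensions, data, and hyperparameters below are synthetic stand-ins.

```python
import numpy as np

# Minimal sparse-autoencoder sketch (illustrative, not the paper's code).
# z = ReLU(x W_enc + b_enc) is an overcomplete sparse code; the decoder
# reconstructs x_hat = z W_dec + b_dec. Loss = reconstruction + L1 on z.
rng = np.random.default_rng(1)
d_model, d_dict, n = 16, 64, 256
X = rng.normal(size=(n, d_model))  # stand-in for model activations

W_enc = 0.1 * rng.normal(size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = 0.1 * rng.normal(size=(d_dict, d_model))
b_dec = np.zeros(d_model)
lam, lr = 1e-3, 1e-2  # sparsity weight, learning rate

def forward(X):
    z = np.maximum(X @ W_enc + b_enc, 0.0)  # sparse code
    return z, z @ W_dec + b_dec             # reconstruction

def loss_fn(X, z, X_hat):
    err = X_hat - X
    return (err ** 2).sum(axis=1).mean() + lam * np.abs(z).sum(axis=1).mean()

z, X_hat = forward(X)
initial_loss = loss_fn(X, z, X_hat)

for _ in range(300):  # full-batch gradient descent with manual gradients
    z, X_hat = forward(X)
    err = X_hat - X
    g_Xhat = 2.0 * err / n                  # grad of mean squared error
    g_Wdec = z.T @ g_Xhat
    g_bdec = g_Xhat.sum(axis=0)
    g_z = (g_Xhat @ W_dec.T + lam * np.sign(z) / n) * (z > 0)  # ReLU mask
    g_Wenc = X.T @ g_z
    g_benc = g_z.sum(axis=0)
    W_enc -= lr * g_Wenc; b_enc -= lr * g_benc
    W_dec -= lr * g_Wdec; b_dec -= lr * g_bdec

z, X_hat = forward(X)
final_loss = loss_fn(X, z, X_hat)
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

In the paper's setting, the dictionary columns of `W_dec` learned on jointly pooled image-token and text-token activations would be the candidate interpretable cross-modal features; this sketch only shows the training objective in the simplest form.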