🤖 AI Summary
This study investigates how vision-language large models (VLLMs) represent image concepts and share representations across modalities, using the open-source LLaVA-Next as a testbed. To address these questions, we employ linear probing, causal intervention experiments, and a multimodal sparse autoencoder (SAE) framework, identifying and validating a diverse set of causally significant, linearly decodable features—each corresponding to an ImageNet class—within the residual stream. We make three key contributions: (1) We show that image and text representations, though largely disjoint across modalities, become increasingly shared in deeper layers, indicating progressive cross-modal fusion; (2) We train multimodal SAEs designed for joint image-text modeling, yielding an interpretable cross-modal feature dictionary; (3) We demonstrate that LLaVA-Next achieves strong cross-modal semantic alignment, with high-level representations supporting linear, causal, and interpretable image concept decoding.
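The linear-probing step can be illustrated with a minimal sketch. This is not the paper's code: the synthetic Gaussian activations with a planted linear "concept" direction stand in for hidden states that would, in the actual study, be extracted from LLaVA-Next's residual stream, and all names and dimensions here are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of linear probing for a concept in a residual stream.
# Synthetic activations with a planted linear direction stand in for real
# hidden states extracted from the model (all names are illustrative).
rng = np.random.default_rng(0)
d_model, n_train, n_test = 64, 300, 100

concept_dir = rng.normal(size=d_model)            # planted concept direction
X = rng.normal(size=(n_train + n_test, d_model))  # stand-in activations
y = (X @ concept_dir > 0).astype(int)             # 1 = concept present (e.g. an ImageNet class)

X_tr, y_tr = X[:n_train], y[:n_train]
X_te, y_te = X[n_train:], y[n_train:]

# Least-squares linear probe: regress +/-1 labels onto activations,
# then classify held-out activations by the sign of the projection.
w, *_ = np.linalg.lstsq(X_tr, 2.0 * y_tr - 1.0, rcond=None)
acc = ((X_te @ w > 0).astype(int) == y_te).mean()
print(f"probe test accuracy: {acc:.2f}")
```

High held-out accuracy is evidence that the concept is linearly decodable from the activations; in the paper's setting, the probe would be fit per ImageNet class on real extracted activations rather than synthetic data.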
📝 Abstract
Language models can be equipped with multimodal capabilities by fine-tuning on embeddings of visual inputs. But how do such multimodal models represent images in their hidden activations? We explore representations of image concepts within LLaVA-Next, a popular open-source VLLM. We find a diverse set of ImageNet classes represented via linearly decodable features in the residual stream. We show that these features are causal by performing targeted edits on them and measuring the effect on model output. To increase the diversity of the studied linear features, we train multimodal Sparse Autoencoders (SAEs), creating a highly interpretable dictionary of text and image features. We find that although model representations across modalities are quite disjoint, they become increasingly shared in deeper layers.
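The SAE component can be sketched in miniature. This is a generic sparse autoencoder, not the paper's multimodal architecture or training setup: an encoder maps activations to a wider sparse code via ReLU, a decoder reconstructs the input, and training minimizes reconstruction error plus an L1 sparsity penalty. All dimensions, data, and hyperparameters below are synthetic stand-ins.

```python
import numpy as np

# Minimal sparse-autoencoder sketch (illustrative, not the paper's code).
# z = ReLU(x W_enc + b_enc) is an overcomplete sparse code; the decoder
# reconstructs x_hat = z W_dec + b_dec. Loss = reconstruction + L1 on z.
rng = np.random.default_rng(1)
d_model, d_dict, n = 16, 64, 256
X = rng.normal(size=(n, d_model))  # stand-in for model activations

W_enc = 0.1 * rng.normal(size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = 0.1 * rng.normal(size=(d_dict, d_model))
b_dec = np.zeros(d_model)
lam, lr = 1e-3, 1e-2  # sparsity weight, learning rate

def forward(X):
    z = np.maximum(X @ W_enc + b_enc, 0.0)  # sparse code
    return z, z @ W_dec + b_dec             # reconstruction

def loss_fn(X, z, X_hat):
    err = X_hat - X
    return (err ** 2).sum(axis=1).mean() + lam * np.abs(z).sum(axis=1).mean()

z, X_hat = forward(X)
initial_loss = loss_fn(X, z, X_hat)

for _ in range(300):  # full-batch gradient descent with manual gradients
    z, X_hat = forward(X)
    err = X_hat - X
    g_Xhat = 2.0 * err / n                  # grad of mean squared error
    g_Wdec = z.T @ g_Xhat
    g_bdec = g_Xhat.sum(axis=0)
    g_z = (g_Xhat @ W_dec.T + lam * np.sign(z) / n) * (z > 0)  # ReLU mask
    g_Wenc = X.T @ g_z
    g_benc = g_z.sum(axis=0)
    W_enc -= lr * g_Wenc; b_enc -= lr * g_benc
    W_dec -= lr * g_Wdec; b_dec -= lr * g_bdec

z, X_hat = forward(X)
final_loss = loss_fn(X, z, X_hat)
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

In the paper's setting, the dictionary columns of `W_dec` learned on jointly pooled image-token and text-token activations would be the candidate interpretable cross-modal features; this sketch only shows the training objective in the simplest form.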