🤖 AI Summary
This work investigates the fundamental causes underlying the performance limitations of multimodal language models (MLMs) on perception-intensive tasks. Through zero-shot evaluation, feature visualization, and controlled ablation experiments, we systematically analyze the encoding, propagation, and activation of visual key-value tokens in state-of-the-art models, including LLaVA-OneVision, Qwen2.5-VL, and Llama-3-LLaVA-NeXT. We identify three key findings: (1) input-agnostic visual key tokens in later layers introduce artifacts that degrade perception; (2) text prefixes dynamically modulate visual representations, substantially enhancing perception; and (3) substantial internal visual information remains underutilized: in 33.3% of BLINK Art Style questions, perceptual information present in the language model never surfaces in the output, and fine-tuned MLMs exhibit weaker visual representations than the original SigLIP encoder. Crucially, we demonstrate for the first time that image value tokens alone suffice for zero-shot segmentation and semantic correspondence, revealing a core bottleneck: "sufficient encoding but insufficient utilization" of visual information.
📝 Abstract
Despite interpretability work analyzing ViT encoders and transformer activations, we do not yet understand why Multimodal Language Models (MLMs) struggle on perception-heavy tasks. We offer an under-studied perspective by examining how popular MLMs (LLaVA-OneVision, Qwen2.5-VL, and Llama-3-LLaVA-NeXT) process their visual key-value tokens. We first study the flow of visual information through the language model, finding that image value tokens encode sufficient information to perform several perception-heavy tasks zero-shot: segmentation, semantic correspondence, temporal correspondence, and referring expression detection. We find that while the language model does augment the visual information received from the projection of input visual encodings (which, we reveal, correlates with overall MLM perception capability), it contains less visual information on several tasks than the equivalent visual encoder (SigLIP) that has not undergone MLM finetuning. Further, we find that the visual information corresponding to input-agnostic image key tokens in later layers of the language model contains artifacts that reduce the perception capability of the overall MLM. Next, we discuss controlling visual information in the language model, showing that adding a text prefix to the image input improves the perception capabilities of visual representations. Finally, we reveal that if language models were able to better control their visual information, their perception would significantly improve; e.g., in 33.3% of Art Style questions in the BLINK benchmark, perception information present in the language model is not surfaced to the output! Our findings reveal insights into the role of key-value tokens in multimodal systems, paving the way for deeper mechanistic interpretability of MLMs and suggesting new directions for training their visual encoder and language model components.
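The abstract's core probe, using image value tokens alone for zero-shot segmentation and semantic correspondence, can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: it stands in random arrays for the per-patch value vectors one would extract from an MLM's attention layers, clusters them for segmentation, and matches them across images by cosine similarity. All function names and shapes here are hypothetical.

```python
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    """Simple k-means over per-patch feature vectors (stand-in for a real clusterer)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # squared Euclidean distance from every patch to every center
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = x[labels == j].mean(0)
    return labels

def zero_shot_segmentation(value_tokens, grid, k=4):
    """Cluster per-patch value tokens into k segments, reshaped to the patch grid."""
    return kmeans(value_tokens, k).reshape(grid)

def semantic_correspondence(values_a, values_b):
    """Match each patch in image A to its most cosine-similar patch in image B."""
    a = values_a / np.linalg.norm(values_a, axis=1, keepdims=True)
    b = values_b / np.linalg.norm(values_b, axis=1, keepdims=True)
    return (a @ b.T).argmax(1)

# Hypothetical value tokens: 64 patches (an 8x8 grid), 16 dims each.
vals = np.random.default_rng(0).normal(size=(64, 16))
segments = zero_shot_segmentation(vals, grid=(8, 8), k=4)
matches = semantic_correspondence(vals, vals)
```

In the paper's setting, `value_tokens` would come from hooking a specific attention layer's value projection at the image-token positions; the point of the probe is that these vectors alone, with no decoding by the language model, already support these perception tasks.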