🤖 AI Summary
This work investigates how the language backbone of vision-language models adapts to visual information during multimodal fine-tuning, with a focus on the emergence of spatial reasoning capabilities. To this end, the authors propose a stage-wise model diffing approach that integrates controlled prompt perturbations, causal tracing of attention heads, and representational analysis, offering the first mechanistic dissection of this fine-tuning process. The study reveals that vision-preferring features emerging during fine-tuning reliably encode spatial relationships and can be traced back to a small set of critical attention heads. These findings uncover the pathways through which textual representations are reshaped by visual signals, substantially improving the interpretability of multimodal training dynamics.
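To make the diffing idea concrete, here is a minimal sketch of what stage-wise model diffing could look like in practice. It assumes we already have a feature dictionary (e.g., decoder directions) and per-prompt feature activations from the same language backbone before and after multimodal fine-tuning; all tensors below are random stand-ins, and the scoring heuristic is illustrative rather than the paper's exact procedure.

```python
# Sketch: surface features that reorient during fine-tuning and prefer visual inputs.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_prompts, d_model, n_features = 256, 1024, 512

# Stand-ins: feature directions learned at the same layer of the base LM
# and of the fine-tuned VLM backbone (in practice, from hooks / SAE decoders).
base_features = F.normalize(torch.randn(n_features, d_model), dim=-1)
tuned_features = F.normalize(torch.randn(n_features, d_model), dim=-1)

# Stand-ins: per-prompt feature activations on matched text-only vs.
# image-conditioned prompts after fine-tuning.
acts_text = torch.randn(n_prompts, n_features).abs()
acts_image = torch.randn(n_prompts, n_features).abs()

# 1) How much did each feature direction reorient? (cosine distance)
reorientation = 1.0 - (base_features * tuned_features).sum(dim=-1)

# 2) Which features prefer visual inputs? (normalized activation gap)
vision_preference = (acts_image.mean(0) - acts_text.mean(0)) / (
    acts_image.mean(0) + acts_text.mean(0) + 1e-6
)

# Rank candidate vision-preferring features that also moved the most.
score = vision_preference * reorientation
top = torch.topk(score, k=10).indices
print("candidate vision-preferring features:", top.tolist())
```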
📝 Abstract
Contemporary Vision-Language Models (VLMs) achieve strong performance on a wide range of tasks by pairing a vision encoder with a pre-trained language model that is fine-tuned on visual-text inputs. Yet despite these gains, it remains unclear how language backbone representations adapt during multimodal training and when vision-specific capabilities emerge. In this work, we present the first mechanistic analysis of VLM adaptation. Using stage-wise model diffing, a technique that isolates representational changes introduced during multimodal fine-tuning, we reveal how a language model learns to "see". We first identify vision-preferring features that emerge or reorient during fine-tuning. We then show that a selective subset of these features reliably encodes spatial relations, revealed through controlled shifts to spatial prompts. Finally, we trace the causal activation of these features to a small group of attention heads. Our findings show that stage-wise model diffing reveals when and where spatially grounded multimodal features arise, and it provides a clearer view of modality fusion by showing how visual grounding reshapes features that were previously text-only. This methodology enhances the interpretability of multimodal training and provides a foundation for understanding and refining how pretrained language models acquire vision-grounded capabilities.
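The causal-tracing step can likewise be sketched in a few lines. The toy below assumes we can re-run the backbone with individual attention heads ablated and read off a target spatial feature's activation; the forward pass is mocked by a simple sum of per-head contributions, so layer/head indices and effect sizes are purely illustrative, not the paper's measured results.

```python
# Sketch: rank attention heads by the causal effect of ablating them
# on a target feature's activation.
import torch

torch.manual_seed(0)
n_layers, n_heads = 4, 8

# Stand-in: contribution of each head's output to the target spatial feature.
head_contributions = torch.randn(n_layers, n_heads).abs()

def feature_activation(ablate=None):
    """Toy forward pass: the feature activation is the sum of head contributions,
    with one (layer, head) optionally zeroed to simulate ablation."""
    contrib = head_contributions.clone()
    if ablate is not None:
        layer, head = ablate
        contrib[layer, head] = 0.0
    return contrib.sum().item()

baseline = feature_activation()
effects = torch.zeros(n_layers, n_heads)
for layer in range(n_layers):
    for head in range(n_heads):
        effects[layer, head] = baseline - feature_activation(ablate=(layer, head))

# Heads whose ablation most reduces the feature are the causally implicated ones.
flat = effects.flatten().topk(k=5).indices
print("top heads (layer, head):",
      [(int(i // n_heads), int(i % n_heads)) for i in flat])
```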