🤖 AI Summary
This work investigates how the language backbone of vision-language models adapts to visual information during multimodal fine-tuning, with a focus on the emergence of spatial reasoning capabilities. To this end, the authors propose a stage-wise model diffing approach that integrates controlled prompt perturbations, causal tracing of attention heads, and representational analysis, offering the first mechanistic dissection of this fine-tuning process. The study reveals that vision-preferring features emerging during fine-tuning reliably encode spatial relationships and can be traced back to a small set of critical attention heads. These findings uncover the pathways through which textual representations are reshaped by visual signals, substantially improving the interpretability of multimodal training dynamics.
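To make the diffing idea concrete, here is a minimal sketch of what stage-wise model diffing could look like in practice. It assumes we already have a feature dictionary (e.g., decoder directions) and per-prompt feature activations from the same language backbone before and after multimodal fine-tuning; all tensors below are random stand-ins, and the scoring heuristic is illustrative rather than the paper's exact procedure.

```python
# Sketch: surface features that reorient during fine-tuning and prefer visual inputs.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_prompts, d_model, n_features = 256, 1024, 512

# Stand-ins: feature directions learned at the same layer of the base LM
# and of the fine-tuned VLM backbone (in practice, from hooks / SAE decoders).
base_features = F.normalize(torch.randn(n_features, d_model), dim=-1)
tuned_features = F.normalize(torch.randn(n_features, d_model), dim=-1)

# Stand-ins: per-prompt feature activations on matched text-only vs.
# image-conditioned prompts after fine-tuning.
acts_text = torch.randn(n_prompts, n_features).abs()
acts_image = torch.randn(n_prompts, n_features).abs()

# 1) How much did each feature direction reorient? (cosine distance)
reorientation = 1.0 - (base_features * tuned_features).sum(dim=-1)

# 2) Which features prefer visual inputs? (normalized activation gap)
vision_preference = (acts_image.mean(0) - acts_text.mean(0)) / (
    acts_image.mean(0) + acts_text.mean(0) + 1e-6
)

# Rank candidate vision-preferring features that also moved the most.
score = vision_preference * reorientation
top = torch.topk(score, k=10).indices
print("candidate vision-preferring features:", top.tolist())
```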
📝 Abstract
Contemporary Vision-Language Models (VLMs) achieve strong performance on a wide range of tasks by pairing a vision encoder with a pre-trained language model that is fine-tuned on visual-text inputs. Yet despite these gains, it remains unclear how language backbone representations adapt during multimodal training and when vision-specific capabilities emerge. In this work, we present the first mechanistic analysis of VLM adaptation. Using stage-wise model diffing, a technique that isolates representational changes introduced during multimodal fine-tuning, we reveal how a language model learns to "see". We first identify vision-preferring features that emerge or reorient during fine-tuning. We then show that a selective subset of these features reliably encodes spatial relations, revealed through controlled shifts to spatial prompts. Finally, we trace the causal activation of these features to a small group of attention heads. Our findings show that stage-wise model diffing reveals when and where spatially grounded multimodal features arise, and it provides a clearer view of modality fusion by showing how visual grounding reshapes features that were previously text-only. This methodology enhances the interpretability of multimodal training and provides a foundation for understanding and refining how pretrained language models acquire vision-grounded capabilities.
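The causal-tracing step can likewise be sketched in a few lines. The toy below assumes we can re-run the backbone with individual attention heads ablated and read off a target spatial feature's activation; the forward pass is mocked by a simple sum of per-head contributions, so layer/head indices and effect sizes are purely illustrative, not the paper's measured results.

```python
# Sketch: rank attention heads by the causal effect of ablating them
# on a target feature's activation.
import torch

torch.manual_seed(0)
n_layers, n_heads = 4, 8

# Stand-in: contribution of each head's output to the target spatial feature.
head_contributions = torch.randn(n_layers, n_heads).abs()

def feature_activation(ablate=None):
    """Toy forward pass: the feature activation is the sum of head contributions,
    with one (layer, head) optionally zeroed to simulate ablation."""
    contrib = head_contributions.clone()
    if ablate is not None:
        layer, head = ablate
        contrib[layer, head] = 0.0
    return contrib.sum().item()

baseline = feature_activation()
effects = torch.zeros(n_layers, n_heads)
for layer in range(n_layers):
    for head in range(n_heads):
        effects[layer, head] = baseline - feature_activation(ablate=(layer, head))

# Heads whose ablation most reduces the feature are the causally implicated ones.
flat = effects.flatten().topk(k=5).indices
print("top heads (layer, head):",
      [(int(i // n_heads), int(i % n_heads)) for i in flat])
```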