🤖 AI Summary
The internal mechanisms underlying spatiotemporal semantic representation in Large Vision-Language Models (LVLMs) remain poorly understood. To address this, we propose the first circuit-level analytical framework that integrates visual auditing, semantic tracing, and attention flow modeling to systematically characterize the dynamic evolution of object- and action-related representations across mid-to-late-layer visual tokens. We find that spatiotemporal semantics are highly localized to specific object tokens, and that concept refinement and functional specialization emerge distinctly in these layers. Through circuit tracing and targeted ablation experiments, we show that removing critical object tokens degrades model performance by up to 92.6%, empirically confirming their representational centrality. This work provides the first mechanistic account of spatiotemporal understanding in LVLMs at the circuit level, establishing a foundation for interpretable modeling and robust architectural design.
📝 Abstract
The processing mechanisms underlying language and image understanding in large vision-language models (LVLMs) have been extensively studied. However, the internal reasoning mechanisms of LVLMs for spatiotemporal understanding remain poorly understood. In this work, we introduce a systematic, circuit-based framework designed to investigate how spatiotemporal visual semantics are represented and processed within LVLMs. Specifically, our framework comprises three circuits: a visual auditing circuit, a semantic tracing circuit, and an attention flow circuit. Through the lens of these circuits, we discover that visual semantics are highly localized to specific object tokens; removing these tokens can degrade model performance by up to 92.6%. Furthermore, we find that interpretable concepts of objects and actions emerge and become progressively refined in the middle-to-late layers of LVLMs. In contrast to current works that focus solely on objects in a single image, we reveal that the middle-to-late layers of LVLMs exhibit specialized functional localization for spatiotemporal semantics. Our findings offer significant mechanistic insights into the spatiotemporal semantics of LVLMs, laying a foundation for designing more robust and interpretable models.
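The token-ablation experiment described above can be illustrated with a minimal toy sketch. This is not the paper's implementation: the function names (`ablate_tokens`, `attention_readout`), the random embeddings, and the simple attention pooling are all illustrative assumptions, standing in for the LVLM's actual visual tokens and attention layers.

```python
import numpy as np

def ablate_tokens(visual_tokens, token_ids):
    """Zero out selected visual-token embeddings (hypothetical ablation step)."""
    ablated = visual_tokens.copy()
    ablated[token_ids] = 0.0
    return ablated

def attention_readout(visual_tokens, query):
    """Toy attention pooling: softmax over query-token scores, then weighted sum."""
    scores = visual_tokens @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ visual_tokens

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))  # 16 mock visual tokens, 8-dim embeddings
query = rng.normal(size=8)         # mock text-side query vector

baseline = attention_readout(tokens, query)
ablated = attention_readout(ablate_tokens(tokens, [3, 7]), query)
drop = np.linalg.norm(baseline - ablated) / np.linalg.norm(baseline)
print(f"relative change in readout after ablating tokens 3 and 7: {drop:.3f}")
```

In the paper's setting, the analogous measurement is made on the model's task performance rather than a readout vector: if performance collapses when a handful of object tokens are removed, those tokens carry the bulk of the spatiotemporal semantics.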