CircuitProbe: Dissecting Spatiotemporal Visual Semantics with Circuit Tracing

📅 2025-07-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
The internal mechanisms underlying spatiotemporal semantic representation in Large Vision-Language Models (LVLMs) remain poorly understood. To address this, we propose the first circuit-level analytical framework that integrates visual auditing, semantic tracing, and attention flow modeling to systematically characterize the dynamic evolution of object- and action-related representations across mid-to-late-layer visual tokens. We find that spatiotemporal semantics are highly localized to specific object tokens, and that concept refinement and functional specialization emerge distinctly in these layers. Through circuit tracing and targeted ablation experiments, we show that removing critical object tokens degrades model performance by up to 92.6%, empirically confirming their representational centrality. This work provides the first mechanistic account of spatiotemporal understanding in LVLMs at the circuit level, establishing a foundational basis for interpretable modeling and robust architectural design.
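As a concrete illustration of the targeted ablation described above, the sketch below zeroes out selected visual-token embeddings and re-runs the model. The helper names, the zeroing strategy, and the `inputs_embeds` interface are assumptions for a LLaVA-style LVLM, not the paper's released code.

```python
# Minimal sketch of object-token ablation, assuming a LLaVA-style LVLM whose
# image patches are projected into visual tokens at known sequence positions.
# Function names and the zeroing strategy are illustrative assumptions.
import torch


@torch.no_grad()
def run_with_ablated_tokens(model, inputs_embeds, attention_mask, ablate_positions):
    """Zero out selected visual-token embeddings and rerun the forward pass."""
    ablated = inputs_embeds.clone()
    ablated[:, ablate_positions, :] = 0.0  # knock out the candidate object tokens
    return model(inputs_embeds=ablated, attention_mask=attention_mask)


def relative_drop(acc_full: float, acc_ablated: float) -> float:
    """Relative performance degradation in percent (the paper reports up to 92.6%)."""
    return (acc_full - acc_ablated) / acc_full * 100.0
```

Comparing task accuracy before and after such an ablation is one way to quantify how central the selected object tokens are to the model's prediction.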

📝 Abstract
The processing mechanisms underlying language and image understanding in large vision-language models (LVLMs) have been extensively studied. However, the internal reasoning mechanisms of LVLMs for spatiotemporal understanding remain poorly understood. In this work, we introduce a systematic, circuit-based framework designed to investigate how spatiotemporal visual semantics are represented and processed within these LVLMs. Specifically, our framework comprises three circuits: a visual auditing circuit, a semantic tracing circuit, and an attention flow circuit. Through the lens of these circuits, we discover that visual semantics are highly localized to specific object tokens: removing these tokens can degrade model performance by up to 92.6%. Furthermore, we identify that interpretable concepts of objects and actions emerge and become progressively refined in the middle-to-late layers of LVLMs. In contrast to current works that focus solely on objects in a single image, we reveal that the middle-to-late layers of LVLMs exhibit specialized functional localization for spatiotemporal semantics. Our findings offer significant mechanistic insights into the spatiotemporal semantic analysis of LVLMs, laying a foundation for designing more robust and interpretable models.
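To make the attention-flow idea concrete, the probe below measures how much attention a query position sends to selected visual tokens at each layer. It assumes a Hugging Face-style model run with `output_attentions=True`; the choice of object-token positions and the single-query aggregation are illustrative assumptions, not the paper's attention flow circuit.

```python
# Minimal layer-wise attention probe, assuming `attentions` is a tuple of
# [batch, heads, seq, seq] tensors (one per layer), as returned by Hugging
# Face models with output_attentions=True. Object-token positions are
# supplied by the caller; this is an illustrative sketch only.
import torch


@torch.no_grad()
def attention_to_object_tokens(attentions, object_positions, query_position=-1):
    """
    For each layer, sum the attention that one query position (by default the
    final text token) sends to the selected visual object tokens, averaged
    over batch and heads. A rise in middle-to-late layers would be consistent
    with the functional localization the abstract describes.
    """
    per_layer = []
    for layer_attn in attentions:                        # [batch, heads, seq, seq]
        to_objects = layer_attn[:, :, query_position, object_positions]
        per_layer.append(to_objects.sum(dim=-1).mean())  # scalar per layer
    return torch.stack(per_layer)                        # [num_layers]
```
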
Problem

Research questions and friction points this paper is trying to address.

Understanding spatiotemporal semantics in LVLMs
Localizing visual semantics to specific object tokens
Analyzing functional layers for object-action refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Circuit-based framework for spatiotemporal semantics
Three circuits: visual auditing, semantic tracing, attention flow
Middle-to-late layers refine object-action concepts
👥 Authors
Yiming Zhang, HFIPS, Chinese Academy of Sciences
Chengzhang Yu, South China University of Technology
Zhuokai Zhao, Research Scientist, Meta AI (LLM Agents, Multimodal LLM Reasoning, Data-Efficient Learning)
Kun Wang, Nanyang Technological University
Qiankun Li, Research Fellow @ NTU, Ph.D. @ USTC (MLLM, AI4Health, Computer Vision, Pattern Recognition, Trustworthy AI)
Zihan Chen, HFIPS, Chinese Academy of Sciences
Yang Liu, Nanyang Technological University
Zenghui Ding, HFIPS, Chinese Academy of Sciences
Yining Sun, Johns Hopkins University (Computer Vision)