🤖 AI Summary
This work addresses the challenge that existing vision-language models struggle to capture physically grounded dynamic anomalies, particularly irregular rotations that violate mechanical motion laws. To overcome this limitation, the study introduces an approach that integrates structured physical priors, such as object properties, motion patterns, and dynamical constraints, into instruction tuning for multi-turn vision-language dialogue. By using step-by-step prompting to guide causal reasoning, the model learns robust representations that distinguish normal from anomalous dynamics. Evaluated on the Phys-AD benchmark, the method achieves a video-level AUROC of 96.7%, substantially outperforming the previous state of the art (66.9%), and attains a causal-explanation quality score of 0.777 under LLM-based evaluation, demonstrating both high accuracy and interpretability in dynamic anomaly detection.
📝 Abstract
Vision-Language Models (VLMs) demonstrate strong general-purpose reasoning but remain limited in physics-grounded anomaly detection, where causal understanding of dynamics is essential. Existing VLMs, trained predominantly on appearance-centric correlations, fail to capture kinematic constraints, leading to poor performance on anomalies such as irregular rotations or motions that violate mechanical laws. We introduce a physics-informed instruction tuning framework that explicitly encodes object properties, motion paradigms, and dynamic constraints into structured prompts. By delivering these physical priors through multi-turn dialogues, our method decomposes causal reasoning into incremental steps, enabling robust internal representations of normal and abnormal dynamics. Evaluated on the Phys-AD benchmark, our approach achieves 96.7% AUROC in video-level detection, substantially outperforming the prior state of the art (66.9%), and yields superior causal explanations (0.777 LLM score). This work highlights how structured physics priors can transform VLMs into reliable detectors of dynamic anomalies.
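To make the prompt-structuring idea concrete, here is a minimal sketch of how physical priors (object properties, a motion paradigm, and a dynamic constraint) could be delivered turn by turn in a multi-turn dialogue before the anomaly question is posed. All function names, field names, and prompt wording are illustrative assumptions, not the paper's actual data format.

```python
# Hypothetical sketch of physics-informed multi-turn instruction data.
# The dialogue delivers one physical prior per turn, so causal reasoning
# is decomposed into incremental steps before the final verdict is asked.

def build_physics_dialogue(obj, motion, constraint, question):
    """Compose a multi-turn conversation that states physical priors
    step by step, then asks for an anomaly judgment with reasoning."""
    return [
        {"role": "user", "content": f"Object properties: {obj}"},
        {"role": "assistant", "content": "Understood the object's physical properties."},
        {"role": "user", "content": f"Expected motion paradigm: {motion}"},
        {"role": "assistant", "content": "Understood the expected motion pattern."},
        {"role": "user", "content": f"Dynamic constraint: {constraint}"},
        {"role": "assistant", "content": "Understood the constraint on the dynamics."},
        {"role": "user", "content": question},
    ]

dialogue = build_physics_dialogue(
    obj="rigid fan blade mounted on a fixed rotation axis",
    motion="uniform rotation at a roughly constant angular velocity",
    constraint="angular velocity should not abruptly reverse sign between frames",
    question=(
        "Given the video frames, does the observed rotation violate the "
        "stated constraint? Reason step by step, then answer normal or anomalous."
    ),
)
print(len(dialogue))  # 7 turns: three prior/acknowledgment pairs plus the query
```

A dataset of such dialogues, paired with ground-truth verdicts and causal explanations, is the kind of supervision a physics-informed instruction-tuning pipeline would consume.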