🤖 AI Summary
This work addresses the challenge that existing vision-language models struggle to capture physically grounded dynamic anomalies, particularly irregular rotations that violate mechanical motion laws. To overcome this limitation, the study introduces an approach that integrates structured physical priors, such as object properties, motion patterns, and dynamical constraints, into instruction tuning for multi-turn vision-language dialogue. By using step-by-step prompting to guide causal reasoning, the model learns robust representations that distinguish normal from anomalous dynamics. Evaluated on the Phys-AD benchmark, the method achieves a video-level AUROC of 96.7%, substantially outperforming the previous state of the art (66.9%), and attains a causal-explanation quality score of 0.777 under LLM-based evaluation, demonstrating both high accuracy and interpretability in dynamic anomaly detection.
📝 Abstract
Vision-Language Models (VLMs) demonstrate strong general-purpose reasoning but remain limited in physics-grounded anomaly detection, where causal understanding of dynamics is essential. Existing VLMs, trained predominantly on appearance-centric correlations, fail to capture kinematic constraints, leading to poor performance on anomalies such as irregular rotations or motions that violate mechanical laws. We introduce a physics-informed instruction tuning framework that explicitly encodes object properties, motion paradigms, and dynamic constraints into structured prompts. By delivering these physical priors through multi-turn dialogues, our method decomposes causal reasoning into incremental steps, enabling robust internal representations of normal and abnormal dynamics. Evaluated on the Phys-AD benchmark, our approach achieves 96.7% AUROC in video-level detection, substantially outperforming the prior state of the art (66.9%), and yields superior causal explanations (0.777 LLM score). This work highlights how structured physics priors can transform VLMs into reliable detectors of dynamic anomalies.
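To make the prompt-structuring idea concrete, here is a minimal sketch of how physical priors (object properties, a motion paradigm, and a dynamic constraint) could be delivered turn by turn in a multi-turn dialogue before the anomaly question is posed. All function names, field names, and prompt wording are illustrative assumptions, not the paper's actual data format.

```python
# Hypothetical sketch of physics-informed multi-turn instruction data.
# The dialogue delivers one physical prior per turn, so causal reasoning
# is decomposed into incremental steps before the final verdict is asked.

def build_physics_dialogue(obj, motion, constraint, question):
    """Compose a multi-turn conversation that states physical priors
    step by step, then asks for an anomaly judgment with reasoning."""
    return [
        {"role": "user", "content": f"Object properties: {obj}"},
        {"role": "assistant", "content": "Understood the object's physical properties."},
        {"role": "user", "content": f"Expected motion paradigm: {motion}"},
        {"role": "assistant", "content": "Understood the expected motion pattern."},
        {"role": "user", "content": f"Dynamic constraint: {constraint}"},
        {"role": "assistant", "content": "Understood the constraint on the dynamics."},
        {"role": "user", "content": question},
    ]

dialogue = build_physics_dialogue(
    obj="rigid fan blade mounted on a fixed rotation axis",
    motion="uniform rotation at a roughly constant angular velocity",
    constraint="angular velocity should not abruptly reverse sign between frames",
    question=(
        "Given the video frames, does the observed rotation violate the "
        "stated constraint? Reason step by step, then answer normal or anomalous."
    ),
)
print(len(dialogue))  # 7 turns: three prior/acknowledgment pairs plus the query
```

A dataset of such dialogues, paired with ground-truth verdicts and causal explanations, is the kind of supervision a physics-informed instruction-tuning pipeline would consume.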