🤖 AI Summary
Current video large language models (Video-LLMs) show strong semantic understanding but fall short on predictive modeling of physical dynamics. This work introduces HOCA-Bench, a benchmark that uses a Hegelian philosophical framework to divide physical anomalies into ontological and causal types. Using state-of-the-art generative video models as adversarial simulators, the authors construct an evaluation dataset of 1,439 videos and 3,470 question-answer pairs. Experiments across 17 Video-LLMs show that models often detect static ontological anomalies such as shape mutations, but their performance drops by more than 20% on questions involving causal mechanisms such as gravity and friction. System-2 "Thinking" modes improve reasoning but do not close this gap, which suggests that current models recognize visual patterns more readily than they apply basic physical laws.
📝 Abstract
Video-LLMs have improved steadily on semantic perception, but they still fall short on predictive world modeling, which is central to physically grounded intelligence. We introduce HOCA-Bench, a benchmark that frames physical anomalies through a Hegelian lens. HOCA-Bench separates anomalies into two types: ontological anomalies, where an entity violates its own definition or persistence, and causal anomalies, where interactions violate physical relations. Using state-of-the-art generative video models as adversarial simulators, we build a testbed of 1,439 videos (3,470 QA pairs). Evaluations on 17 Video-LLMs show a clear cognitive lag: models often identify static ontological violations (e.g., shape mutations) but struggle with causal mechanisms (e.g., gravity or friction), with performance dropping by more than 20% on causal tasks. System-2 "Thinking" modes improve reasoning, but they do not close the gap, suggesting that current architectures recognize visual patterns more readily than they apply basic physical laws.
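The abstract reports the ontological-versus-causal gap as a single figure ("more than 20%") and does not say whether that is an absolute or a relative difference. As a minimal sketch, assuming the gap is an absolute difference in accuracy (percentage points) and that each QA pair carries a category label, the comparison could be computed as below. The record fields (`category`, `answer`, `prediction`) and the toy records are hypothetical and are not taken from HOCA-Bench.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {category: accuracy} for a list of QA records.

    Each record is a dict with hypothetical keys:
    'category', 'answer' (ground truth), and 'prediction' (model output).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["category"]] += 1
        correct[r["category"]] += int(r["prediction"] == r["answer"])
    return {c: correct[c] / total[c] for c in total}


# Toy records for illustration only; these are not HOCA-Bench data or results.
toy_records = [
    {"category": "ontological", "answer": "A", "prediction": "A"},
    {"category": "ontological", "answer": "B", "prediction": "B"},
    {"category": "ontological", "answer": "C", "prediction": "C"},
    {"category": "ontological", "answer": "D", "prediction": "A"},
    {"category": "causal", "answer": "A", "prediction": "A"},
    {"category": "causal", "answer": "B", "prediction": "C"},
    {"category": "causal", "answer": "C", "prediction": "C"},
    {"category": "causal", "answer": "D", "prediction": "B"},
]

acc = accuracy_by_category(toy_records)

# Gap in percentage points (assumed definition, see the note above).
gap_points = 100 * (acc["ontological"] - acc["causal"])
print(f"ontological={acc['ontological']:.2f} causal={acc['causal']:.2f} gap={gap_points:.1f} points")
```

Reporting the gap in percentage points keeps it comparable across models with different baseline accuracy. A relative drop (gap divided by ontological accuracy) would give a different number for the same predictions, so the definition matters when reading the 20% figure.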