🤖 AI Summary
This work addresses the limited spatiotemporal grounding and physical reasoning of current video understanding models, which can answer questions about physical events correctly yet still fail to localize those events in time and space. The authors introduce a unified, grounded benchmark for physical video understanding built from 1,560 base video clips drawn from four sources: Something-Something V2 (SSV2), YouCook2, HoloAssist, and Roundabout-TAU. Each clip is converted into a shared event annotation, from which three prompt families (physics, vstar_like, and neutral_rstr) are generated. Models are evaluated under four input conditions (original, shuffled, ablated, and frame-masked) to probe semantic, temporal, and spatial grounding. By extending the what–when–where evaluation framework to multiple video sources, six physics domains, and several prompt families, the benchmark provides a fine-grained diagnostic that is sensitive to both perturbations and prompt type. Experiments show that models perform best with physics prompts, that robustness across prompt families is selective rather than universal, and that spatial grounding is the weakest component across settings. These findings support reporting physical grounding, prompt sensitivity, and perturbation response alongside aggregate accuracy.
📝 Abstract
Physical video understanding requires more than naming an event correctly. A model can answer a question about pouring, sliding, or collision from textual regularities while still failing to localize the event in time or space. We introduce a grounded benchmark for physical video understanding that extends the what–when–where evaluation structure of V-STaR to four video sources, six physics domains, three prompt families (physics, vstar_like, and neutral_rstr), and four input conditions (original, shuffled, ablated, and frame-masked). The benchmark contains 1,560 base video clips from SSV2, YouCook2, HoloAssist, and Roundabout-TAU. Each clip is first converted into a shared grounded event record, and the three query families are derived from that record. Temporal and spatial targets are shared across prompt families, while the non-physics families use deterministic, family-appropriate semantic a_what targets derived from the same record. Across models and prompt families, physics remains the strongest regime overall, vstar_like is the clearest non-physics semantic comparison, and neutral_rstr behaves as a harder templated control. Prompt-family robustness is selective rather than universal, perturbation gains cluster in cases where performance on the original input is weak, and spatial grounding is the weakest component across settings. These results suggest that video question-answering benchmarks should report physically grounded, prompt-aware, and perturbation-aware diagnostics alongside aggregate accuracy.
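The benchmark layout described in the abstract (one shared event record per clip, three prompt families derived from it, four input conditions) can be illustrated with a minimal Python sketch. This is not the authors' code: the `EventRecord` class, its field names, the `derive_queries` and `evaluation_grid` helpers, the `frame_masked` spelling, and the example values are all hypothetical and chosen only to show how temporal and spatial targets could stay shared while the semantic a_what target varies by family.

```python
from dataclasses import dataclass
from itertools import product

# Family and condition names follow the abstract; "frame_masked" is an
# assumed identifier spelling for the frame-masked condition.
PROMPT_FAMILIES = ("physics", "vstar_like", "neutral_rstr")
CONDITIONS = ("original", "shuffled", "ablated", "frame_masked")


@dataclass(frozen=True)
class EventRecord:
    """Hypothetical shared grounded event record for one clip."""
    clip_id: str
    source: str    # one of the four video sources, e.g. "SSV2"
    what: dict     # prompt family -> semantic target (a_what)
    when: tuple    # (start_sec, end_sec), shared across families
    where: tuple   # (x1, y1, x2, y2) box, shared across families


def derive_queries(record):
    """Derive one query per prompt family from the same record."""
    return [
        {
            "clip_id": record.clip_id,
            "family": family,
            "a_what": record.what[family],  # family-specific semantic target
            "a_when": record.when,          # shared temporal target
            "a_where": record.where,        # shared spatial target
        }
        for family in PROMPT_FAMILIES
    ]


def evaluation_grid(records):
    """Cross every derived query with every input condition."""
    return [
        {**query, "condition": condition}
        for record in records
        for query, condition in product(derive_queries(record), CONDITIONS)
    ]


# Made-up example record; the targets are illustrative, not benchmark data.
example = EventRecord(
    clip_id="clip_0001",
    source="SSV2",
    what={
        "physics": "liquid is poured into a cup",
        "vstar_like": "pouring water",
        "neutral_rstr": "pour",
    },
    when=(1.0, 3.5),
    where=(40, 60, 200, 220),
)
grid = evaluation_grid([example])
print(len(grid))  # 3 families x 4 conditions = 12 cells for one clip
```

If every clip were evaluated under every family and condition (an assumption; the abstract does not state the exact evaluation count), 1,560 clips would yield 1,560 x 3 x 4 = 18,720 evaluation cells per model.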