🤖 AI Summary
Current text-to-video (T2V) models exhibit systematic deficits in physical commonsense reasoning, frequently generating videos that violate causal principles, object dynamics, and tool-use logic.
Method: We introduce PhysVidBench—the first benchmark explicitly designed to evaluate physical commonsense in T2V generation, focusing on causality, object behavior, and tool-mediated interactions. It comprises 383 structured prompts and employs a three-stage indirect evaluation framework: (1) generating physics-grounded questions from prompts, (2) using vision-language models to describe generated videos, and (3) leveraging large language models for logical reasoning and answer generation—thereby mitigating hallucination bias inherent in direct assessment while improving interpretability and reliability.
Contribution/Results: Experiments reveal consistent physical implausibility across leading T2V models. PhysVidBench is reproducible, extensible, and establishes a new paradigm for modeling and evaluating physical commonsense in T2V systems.
📝 Abstract
Recent progress in text-to-video (T2V) generation has enabled the synthesis of visually compelling and temporally coherent videos from natural language. However, these models often fall short in basic physical commonsense, producing outputs that violate intuitive expectations around causality, object behavior, and tool use. Addressing this gap, we present PhysVidBench, a benchmark designed to evaluate the physical reasoning capabilities of T2V systems. The benchmark includes 383 carefully curated prompts emphasizing tool use, material properties, and procedural interactions: domains where physical plausibility is crucial. For each prompt, we generate videos using diverse state-of-the-art models and adopt a three-stage evaluation pipeline: (1) formulate grounded physics questions from the prompt, (2) caption the generated video with a vision-language model, and (3) task a language model with answering the physics questions using only the caption. This indirect strategy circumvents the hallucination issues common in direct video-based evaluation. By highlighting affordances and tool-mediated actions, areas overlooked in current T2V evaluations, PhysVidBench provides a structured, interpretable framework for assessing physical commonsense in generative video models.
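The three-stage pipeline described above can be sketched as follows. This is a minimal illustration, not the actual PhysVidBench implementation: the functions `generate_questions`, `caption_video`, and `answer_from_caption` are hypothetical stand-ins for the LLM and VLM calls the paper describes, with trivial string logic in place of real model inference.

```python
# Hypothetical sketch of PhysVidBench's three-stage indirect evaluation.
# All three stage functions are placeholders for real LLM/VLM API calls.

def generate_questions(prompt: str) -> list[str]:
    # Stage 1: derive physics-grounded questions from the text prompt.
    # In the actual benchmark an LLM produces these; a template stands in here.
    return [
        f"Does the video plausibly depict: {prompt}?",
        f"Is the causal order of events implied by '{prompt}' preserved?",
    ]

def caption_video(video_path: str) -> str:
    # Stage 2: a vision-language model describes the generated video.
    # Placeholder caption; a real system would run a VLM on the frames.
    return f"A video ({video_path}) showing a hand tightening a screw with a tool."

def answer_from_caption(caption: str, question: str) -> bool:
    # Stage 3: an LLM answers each question using ONLY the caption,
    # which is what mitigates hallucination from direct video assessment.
    # Toy heuristic in place of real reasoning: keyword overlap.
    return any(word in caption.lower() for word in question.lower().split())

def evaluate(prompt: str, video_path: str) -> float:
    """Return the fraction of physics questions answered affirmatively."""
    questions = generate_questions(prompt)
    caption = caption_video(video_path)
    answers = [answer_from_caption(caption, q) for q in questions]
    return sum(answers) / len(answers)
```

The key design choice the sketch reflects is that the video itself never reaches the answering model: stage 3 reasons over the stage-2 caption alone, trading some visual detail for interpretability and reduced hallucination bias.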