Can Your Model Separate Yolks with a Water Bottle? Benchmarking Physical Commonsense Understanding in Video Generation Models

📅 2025-07-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-video (T2V) models exhibit systematic deficits in physical commonsense reasoning, frequently generating videos that violate causal principles, object dynamics, and tool-use logic. Method: We introduce PhysVidBench, the first benchmark explicitly designed to evaluate physical commonsense in T2V generation, focusing on causality, object behavior, and tool-mediated interactions. It comprises 383 structured prompts and employs a three-stage indirect evaluation framework: (1) generating physics-grounded questions from prompts, (2) using vision-language models to describe generated videos, and (3) leveraging large language models to reason over the descriptions and answer the questions. This indirection mitigates the hallucination bias inherent in direct assessment while improving interpretability and reliability. Contribution/Results: Experiments reveal consistent physical implausibility across leading T2V models. PhysVidBench is reproducible and extensible, and it establishes a new paradigm for evaluating physical commonsense in T2V systems.

📝 Abstract
Recent progress in text-to-video (T2V) generation has enabled the synthesis of visually compelling and temporally coherent videos from natural language. However, these models often fall short in basic physical commonsense, producing outputs that violate intuitive expectations around causality, object behavior, and tool use. Addressing this gap, we present PhysVidBench, a benchmark designed to evaluate the physical reasoning capabilities of T2V systems. The benchmark includes 383 carefully curated prompts emphasizing tool use, material properties, and procedural interactions: domains where physical plausibility is crucial. For each prompt, we generate videos using diverse state-of-the-art models and adopt a three-stage evaluation pipeline: (1) formulate grounded physics questions from the prompt, (2) caption the generated video with a vision-language model, and (3) task a language model with answering several physics-grounded questions using only the caption. This indirect strategy circumvents common hallucination issues in direct video-based evaluation. By highlighting affordances and tool-mediated actions, areas overlooked in current T2V evaluations, PhysVidBench provides a structured, interpretable framework for assessing physical commonsense in generative video models.
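
To make the three-stage pipeline concrete, the sketch below walks a single prompt/video pair through all three stages. It is a minimal illustration and not the authors' released code: the `llm` and `vlm` callables, the prompt wordings, and the yes/no answer format are all assumptions.

```python
# Minimal sketch of PhysVidBench's three-stage indirect evaluation.
# The llm/vlm interfaces, prompt wordings, and yes/no format below are
# assumptions for illustration, not the paper's released implementation.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    prompt: str
    questions: list[str]
    caption: str
    answers: list[bool]  # True = judged physically plausible

def generate_questions(llm: Callable[[str], str], prompt: str) -> list[str]:
    """Stage 1: derive physics-grounded yes/no questions from the text prompt."""
    reply = llm("List yes/no questions testing the physics implied by: " + prompt)
    return [line.strip() for line in reply.splitlines() if line.strip()]

def caption_video(vlm: Callable[[str, str], str], video_path: str) -> str:
    """Stage 2: describe the generated video with a vision-language model."""
    return vlm(video_path, "Describe every object, action, and outcome in the video.")

def answer_from_caption(llm: Callable[[str], str], caption: str, question: str) -> bool:
    """Stage 3: answer from the caption alone, never the raw video;
    this indirection is what sidesteps direct-evaluation hallucinations."""
    reply = llm(f"Caption: {caption}\nQuestion: {question}\nAnswer yes or no.")
    return reply.strip().lower().startswith("yes")

def evaluate(llm, vlm, prompt: str, video_path: str) -> EvalResult:
    """Run all three stages for one prompt/video pair."""
    questions = generate_questions(llm, prompt)
    caption = caption_video(vlm, video_path)
    answers = [answer_from_caption(llm, caption, q) for q in questions]
    return EvalResult(prompt, questions, caption, answers)
```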
Problem

Research questions and friction points this paper is trying to address.

Evaluating physical commonsense understanding in video generation models
Assessing tool use and material properties in synthesized videos
Benchmarking physical plausibility through indirect question-answering evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark PhysVidBench for physical commonsense evaluation
Three-stage pipeline: physics questions, video captioning, LM answers
Indirect evaluation strategy to avoid hallucination issues (a hypothetical scoring rule is sketched below)
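
Continuing the illustration above, one plausible way to turn the per-question yes/no answers into a model-level score is a simple pass rate. The equal-weight pooling here is an assumption; the paper may aggregate differently.

```python
# Hypothetical aggregation of per-prompt yes/no answers into one score.
# Equal weighting across questions and prompts is an assumption.
def plausibility_score(answers_per_prompt: list[list[bool]]) -> float:
    """Fraction of physics questions judged plausible, pooled over all prompts."""
    total = sum(len(answers) for answers in answers_per_prompt)
    plausible = sum(sum(answers) for answers in answers_per_prompt)
    return plausible / total if total else 0.0

# Example: two prompts with three questions each; 4 of 6 checks pass.
print(plausibility_score([[True, False, True], [True, True, False]]))  # ~0.667
```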
Enes Sanli
Department of Computer Engineering, Koç University, Istanbul, Turkey

Baris Sarper Tezcan
Department of Computer Engineering, Koç University, Istanbul, Turkey

Aykut Erdem
Associate Professor of Computer Science, Koç University, Istanbul, Turkey
Computer Vision, Natural Language Processing, Machine Learning, Artificial Intelligence

Erkut Erdem
Professor of Computer Science, Hacettepe University
Computer Vision, Machine Learning