Can Your Model Separate Yolks with a Water Bottle? Benchmarking Physical Commonsense Understanding in Video Generation Models

📅 2025-07-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-video (T2V) models exhibit systematic deficits in physical commonsense reasoning, frequently generating videos that violate causal principles, object dynamics, and tool-use logic. Method: We introduce PhysVidBench, the first benchmark explicitly designed to evaluate physical commonsense in T2V generation, focusing on causality, object behavior, and tool-mediated interactions. It comprises 383 structured prompts and employs a three-stage indirect evaluation framework: (1) generating physics-grounded questions from prompts, (2) using vision-language models to describe generated videos, and (3) leveraging large language models to reason over the descriptions and answer the questions. This indirection mitigates the hallucination bias inherent in direct assessment while improving interpretability and reliability. Contribution/Results: Experiments reveal consistent physical implausibility across leading T2V models. PhysVidBench is reproducible and extensible, and it establishes a new paradigm for evaluating physical commonsense in T2V systems.

📝 Abstract
Recent progress in text-to-video (T2V) generation has enabled the synthesis of visually compelling and temporally coherent videos from natural language. However, these models often fall short in basic physical commonsense, producing outputs that violate intuitive expectations around causality, object behavior, and tool use. Addressing this gap, we present PhysVidBench, a benchmark designed to evaluate the physical reasoning capabilities of T2V systems. The benchmark includes 383 carefully curated prompts emphasizing tool use, material properties, and procedural interactions: domains where physical plausibility is crucial. For each prompt, we generate videos using diverse state-of-the-art models and adopt a three-stage evaluation pipeline: (1) formulate grounded physics questions from the prompt, (2) caption the generated video with a vision-language model, and (3) task a language model with answering several physics-grounded questions using only the caption. This indirect strategy circumvents common hallucination issues in direct video-based evaluation. By highlighting affordances and tool-mediated actions, areas overlooked in current T2V evaluations, PhysVidBench provides a structured, interpretable framework for assessing physical commonsense in generative video models.
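
To make the three-stage pipeline concrete, the sketch below walks a single prompt/video pair through all three stages. It is a minimal illustration and not the authors' released code: the `llm` and `vlm` callables, the prompt wordings, and the yes/no answer format are all assumptions.

```python
# Minimal sketch of PhysVidBench's three-stage indirect evaluation.
# The llm/vlm interfaces, prompt wordings, and yes/no format below are
# assumptions for illustration, not the paper's released implementation.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    prompt: str
    questions: list[str]
    caption: str
    answers: list[bool]  # True = judged physically plausible

def generate_questions(llm: Callable[[str], str], prompt: str) -> list[str]:
    """Stage 1: derive physics-grounded yes/no questions from the text prompt."""
    reply = llm("List yes/no questions testing the physics implied by: " + prompt)
    return [line.strip() for line in reply.splitlines() if line.strip()]

def caption_video(vlm: Callable[[str, str], str], video_path: str) -> str:
    """Stage 2: describe the generated video with a vision-language model."""
    return vlm(video_path, "Describe every object, action, and outcome in the video.")

def answer_from_caption(llm: Callable[[str], str], caption: str, question: str) -> bool:
    """Stage 3: answer from the caption alone, never the raw video;
    this indirection is what sidesteps direct-evaluation hallucinations."""
    reply = llm(f"Caption: {caption}\nQuestion: {question}\nAnswer yes or no.")
    return reply.strip().lower().startswith("yes")

def evaluate(llm, vlm, prompt: str, video_path: str) -> EvalResult:
    """Run all three stages for one prompt/video pair."""
    questions = generate_questions(llm, prompt)
    caption = caption_video(vlm, video_path)
    answers = [answer_from_caption(llm, caption, q) for q in questions]
    return EvalResult(prompt, questions, caption, answers)
```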
Problem

Research questions and friction points this paper is trying to address.

Evaluating physical commonsense understanding in video generation models
Assessing tool use and material properties in synthesized videos
Benchmarking physical plausibility through indirect question-answering evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark PhysVidBench for physical commonsense evaluation
Three-stage pipeline: physics questions, video captioning, LM answers
Indirect evaluation strategy to avoid hallucination issues (a hypothetical scoring rule is sketched below)
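
Continuing the illustration above, one plausible way to turn the per-question yes/no answers into a model-level score is a simple pass rate. The equal-weight pooling here is an assumption; the paper may aggregate differently.

```python
# Hypothetical aggregation of per-prompt yes/no answers into one score.
# Equal weighting across questions and prompts is an assumption.
def plausibility_score(answers_per_prompt: list[list[bool]]) -> float:
    """Fraction of physics questions judged plausible, pooled over all prompts."""
    total = sum(len(answers) for answers in answers_per_prompt)
    plausible = sum(sum(answers) for answers in answers_per_prompt)
    return plausible / total if total else 0.0

# Example: two prompts with three questions each; 4 of 6 checks pass.
print(plausibility_score([[True, False, True], [True, True, False]]))  # ~0.667
```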
Enes Sanli
Department of Computer Engineering, Koç University, Istanbul, Turkey

Baris Sarper Tezcan
Department of Computer Engineering, Koç University, Istanbul, Turkey

Aykut Erdem
Associate Professor of Computer Science, Koç University, Istanbul, Turkey
Computer Vision, Natural Language Processing, Machine Learning, Artificial Intelligence

Erkut Erdem
Professor of Computer Science, Hacettepe University
Computer Vision, Machine Learning