TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing video generation models achieve high visual fidelity but frequently produce physically implausible sequences (floating objects, teleportation, acausal deformations), and reliable quantitative benchmarks for assessing physical plausibility are lacking. Method: We propose TRAVL, a physics-aware training framework, and ImplausiBench, a novel evaluation benchmark built on a balanced dataset of physically plausible and implausible videos. TRAVL introduces trajectory-aware attention and fine-grained motion encoding to enhance spatiotemporal and causal reasoning in video-language models. Evaluation employs a dual paradigm of human annotation and LLM-based adjudication to mitigate linguistic bias. Contribution/Results: Our approach achieves human-level accuracy in detecting physical violations on ImplausiBench, marking the first systematic improvement in multimodal models' visual-temporal discrimination of physical plausibility. This work establishes a new standard for trustworthy evaluation of video generation systems.

📝 Abstract
Despite impressive visual fidelity, modern video generative models frequently produce sequences that violate intuitive physical laws, such as objects floating, teleporting, or morphing in ways that defy causality. While humans can easily detect such implausibilities, there remains no robust method for quantitatively assessing physical realism in video. In this work, we explore whether Video-Language Models (VLMs) can be trained to serve as reliable judges of physical plausibility. We find that existing VLMs struggle to identify physics violations, exposing fundamental limitations in their temporal and causal reasoning. To address this, we introduce TRAVL, a fine-tuning recipe that combines a balanced training dataset with a trajectory-aware attention module to improve motion encoding and discrimination in VLMs. To evaluate physical reasoning more rigorously, we propose ImplausiBench, a benchmark of 300 videos (150 real, 150 generated) that removes linguistic biases and isolates visual-temporal understanding. Performance is reported both with gold-standard human judgments and stricter LLM-as-judge metrics. Together, TRAVL and ImplausiBench offer a unified framework for probing and improving physical plausibility in multimodal models, shedding light on a challenging and underexplored aspect of visual-temporal understanding.
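The abstract reports performance under two regimes: gold-standard human judgments and a stricter LLM-as-judge protocol. A minimal sketch of how such dual scoring could work, assuming a binary plausibility label per video and a toy judge function (all names and the matching rule here are illustrative, not from the paper):

```python
# Hypothetical sketch of dual scoring: accuracy against human labels,
# and a stricter variant where an LLM judge must extract an explicit
# verdict from the model's free-text answer (unparseable answers count
# as wrong). This is an assumption about the protocol, not the paper's code.

def human_judged_accuracy(preds, gold):
    """Accuracy of binary predictions against human labels
    (True = physically plausible)."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

def llm_judged_accuracy(free_text_answers, gold, judge):
    """Stricter metric: `judge` maps each free-text answer to
    True, False, or None (no clear verdict); None is scored as wrong."""
    correct = 0
    for ans, g in zip(free_text_answers, gold):
        verdict = judge(ans)
        correct += (verdict is not None and verdict == g)
    return correct / len(gold)

def toy_judge(answer):
    """Toy stand-in for an LLM judge: accept only explicit verdicts."""
    a = answer.lower()
    if "implausible" in a:
        return False
    if "plausible" in a:
        return True
    return None

gold = [True, False, True, False]
answers = ["looks plausible", "this is implausible",
           "hard to say", "implausible motion"]
print(llm_judged_accuracy(answers, gold, toy_judge))  # 0.75
```

The stricter metric penalizes evasive answers, which is one way to separate genuine visual-temporal discrimination from linguistic hedging.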
Problem

Research questions and friction points this paper is trying to address.

Quantitatively assessing physical realism in generated videos
Improving Video-Language Models' temporal and causal reasoning
Creating a benchmark for physical plausibility that removes linguistic biases
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes VLMs on a balanced dataset of plausible and implausible videos
Adds a trajectory-aware attention module to improve motion encoding
Introduces the ImplausiBench benchmark (150 real, 150 generated videos) for rigorous evaluation
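The trajectory-aware attention idea can be sketched in a few lines: bias standard dot-product attention with a motion cue derived from object trajectories, so frames with sharp displacement changes receive more attention. The paper's actual module is not reproduced here; the bias form, shapes, and names below are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def trajectory_aware_attention(q, k, v, traj):
    """Hypothetical sketch, not the paper's implementation.
    q, k, v: (T, d) per-frame tokens; traj: (T, 2) object centroids.
    Adds an additive bias proportional to inter-frame displacement
    magnitude, drawing attention toward frames with large motion."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                         # (T, T)
    disp = np.linalg.norm(np.diff(traj, axis=0), axis=1)  # (T-1,) step sizes
    motion = np.concatenate([[0.0], disp])                # per-frame motion cue
    scores = scores + motion[None, :]                     # bias each key frame
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
T, d = 6, 8
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
traj = np.cumsum(rng.standard_normal((T, 2)), axis=0)     # toy trajectory
out = trajectory_aware_attention(q, k, v, traj)
print(out.shape)  # (6, 8)
```

The design choice to bias keys (columns) rather than queries means every frame's attention distribution is tilted toward high-motion frames, which is one plausible way to sharpen discrimination of implausible motion.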