Do Joint Audio-Video Generation Models Understand Physics?

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates whether joint audio-video generation models genuinely understand physical laws or merely produce superficially plausible yet physically inconsistent outputs. It introduces AV-Phys Bench, a physics-aware benchmark covering three scenario types: steady state, event transitions, and environment transitions. It also adds anti-physics prompts (Anti-AV-Physics), which deliberately request physically inconsistent audio-video behavior, to probe model robustness. For automated assessment, AV-Phys Agent combines a multimodal large language model, deterministic acoustic measurement tools, and ReAct-style reasoning under a five-dimension evaluation protocol whose rankings closely align with human ratings. Experiments show that state-of-the-art models, including the top-performing Seedance 2.0, remain fragile in cross-modal physical consistency, particularly on dynamic transitions and Anti-AV-Physics prompts.

📝 Abstract

Joint audio-video generation models are rapidly approaching professional production quality, raising a central question: do they understand audio-visual physics, or merely generate plausible sounds and frames that violate real-world consistency? We introduce AV-Phys Bench, a benchmark for evaluating physical commonsense in joint audio-video generation. AV-Phys Bench tests models across three scene categories: Steady State, Event Transition, and Environment Transition. It covers physics-grounded subcategories drawn from real-world scenes, plus Anti-AV-Physics prompts that deliberately request physically inconsistent audio-video behavior. Each generation is evaluated along five dimensions: visual semantic adherence, audio semantic adherence, visual physical commonsense, audio physical commonsense, and cross-modal physical commonsense. Across three proprietary and four open-source models, we find that Seedance 2.0 performs best overall, but all models remain far from robust physical understanding. Performance drops sharply on event-driven and environment-driven transitions, and even strong proprietary systems collapse on Anti-AV-Physics prompts. We further introduce AV-Phys Agent, a ReAct-style evaluator that combines a multimodal language model with deterministic acoustic measurement tools, producing rankings that closely align with human ratings. Our results identify cross-modal physical consistency and transition-driven scene dynamics as key open challenges for joint audio-video generation.
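As an illustration of the evaluation setup described in the abstract, the following minimal sketch shows one way to hold the five per-generation dimension scores and to check how well an automated evaluator's ranking agrees with human ratings. It is not the paper's implementation: the class name `AVPhysScore`, the unweighted mean in `overall`, and the choice of Spearman rank correlation as the agreement measure are all assumptions made here for clarity, and the abstract does not specify the score scale or aggregation rule.

```python
from dataclasses import dataclass, fields
from statistics import mean


@dataclass(frozen=True)
class AVPhysScore:
    """Hypothetical container for the five dimensions named in the abstract."""
    visual_semantic: float       # visual semantic adherence
    audio_semantic: float        # audio semantic adherence
    visual_physics: float        # visual physical commonsense
    audio_physics: float         # audio physical commonsense
    cross_modal_physics: float   # cross-modal physical commonsense

    def overall(self) -> float:
        # Assumption: an unweighted mean; the paper may aggregate differently.
        return mean(getattr(self, f.name) for f in fields(self))


def ranks(values):
    """Return 1-based ranks (1 = smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Given one overall score per model from the automated evaluator and one from human raters, `spearman(agent_scores, human_scores)` near 1 would indicate that the two produce nearly the same model ranking. The function assumes each input contains at least two distinct values.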

Problem

Research questions and friction points this paper is trying to address.

audio-video generation

physical commonsense

cross-modal consistency

scene dynamics

AV-Phys

Innovation

Methods, ideas, or system contributions that make the work stand out.

AV-Phys Bench

joint audio-video generation

physical commonsense