Chain of Event-Centric Causal Thought for Physically Plausible Video Generation

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video generation models struggle to capture the causal dynamics inherent in physical phenomena, often producing results that lack physical plausibility. To address this limitation, this work proposes an event-centric causal chain-of-thought framework that formulates video generation as the synthesis of a sequence of causally linked events. The approach enforces physics-based constraints during causal reasoning, decomposes complex dynamics into discrete event units, and introduces a transition-aware cross-modal prompting mechanism to ensure visual coherence. Furthermore, the framework supports interactive keyframe editing for user-guided control. Evaluated on the PhyGenBench and VideoPhy benchmarks, the proposed method significantly outperforms current state-of-the-art models, demonstrating superior causal consistency and physical plausibility across diverse physical scenarios.
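The paper has not released code yet, so the following is a minimal, hypothetical sketch of how the causal event chain described in the summary might be represented. All names here (EventUnit, EventChain, physics_constraint) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an event-centric causal chain: each event unit carries
# a description, a governing physical formula, and a causal link to the
# preceding event. Illustrative only; the paper does not publish code.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EventUnit:
    """One elementary event in the causal chain."""
    description: str                     # natural-language event description
    physics_constraint: str              # governing formula, e.g. "v = g * t"
    cause: Optional["EventUnit"] = None  # preceding event this one depends on


@dataclass
class EventChain:
    """Ordered sequence of causally connected events for one prompt."""
    prompt: str
    events: List[EventUnit] = field(default_factory=list)

    def append(self, description: str, physics_constraint: str) -> EventUnit:
        # Each new event is causally linked to the most recent one.
        cause = self.events[-1] if self.events else None
        event = EventUnit(description, physics_constraint, cause)
        self.events.append(event)
        return event


# Example: a dropped ball decomposed into three causally ordered events.
chain = EventChain(prompt="A glass ball is dropped onto a wooden floor.")
chain.append("The ball is released and begins to fall.", "a = g")
chain.append("The ball accelerates downward until impact.", "v = g * t")
chain.append("The ball bounces back with reduced speed.", "v' = e * v, 0 < e < 1")
```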

📝 Abstract
Physically Plausible Video Generation (PPVG) has emerged as a promising avenue for modeling real-world physical phenomena. PPVG requires an understanding of commonsense knowledge, which remains a challenge for video diffusion models. Current approaches leverage the commonsense reasoning capability of large language models to embed physical concepts into prompts. However, generation models often render physical phenomena as a single moment defined by the prompt, due to the lack of conditioning mechanisms for modeling causal progression. In this paper, we view PPVG as generating a sequence of causally connected and dynamically evolving events. To realize this paradigm, we design two key modules: (1) Physics-driven Event Chain Reasoning. This module decomposes the physical phenomena described in prompts into multiple elementary event units, leveraging chain-of-thought reasoning. To mitigate causal ambiguity, we embed physical formulas as constraints to impose deterministic causal dependencies during reasoning. (2) Transition-aware Cross-modal Prompting (TCP). To maintain continuity between events, this module transforms causal event units into temporally aligned vision-language prompts. It summarizes discrete event descriptions to obtain causally consistent narratives, while progressively synthesizing visual keyframes of individual events through interactive editing. Comprehensive experiments on the PhyGenBench and VideoPhy benchmarks demonstrate that our framework achieves superior performance in generating physically plausible videos across diverse physical domains. Our code will be released soon.
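As a rough illustration of how the two modules in the abstract might fit together, here is a minimal sketch assuming a generic text LLM and an instruction-following image editor. The function names (call_llm, edit_keyframe) are hypothetical placeholders, not the authors' API.

```python
# Minimal sketch of the two-stage pipeline outlined in the abstract, assuming a
# generic LLM client (call_llm) and an instruction-following image editor
# (edit_keyframe). Both are hypothetical placeholders.
from typing import Callable, List, Tuple


def reason_event_chain(prompt: str, call_llm: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Physics-driven Event Chain Reasoning: chain-of-thought decomposition of
    the prompt into (event description, physics constraint) pairs."""
    cot_prompt = (
        "Decompose the physical phenomenon below into elementary events.\n"
        "For each event, give 'description | governing formula' on one line so\n"
        "that the causal dependency on the previous event is deterministic.\n"
        f"Phenomenon: {prompt}"
    )
    raw = call_llm(cot_prompt)
    # Keep only well-formed "description | formula" lines.
    return [tuple(line.split(" | ", 1)) for line in raw.splitlines() if " | " in line]


def transition_aware_prompting(events, call_llm, edit_keyframe):
    """Transition-aware Cross-modal Prompting (TCP): summarize the event chain
    into a causally consistent narrative and progressively edit keyframes."""
    narrative = call_llm(
        "Summarize these events into one causally consistent narrative:\n"
        + "\n".join(desc for desc, _ in events)
    )
    keyframes, previous = [], None
    for desc, formula in events:
        # Each keyframe is edited from the previous one so transitions stay coherent.
        previous = edit_keyframe(previous, instruction=f"{desc} (constraint: {formula})")
        keyframes.append(previous)
    return narrative, keyframes
```

In this reading, the narrative and the per-event keyframes would then condition a video generation model as temporally aligned vision-language prompts, with each keyframe edited from its predecessor so that inter-event transitions stay visually coherent.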
Problem

Research questions and friction points this paper is trying to address.

Physically Plausible Video Generation
Causal Reasoning
Event Chain
Video Diffusion Models
Commonsense Knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought Reasoning
Causal Event Chain
Physics-constrained Generation
Transition-aware Prompting
Physically Plausible Video Generation
Authors
Zixuan Wang
Sichuan University
Yixin Hu
Sichuan University
Haolan Wang
Sichuan University
Feng Chen
The University of Adelaide
Yan Liu
Professor of GIScience, The Chinese University of Hong Kong
Urban Analytics, GIS, Cellular Automata, Spatial Big Data, Quantitative Human Geography
Wen Li
Data Intelligence Group, UESTC
Machine Learning, Computer Vision, Domain Adaptation, Transfer Learning, Web Data
Yinjie Lei
Sichuan University