Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning

📅 2025-04-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Insufficient physical consistency in video generation—particularly poor generalization to unseen motion conditions (e.g., varying velocities)—remains a key challenge. This paper proposes Phys-AR, a physics-aware autoregressive framework. First, it introduces the Diffusion Timestep Tokenizer (DDT), which learns discrete, recursive visual tokens by recovering visual attributes lost during the diffusion process, making the tokens amenable to symbolic reasoning by a large language model. Second, it establishes a two-stage training paradigm: (i) supervised fine-tuning transfers symbolic knowledge to the model; (ii) reinforcement learning optimizes the model's reasoning with reward functions that enforce physical conditions such as velocity consistency. Phys-AR unifies symbolic reasoning, discrete visual tokenization, and physics-constrained reinforcement learning in a single video generation pipeline. Experiments demonstrate improved physical plausibility across unseen motion scenarios, outperforming diffusion-based video models.

📝 Abstract
Despite recent progress in video generation, producing videos that adhere to physical laws remains a significant challenge. Traditional diffusion-based methods struggle to extrapolate to unseen physical conditions (e.g., velocity) due to their reliance on data-driven approximations. To address this, we propose to integrate symbolic reasoning and reinforcement learning to enforce physical consistency in video generation. We first introduce the Diffusion Timestep Tokenizer (DDT), which learns discrete, recursive visual tokens by recovering visual attributes lost during the diffusion process. These recursive visual tokens enable symbolic reasoning by a large language model. Building on this tokenizer, we propose the Phys-AR framework, which consists of two stages: the first stage uses supervised fine-tuning to transfer symbolic knowledge, while the second stage applies reinforcement learning to optimize the model's reasoning abilities through reward functions based on physical conditions. Our approach allows the model to dynamically adjust and improve the physical properties of generated videos, ensuring adherence to physical laws. Experimental results demonstrate that Phys-AR can generate videos that are physically consistent.
Problem

Research questions and friction points this paper is trying to address.

Enforcing physical consistency in video generation
Overcoming limitations of data-driven diffusion methods
Integrating symbolic reasoning with reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrate symbolic reasoning with reinforcement learning
Introduce Diffusion Timestep Tokenizer for visual tokens
Phys-AR framework optimizes reasoning via reinforcement learning
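The paper does not publish its reward functions, but the abstract's idea of rewarding "physical conditions" such as velocity consistency can be sketched minimally. Everything below (function name, centroid input, exponential shaping) is an illustrative assumption, not the authors' implementation:

```python
# Hypothetical sketch of a velocity-consistency reward for the RL stage.
# Assumes object centroids have already been extracted per frame; the paper's
# actual reward design is not specified in this summary.
import numpy as np

def velocity_consistency_reward(positions: np.ndarray, dt: float = 1.0) -> float:
    """Reward near-uniform motion of a tracked object across frames.

    positions: (T, 2) array of per-frame object centroids.
    Returns a reward in (0, 1]; 1.0 means perfectly constant velocity.
    """
    velocities = np.diff(positions, axis=0) / dt  # (T-1, 2) frame-to-frame velocities
    # Mean deviation from the average velocity; zero for uniform motion.
    deviation = np.linalg.norm(velocities - velocities.mean(axis=0), axis=1).mean()
    return float(np.exp(-deviation))  # smooth penalty for non-uniform motion

# An object moving at constant velocity earns the maximum reward.
uniform = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 1.0], [3.0, 1.5]])
print(velocity_consistency_reward(uniform))  # 1.0
```

A reward shaped this way is differentiable in the policy-gradient sense (it scores whole rollouts), which is consistent with the paper's description of optimizing reasoning via reinforcement learning rather than backpropagating through the renderer.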
Authors

Wang Lin — Zhejiang University (Computer Vision, Multi-Modal Learning, Video Understanding)
Liyu Jia — Nanyang Technological University
Wentao Hu — PhD student, The Hong Kong Polytechnic University (Large Language Model, Computer Vision)
Kaihang Pan — Zhejiang University (NLP, Vision-and-Language)
Zhongqi Yue — Nanyang Technological University
Wei Zhao — Huawei Singapore Research Center
Jingyuan Chen — Zhejiang University
Fei Wu — Zhejiang University
Hanwang Zhang — Nanyang Technological University