Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning

📅 2025-04-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Insufficient physical consistency in video generation—particularly poor generalization to unseen motion conditions (e.g., varying velocities)—remains a key challenge. This paper proposes Phys-AR, a physics-aware autoregressive framework. First, it introduces the Diffusion Timestep Tokenizer (DDT), which learns discrete, recursive visual tokens by recovering visual attributes lost during the diffusion process, making the tokens amenable to symbolic reasoning by a large language model. Second, it establishes a two-stage training paradigm: (i) supervised fine-tuning transfers symbolic knowledge to the model; (ii) reinforcement learning optimizes the model's reasoning with reward functions that enforce physical conditions such as velocity consistency. Phys-AR unifies symbolic reasoning, discrete visual tokenization, and physics-constrained reinforcement learning in a single video generation pipeline. Experiments demonstrate improved physical plausibility across unseen motion scenarios, outperforming diffusion-based video models.

📝 Abstract
Despite recent progress in video generation, producing videos that adhere to physical laws remains a significant challenge. Traditional diffusion-based methods struggle to extrapolate to unseen physical conditions (e.g., velocity) due to their reliance on data-driven approximations. To address this, we propose to integrate symbolic reasoning and reinforcement learning to enforce physical consistency in video generation. We first introduce the Diffusion Timestep Tokenizer (DDT), which learns discrete, recursive visual tokens by recovering visual attributes lost during the diffusion process. These recursive visual tokens enable symbolic reasoning by a large language model. Building on this tokenizer, we propose the Phys-AR framework, which consists of two stages: the first stage uses supervised fine-tuning to transfer symbolic knowledge, while the second stage applies reinforcement learning to optimize the model's reasoning abilities through reward functions based on physical conditions. Our approach allows the model to dynamically adjust and improve the physical properties of generated videos, ensuring adherence to physical laws. Experimental results demonstrate that Phys-AR can generate videos that are physically consistent.
Problem

Research questions and friction points this paper is trying to address.

Enforcing physical consistency in video generation
Overcoming limitations of data-driven diffusion methods
Integrating symbolic reasoning with reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrate symbolic reasoning with reinforcement learning
Introduce Diffusion Timestep Tokenizer for visual tokens
Phys-AR framework optimizes reasoning via reinforcement learning
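The paper does not publish its reward functions, but the abstract's idea of rewarding "physical conditions" such as velocity consistency can be sketched minimally. Everything below (function name, centroid input, exponential shaping) is an illustrative assumption, not the authors' implementation:

```python
# Hypothetical sketch of a velocity-consistency reward for the RL stage.
# Assumes object centroids have already been extracted per frame; the paper's
# actual reward design is not specified in this summary.
import numpy as np

def velocity_consistency_reward(positions: np.ndarray, dt: float = 1.0) -> float:
    """Reward near-uniform motion of a tracked object across frames.

    positions: (T, 2) array of per-frame object centroids.
    Returns a reward in (0, 1]; 1.0 means perfectly constant velocity.
    """
    velocities = np.diff(positions, axis=0) / dt  # (T-1, 2) frame-to-frame velocities
    # Mean deviation from the average velocity; zero for uniform motion.
    deviation = np.linalg.norm(velocities - velocities.mean(axis=0), axis=1).mean()
    return float(np.exp(-deviation))  # smooth penalty for non-uniform motion

# An object moving at constant velocity earns the maximum reward.
uniform = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 1.0], [3.0, 1.5]])
print(velocity_consistency_reward(uniform))  # 1.0
```

A reward shaped this way is differentiable in the policy-gradient sense (it scores whole rollouts), which is consistent with the paper's description of optimizing reasoning via reinforcement learning rather than backpropagating through the renderer.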
Authors

Wang Lin — Zhejiang University (Computer Vision, Multi-Modal Learning, Video Understanding)
Liyu Jia — Nanyang Technological University
Wentao Hu — PhD student, The Hong Kong Polytechnic University (Large Language Model, Computer Vision)
Kaihang Pan — Zhejiang University (NLP, Vision-and-Language)
Zhongqi Yue — Nanyang Technological University
Wei Zhao — Huawei Singapore Research Center
Jingyuan Chen — Zhejiang University
Fei Wu — Zhejiang University
Hanwang Zhang — Nanyang Technological University