🤖 AI Summary
Insufficient physical consistency in video generation—particularly poor generalization to unseen motion conditions (e.g., varying velocities)—remains a key challenge. This paper proposes Phys-AR, a physics-aware autoregressive framework. First, it introduces the Diffusion Timestep Tokenizer (DDT), which enables reversible discrete modeling of visual attributes and explicit encoding of physical quantities. Second, it establishes a two-stage training paradigm: (i) supervised fine-tuning transfers symbolic knowledge to a large language model operating over the discrete visual tokens; (ii) reinforcement learning optimizes the model's physical reasoning via reward functions enforcing velocity and dynamical consistency. Phys-AR thus unifies symbolic reasoning, discrete visual tokenization, and physics-constrained RL within a single video generation pipeline. Experiments demonstrate significant improvements in dynamic physical plausibility across diverse unseen motion scenarios, outperforming state-of-the-art diffusion-based video models.
📝 Abstract
Despite recent progress in video generation, producing videos that adhere to physical laws remains a significant challenge. Traditional diffusion-based methods struggle to extrapolate to unseen physical conditions (e.g., velocity) due to their reliance on data-driven approximations. To address this, we propose to integrate symbolic reasoning and reinforcement learning to enforce physical consistency in video generation. We first introduce the Diffusion Timestep Tokenizer (DDT), which learns discrete, recursive visual tokens by recovering visual attributes lost during the diffusion process. These recursive visual tokens enable symbolic reasoning by a large language model. Building on this tokenizer, we propose the Phys-AR framework, which consists of two stages: the first stage uses supervised fine-tuning to transfer symbolic knowledge, while the second stage applies reinforcement learning to optimize the model's reasoning abilities through reward functions based on physical conditions. Our approach allows the model to dynamically adjust and improve the physical properties of generated videos, ensuring adherence to physical laws. Experimental results demonstrate that Phys-AR can generate videos that are physically consistent.
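To make the physics-based reward concrete, here is a minimal sketch of a velocity-consistency reward of the kind the abstract describes. This is an illustrative assumption, not the paper's actual reward: the function name, the choice of constant-velocity motion as the target, and the exponential scaling are all hypothetical. It assumes object centroids have already been extracted from the generated frames.

```python
import numpy as np

def velocity_consistency_reward(positions, dt=1.0, sigma=0.5):
    """Hypothetical reward: how closely a trajectory matches
    constant-velocity motion. `positions` is a (T, 2) array of
    per-frame object centroids from a generated video (assumed
    to be extracted by a separate tracking step)."""
    positions = np.asarray(positions, dtype=float)
    # Per-frame velocities via finite differences.
    velocities = np.diff(positions, axis=0) / dt
    # Deviation of each frame's velocity from the trajectory mean.
    deviation = np.linalg.norm(velocities - velocities.mean(axis=0), axis=1)
    # Map mean deviation into (0, 1]; 1.0 means perfectly uniform motion.
    return float(np.exp(-deviation.mean() / sigma))
```

In an RL fine-tuning loop, a reward like this would score each rollout (generated video) and the policy gradient would push the generator toward trajectories with uniform motion; rewards for other laws (e.g., constant acceleration under gravity) would follow the same pattern with a different target dynamic.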